Re: What else can a move to ARM bring ?
x86/amd64 already have them in the form of AVX et al., and ARM has NEON. They're vaguely similar, but not the same.
The reason for offloading to the GPU is that a GPU is a massively parallel array of floating-point processors, and most of this "AI" stuff is simply massively parallel low-precision computation...
Bringing that work into the CPU often makes it slower, because a shared address space means you have to keep the caches coherent. The "copy input data to coprocessor memory, run, copy result data back" approach is much faster, as there generally isn't very much input or output data compared with the number of intermediate values.
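
To put rough numbers on that last point, here's a back-of-envelope sketch in Python. The layer sizes are made up for illustration (a small fully connected network), and it assumes the weights already sit in coprocessor memory, so only the input and the result cross the bus. Treat it as a way to see the ratio, not as a measurement of any real hardware.

```python
# Hypothetical layer widths, chosen only for illustration (not from any real model).
LAYERS = [784, 4096, 4096, 4096, 10]


def transfer_vs_compute(layers, batch=1):
    """Count values copied over the bus, intermediate values, and multiply-accumulates.

    Assumes weights are already resident on the coprocessor, so only the
    first layer's input and the last layer's output are copied.
    """
    io_values = batch * (layers[0] + layers[-1])      # copied in + copied out
    intermediates = batch * sum(layers[1:-1])         # never leave the coprocessor
    macs = batch * sum(a * b for a, b in zip(layers, layers[1:]))
    return io_values, intermediates, macs


if __name__ == "__main__":
    io_values, intermediates, macs = transfer_vs_compute(LAYERS)
    print(f"values copied in/out:  {io_values}")
    print(f"intermediate values:   {intermediates}")
    print(f"multiply-accumulates:  {macs}")
    print(f"MACs per copied value: {macs / io_values:.0f}")
```

With sizes like these, the two copies are a small fraction of the work done in between. None of the intermediate values need to be visible to the CPU, so none of them need coherence traffic.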