Parallel code easily transferred to a very different architecture?
Let me guess: they can easily parallelize adding two arrays together, or do matrix-vector operations optimally. This covers some very important bases, but some parallel code needs to be rethought, rather than just recompiled, when porting to a very different architecture.
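To be concrete about what the "easy" case looks like, here is a minimal sketch of my own (not taken from any particular tool) of array addition in C with OpenMP. One pragma and the runtime does the rest: no communication, no data-layout decisions, no load balancing worth the name.

```c
#include <stddef.h>

/* Embarrassingly parallel: every iteration is independent, so a
 * single pragma splits the loop across cores. Compile with -fopenmp. */
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```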
We have code that does not use matrix-vector operations, and it works best (40x speedup on 64 cores) on fairly coarse-grained, shared-memory parallel architectures. We still have not managed to produce a distributed-memory version (working on it), and we are struggling with an OpenCL version for GPUs (working on it with GPU gurus).
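For contrast, here is a hedged sketch (my own illustration, not our actual code) of why even a trivial 1D stencil has to be restructured for distributed memory: the arithmetic is unchanged, but the data distribution, ghost cells, and neighbour communication all have to be designed in, and none of that exists in the shared-memory version. The chunk size and stencil are assumptions for the example.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int local_n = 1000;                    /* assumed chunk size */
    double *u = calloc(local_n + 2, sizeof *u);  /* +2 ghost cells     */
    double *v = calloc(local_n + 2, sizeof *v);

    /* Edge ranks talk to MPI_PROC_NULL, which turns those
     * sends/receives into no-ops. */
    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* Halo exchange before each sweep: this step simply does not
     * exist in the shared-memory version of the same loop. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* The local sweep itself is the same arithmetic as before. */
    for (int i = 1; i <= local_n; i++)
        v[i] = 0.5 * (u[i - 1] + u[i + 1]);

    free(u);
    free(v);
    MPI_Finalize();
    return 0;
}
```

And that is the simple, regular case. Irregular, coarse-grained code like ours is harder still, which is why the distributed-memory port is taking real design work rather than a recompile.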
Every time I have heard people claim to have tools that take all the hard work out of parallel programming, they show me examples like "add these 10^9 numbers to another bunch of 10^9 numbers". These tools can indeed take a lot of the hard work out of parallel computing, but not all of it, not by a long way.