CUDA playing catch up
The allocation abstraction will actually benefit performance on hybrid (CPU / GPU) systems, since main and device memory are shared between the two processor groups and there is no need for the copy at all. On hybrid-chip systems (tablets / mobile phones) this would be a huge benefit.
Another poster is correct about supercomputers and other "big" systems with separate CPU / GPU memories needing hand optimization. My impression (as an outsider to this field) is that CUDA is used primarily in scientific applications that run on big systems rather than on mobile devices. Lay coders optimizing their C code would mostly be concerned with OpenCL, which can run on everything, not just Nvidia GPUs.
That said, the new generation of C++ GPU programming APIs (C++ AMP, etc.) actually abstracts away much of the problem of explicit memory management by providing C++ wrapper classes over the memory arrays. I think Nvidia's Thrust API (a C++ wrapper over CUDA) does something similar, although I have no experience with that API.
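From what I've read of Thrust, the wrapper containers look roughly like this; a sketch only, since I haven't used the API myself (it needs the CUDA toolkit / nvcc to build, so take the details with a grain of salt):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::host_vector<float> h(1024, 1.0f);

    // Assigning a host_vector to a device_vector performs the
    // host-to-device copy internally -- no explicit cudaMemcpy call.
    thrust::device_vector<float> a = h;
    thrust::device_vector<float> b = h;
    thrust::device_vector<float> c(1024);

    // Element-wise c = a + b, executed on the GPU.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());

    // Assigning back to a host_vector does the device-to-host copy.
    thrust::host_vector<float> result = c;
    return 0;
}
```

The point is that the copies still happen, but they are hidden inside the container's constructors and assignment operators rather than sprinkled through the application code.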
My opinion, having used the C++ GPU programming APIs on some toy projects, is that one should be getting out of the business (as a developer) of writing explicit memory copies. The C++-level abstractions are much more elegant to work with and, in the long term, provide better opportunities to optimize applications for the hardware the program happens to be running on.