In my experience, hiding memory complexity is a mixed blessing. To get maximum performance I need to understand the architecture the code runs on so that I can avoid costly operations, as others have said. With the complexity hidden, I will get my code up and running sooner, but it will not necessarily be as fast as hand-tuned code. Where I do see a use is in getting more-or-less machine-independent code up and running quickly. If it does not run fast enough, I can then tune it to get the most out of the hardware, reducing costly copying from one part of memory to another and so avoiding bandwidth and latency problems.
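As a rough illustration of why the copying matters, here is a toy cost model in Python. It is only a sketch: the latency and bandwidth figures are assumptions I picked for the example, not measurements of any real hardware, and `transfer_time` is a hypothetical helper, not part of any library. It treats each transfer as a fixed latency plus the time to push the bytes through the link.

```python
# Toy cost model for moving data between CPU and GPU memory.
# All numbers below are illustrative assumptions, not measurements.

LATENCY_S = 10e-6          # assumed fixed overhead per transfer (10 microseconds)
BANDWIDTH_B_PER_S = 10e9   # assumed sustained link bandwidth (10 GB/s)


def transfer_time(num_bytes: int, num_transfers: int = 1) -> float:
    """Estimated seconds to move num_bytes, split over num_transfers copies.

    Each copy pays the fixed latency once; the bytes themselves cost
    num_bytes / bandwidth regardless of how they are split.
    """
    return num_transfers * LATENCY_S + num_bytes / BANDWIDTH_B_PER_S


total_bytes = 64 * 1024 * 1024  # 64 MiB of data to move

one_big_copy = transfer_time(total_bytes, num_transfers=1)
many_small_copies = transfer_time(total_bytes, num_transfers=10_000)

print(f"one copy:      {one_big_copy * 1e3:.2f} ms")
print(f"10,000 copies: {many_small_copies * 1e3:.2f} ms")
```

Under these assumed numbers the latency term dominates once the data is split into many small transfers, which is the kind of thing hand-tuning tries to remove by batching copies or keeping data where it is used.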
Nothing beats better bandwidth and lower latency, of course, but that is something the hardware guys must do for us software guys (and I know they are working on it). If the GPU and CPU truly share memory (i.e. the memory is physically unified), many difficulties will drop away, but for now that is something to dream about.