Software? Or maybe hardware.
So instead of just fetching the data, you're going to intercept the request (in software?) and suspend the code that's waiting for the data until it can be fetched more efficiently.
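For concreteness, here's a minimal sketch of the kind of scheme I understand is being proposed: each piece of work yields its data request instead of fetching immediately, a scheduler collects all the outstanding requests, satisfies them with one batched fetch, and only then resumes the suspended code. All the names here (worker, run_batched, the dict standing in for a data store) are illustrative, not any real API.

```python
# Hypothetical sketch of "catch the request, defer the waiting code,
# batch the fetch". Not a real library -- just the shape of the idea.

def worker(key):
    # Instead of fetching now, yield the key; the scheduler resumes us
    # once the batched fetch has produced the value.
    value = yield key
    return value * 2  # stand-in for whatever computation needs the data

def run_batched(keys, store):
    gens = [worker(k) for k in keys]
    # Advance each worker to its first request; all are now suspended.
    pending = [(g, next(g)) for g in gens]
    # One "efficient" batched fetch for every outstanding request.
    fetched = {k: store[k] for _, k in pending}
    results = []
    for g, k in pending:
        try:
            g.send(fetched[k])          # resume the suspended worker
        except StopIteration as done:
            results.append(done.value)  # worker finished with a result
    return results
```

Note that every suspend/resume pair here is exactly the kind of extra switching overhead I'm questioning below.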
Just how many more very expensive context switches will this generate? And where are the other threads that can be dispatched once all of them are parked waiting for an 'efficient data fetch'? And how will that affect the latency of individual threads?
I'm sure there are some highly threaded applications with unpredictable data flow where this could be a benefit. But the brute-force codes that make up most HPC applications process data in predictable ways — especially Fortran code, where the standard dictates how data is stored in arrays — so this is likely to be completely unneeded extra code that can only slow total throughput.
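To illustrate why these access patterns are so predictable: Fortran mandates column-major array storage, so the idiomatic inner loop over the row index is a unit-stride sweep through memory — exactly the pattern a hardware prefetcher detects trivially. A sketch, simulating column-major storage with a flat Python list (the layout formula is the standard one; the function itself is just illustrative):

```python
# A Fortran-style column-major array: element (i, j) of an
# n_rows x n_cols array lives at flat index i + j*n_rows.
# Looping i innermost within each column j walks memory with
# unit stride -- no clever software fetch-deferral required.

def column_sums(flat, n_rows, n_cols):
    sums = []
    for j in range(n_cols):           # one column at a time
        s = 0
        for i in range(n_rows):       # unit-stride: i + j*n_rows
            s += flat[i + j * n_rows]
        sums.append(s)
    return sums
```

A sequential walk like this is the common case in those brute-force codes, which is precisely why deferral machinery buys nothing there.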
I think I'll let the hardware cache prefetchers provide all the speedup most real 'Big Data' requires.