You could look at a big chunk of the grid schedulers (Condor, Platform, Mesos) and say "quelle difference?", but there are some differences:
* designed to place work close to the data: your code can ask for specific machines and racks, and the scheduler tries to place it there; if you say "best effort", it will place the work as close as it can network-wise. This lets us run Hadoop without high-cost SANs, which makes storing petabytes of data affordable.
* designed for algorithms that have to handle failure. MapReduce does this by splitting up the work, retrying failed tasks, recognising slow machines and re-issuing their work, and even blacklisting the slow boxes. Those slow ones are the enemy: stragglers slow everything down. Apache Tez can do checkpoints, then roll back to them. Streaming algorithms need to replay the streams, which is a different problem.
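
To make those two points concrete, here is a rough sketch in Python. It is not Hadoop's scheduler code: `Node`, `place`, `find_stragglers`, the `strict` flag and the 0.5 slowdown threshold are all made-up names and numbers for illustration. It shows the general shape of the ideas: try the requested machine, then the requested rack, then (if the request is best effort) anywhere, while skipping blacklisted boxes; and flag nodes whose progress lags well behind the median as stragglers.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Node:
    # Hypothetical model of a cluster machine: its hostname and its rack.
    name: str
    rack: str


def place(preferred_nodes, preferred_racks, free_nodes,
          blacklist=frozenset(), strict=False):
    """Pick a free node: node-local first, then rack-local, then anywhere.

    `strict=True` is this sketch's stand-in for "not best effort":
    the work stays unplaced rather than running off-rack.
    """
    candidates = [n for n in free_nodes if n.name not in blacklist]
    for node in candidates:
        if node.name in preferred_nodes:
            return node, "node-local"
    for node in candidates:
        if node.rack in preferred_racks:
            return node, "rack-local"
    if strict or not candidates:
        return None, "unplaced"
    return candidates[0], "off-rack"


def find_stragglers(progress_by_node, slowdown=0.5):
    """Return nodes whose progress is below `slowdown` times the median.

    The threshold is an arbitrary assumption, not a Hadoop default.
    """
    if not progress_by_node:
        return set()
    typical = median(progress_by_node.values())
    return {name for name, progress in progress_by_node.items()
            if progress < slowdown * typical}


if __name__ == "__main__":
    cluster = [Node("n1", "r1"), Node("n2", "r1"), Node("n3", "r2")]
    slow = find_stragglers({"n1": 0.9, "n2": 0.8, "n3": 0.2})
    # Re-issue the slow node's work elsewhere, preferring its rack.
    print(place({"n3"}, {"r2"}, cluster, blacklist=slow))
```

In this sketch the straggler's work ends up off-rack because the only other machines are on a different rack; with `strict=True` it would stay unplaced instead. That trade-off between waiting for locality and running somewhere worse is the one the "best effort" option is making.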
If you do go back to the 1980s-era massively parallel designs, some of the architectures do look familiar. It's the scale that's different: a scale that makes failures a fact of life that everything has to handle, rather than a disaster that needs someone to be paged and the on-site HDD replacements (for which you pay a lot) to be wheeled out. Even so, there are lessons there that we should learn from. After all, aren't VMs and their hypervisors just descendants of VM/360, which had billing in from the outset too?