Good stuff
Good summary of what we've known in the games industry for about five years (you wouldn't know it to listen to our "luminaries" like Carmack and Sweeney, though; they're still praying for salvation based on their old faith - which is hardly surprising given their legacy codebases, but still).
Effective large-scale parallel programs (e.g. games) seem to split in a recursive way. The top layer may be serial, then there may be a parallel layer underneath, then each part of the parallel layer may be serial again, with each serial part implemented by a parallel algorithm. Thus can we divvy up hundreds of cores within a single program. An example might be:
Serial: Calculate each output frame of the game in turn
  Parallel frame: Run simulation and rendering in parallel
    Serial simulation: First calculate physics, then run AI
      Parallel physics: Break the sparse matrix into chunks across many processors
        Serial chunks: Run a standard single-threaded algorithm to evaluate a chunk
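To make the layering concrete, here is a rough Python sketch of the breakdown above. Everything in it is made up for illustration: the function names (`run`, `frame`, `simulate`, `physics`, `solve_chunk`, `ai`, `render`), the toy arithmetic standing in for real physics and AI, and the pool sizes. In standard CPython, threads won't give real speedup on CPU-bound work, so read this for the shape of the recursion, not as a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_chunk(chunk):
    # Serial chunk: plain single-threaded code (a sum of squares stands in
    # for a real solver; this is an assumption, not the actual algorithm).
    return sum(x * x for x in chunk)

def physics(state, chunk_pool, n_chunks=4):
    # Parallel physics: split the data into chunks and farm them out.
    chunks = [state[i::n_chunks] for i in range(n_chunks)]
    return sum(chunk_pool.map(solve_chunk, chunks))

def ai(physics_result):
    # Placeholder AI step that consumes the physics result.
    return physics_result % 7

def simulate(state, chunk_pool):
    # Serial simulation: physics first, then AI.
    p = physics(state, chunk_pool)
    return p, ai(p)

def render(state):
    # Placeholder renderer: pretends to draw one item per element.
    return len(state)

def frame(state, frame_pool, chunk_pool):
    # Parallel frame: simulation and rendering run side by side.
    sim = frame_pool.submit(simulate, state, chunk_pool)
    ren = frame_pool.submit(render, state)
    return sim.result(), ren.result()

def run(frames):
    # Serial top layer: frames are produced one after another.
    # Separate pools per layer, so a blocked upper-layer task cannot
    # starve the lower layer of workers.
    with ThreadPoolExecutor(2) as frame_pool, ThreadPoolExecutor(4) as chunk_pool:
        return [frame(s, frame_pool, chunk_pool) for s in frames]
```

Calling `run([[1, 2, 3, 4]])` walks all five layers once. Note that the domain-specific bits (`ai`, `render`, `solve_chunk`) show up at several depths, not in one tidy layer.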
Unfortunately, the "productivity programmers" don't have a clear layer that belongs to them. Domain-specific knowledge is inserted at all levels.
We want to avoid having the domain guys think too hard. Much of the customization (insertion of domain-specific knowledge) comes down to data structures and "shaders": small pieces of functional code which do raw data processing. But they will sometimes need to understand the parallel context this stuff runs in. The aim is to make this "sometimes" as infrequent as possible, but that in itself requires the domain people to change their paradigm; e.g. they can no longer just call out to other objects at will, because that doesn't scale. Their job still gets harder. I don't see how to avoid this.
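A minimal sketch of what I mean by a "shader" in this sense, again in Python and again with invented names (`Particle`, `integrate`, `step`). The domain programmer writes only `integrate`, which sees its inputs and nothing else; the framework side (`step`) owns the parallel context. The cost is exactly the paradigm change: `integrate` cannot reach out and poke some other object.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class Particle:
    # Immutable record: a "shader" returns a new value instead of mutating.
    pos: float
    vel: float

def integrate(p, dt):
    # Domain code: depends only on its arguments, touches no other object.
    return Particle(p.pos + p.vel * dt, p.vel)

def step(particles, dt, workers=4):
    # Framework code: decides how the work is spread across workers.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(lambda p: integrate(p, dt), particles))
```

Because `integrate` has no side effects, `step` is free to run it in any order on any number of workers without the domain programmer needing to know.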
Ironically, the lowest-level performance programmers still need to write the best single-threaded code for a given task. They are the only ones to escape the paradigm shift, despite being the best people to take it on board.