AFAIK, pretty much all video codecs assume the video to be compressed is 2D, and intermediate frames only encode the difference between one frame and the next. Both are reasonable simplifications if you want something that's fast to encode or decode, but they mean a lot of exploitable structure is ignored. Another trait common to most codecs is that self-similarity within a frame is mostly ignored; the focus goes to motion estimation as a way to compress inter-frame differences in the common cases (eg, panning, objects moving within the frame).
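To make the motion-estimation point concrete, here's a toy sketch of the per-block search real codecs do: find the offset into the previous frame that best predicts a block of the next frame, so only a motion vector plus a small residual needs encoding. (The block size, search range and SAD metric are illustrative choices, not taken from any particular codec.)

```python
import numpy as np

rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (32, 32)).astype(np.int16)

# "Pan" the scene two pixels right: the next frame is mostly the old
# content shifted, which is exactly the case motion estimation exploits.
nxt = np.roll(prev, shift=2, axis=1)

block = nxt[8:16, 8:16]  # an 8x8 block of the new frame to predict

def best_offset(block, ref, y, x, search=4):
    """Exhaustive search for the motion vector with the smallest residual."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = ref[y + dy:y + dy + 8, x + dx:x + dx + 8]
            sad = int(np.abs(block - cand).sum())  # sum of absolute differences
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best

sad, dy, dx = best_offset(block, prev, 8, 8)
print(dy, dx, sad)  # the found vector points 2 pixels left; residual is 0
```

For a pure pan the residual is exactly zero, so the block costs almost nothing to encode; the interesting question in the rest of this post is how much structure beyond this simple 2D case is left on the table.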
If you think about algorithms that turn images (or objects in them) into 3D approximations, the job is a lot easier with video from a camera attached to a vehicle (or carried) than with an unordered collection of stills of the same target from different vantage points. The camera's motion between consecutive frames is easier to reason about: it's smoother, and over a sequence of images it's easier to divide the scene into static areas (modified only by the changing viewpoint) and transient ones (moving objects passing through the frame).
If the cost of encoding isn't so much of a problem, you could apply interferometric analysis to a sequence of images. For the relatively fixed objects, you could build up a 3D approximation and generate a pixmap to skin them. Taking a sequence of images like this might also help sharpen the image, cutting down on noise and so improving compression: you can't sharpen a single image much, but you can with multi-sampling over time or from slightly different viewpoints. To make this work, you'd have to adapt to things like focus and motion blur, detecting them on the way in (and tagging affected regions per frame) and re-adding them on the way out.
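A minimal sketch of why the multi-sampling step is plausible: averaging N aligned noisy exposures of the same static scene cuts the noise by roughly sqrt(N). (A real pipeline would first have to align the frames and mask out blurred or moving regions; here the frames are assumed pre-aligned.)

```python
import numpy as np

rng = np.random.default_rng(1)
scene = rng.uniform(0, 255, (64, 64))          # the "true" static scene
# 16 noisy, already-aligned observations of the same scene
noisy = [scene + rng.normal(0, 10, scene.shape) for _ in range(16)]

single_err = np.abs(noisy[0] - scene).mean()
stacked_err = np.abs(np.mean(noisy, axis=0) - scene).mean()

# Averaging 16 samples should cut the noise by about sqrt(16) = 4x.
print(stacked_err < single_err / 2)
```

Less noise means fewer high-frequency residuals for the entropy coder to spend bits on, which is where the compression win would come from.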
Videos also have various spatial self-similarities, besides the time-based ones. The most easily exploitable assumption is that self-similar blocks neighbour each other, and that's how most codecs work (mostly by compressing the palette across neighbouring blocks, AFAIK). If the codec instead represented areas as simple 3D meshes with pixmaps, it could maintain a cache of these over an extended period, and explicitly compress the mesh+pixmap objects based on their self-similarity. A transient object moving across a surface wouldn't necessarily evict the temporarily-occluded data from the cache, so once the object has passed, it should be cheap for the decoder to repair the "damage". Likewise with things like fast cuts, where the data for one bunch of frames can be re-used when the camera comes back a few seconds later, rather than starting with a new key frame each time.
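The cache idea can be sketched in a few lines: a decoder-side store of surface patches keyed by content, so a patch hidden by a passing object (or dropped at a fast cut) can later be restored by reference instead of being re-sent. (A real codec would key on position plus some descriptor rather than a raw content hash; everything here is illustrative.)

```python
import hashlib

cache = {}

def encode_block(block_bytes):
    """Return a 'ref' token if the patch is cached, else send it raw."""
    key = hashlib.sha1(block_bytes).hexdigest()[:8]
    if key in cache:
        return ("ref", key)             # cheap: just point at cached data
    cache[key] = block_bytes
    return ("raw", key, block_bytes)    # expensive: send the pixels

wall = b"brick-texture-patch"
car = b"transient-car-pixels"

r1 = encode_block(wall)  # first sight of the wall: sent raw
r2 = encode_block(car)   # a car occludes the wall: sent raw
r3 = encode_block(wall)  # car has passed: the repair is just a reference
print(r1[0], r2[0], r3[0])
```

The third call is the payoff: the "damage" left by the occluding object costs one small reference to repair, and the same mechanism covers cutting back to a previously seen shot.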
If encoding cost is no object, then you can also try to reverse-engineer lighting information from the original stream. Once the contribution from lighting is removed from each area, the forest of mesh+pixmap cache objects compresses much more efficiently. Or, you can use the lighting to refine your idea of what a surface is, tessellating its original mesh and throwing out a lot of the pixmap data (which takes up a lot of space relative to a mesh + lighting model).
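A toy version of the lighting-removal step: the same albedo patch lit two different ways looks different to a pixel-level comparison, but dividing out a shading estimate makes the two patches identical again, and hence trivially shareable in the cache. (Estimating the shading from a real stream is the hard part; here it's simply given.)

```python
import numpy as np

# One underlying surface colour (albedo), seen under two lighting levels.
albedo = np.array([[0.2, 0.8], [0.5, 0.3]])
shade_a, shade_b = 1.0, 0.4

patch_a = albedo * shade_a   # the patch in bright light
patch_b = albedo * shade_b   # the same patch in shadow

lit_match = np.allclose(patch_a, patch_b)                       # False
delit_match = np.allclose(patch_a / shade_a, patch_b / shade_b)  # True
print(lit_match, delit_match)
```

With lighting divided out, both observations collapse to a single cache entry plus two cheap scalar shading values, which is the compression win the paragraph above is pointing at.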
Going from (effectively) a simple block-based compressor to one with meshes, textures and lighting does, of course, make things a lot more costly for the decoder. Still, if there aren't too many light sources or too much reflected light, I could imagine a next-gen GPU managing to handle this. (Too much reflected light turns it into a generic ray tracer, which has very poor locality of memory reference.)
This sort of scheme could handle fairly static objects, but there's still the problem of compressing deformable objects like faces, or the silhouettes of transient objects that aren't spatially modelled. Some completely different approach is probably warranted there.
This all sounds pretty pie-in-the-sky, but squeezing an extra 30-50% out of existing approaches probably won't be easy either, IMO.