Re: @DougS: @M Gale @AC 17th Aug 17:52
The way that the 40 bit addressing works on a 32 bit ARM is by the use of segment registers allowing you to offset the virtual address space for a process into more than 4GB of memory. It's not new technology, and has been a cornerstone of the instruction sets of processors since the mid-1970s.
The first architecture on which I saw address extension done was the 16-bit PDP-11, which had its address space stretched from 16 to 18 and then to 22 bits in different models. I do not know the ins and outs of Intel's PAE, but I suspect that it is something similar. The Power processor family also does something similar for its virtual address space, although it does not need it to stretch the address space. Most other modern processors (those designed in the last 30 years) do something similar to support virtual addressing (but not necessarily for address extension).
The basic method involves breaking up the virtual address space into chunks called segments, and then adding a real-address offset to the base address (normally designated as a page number) in the address decoding hardware. This allows a process to see a linear address range scattered over a larger, possibly non-contiguous, address space. The impact to the code-writer is ZERO. There is nothing that needs to be done for a user-land process to cope with this technique. All multi-tasking OSes have done this for what seems like forever.
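To make that concrete, here is a minimal Python sketch of the translation step. The split of a 32-bit virtual address into a 4-bit segment number and a 28-bit offset, the `translate` function and the base addresses are all assumptions picked for illustration, not the layout of any particular MMU.

```python
OFFSET_BITS = 28                       # assumed: low 28 bits are the offset within a segment
OFFSET_MASK = (1 << OFFSET_BITS) - 1   # assumed: top 4 bits of a 32-bit address pick the segment


def translate(segment_bases, vaddr):
    """Map a 32-bit virtual address onto a (possibly wider) real address."""
    segment = vaddr >> OFFSET_BITS      # which segment register to consult
    offset = vaddr & OFFSET_MASK        # position inside that segment
    return segment_bases[segment] + offset


# Hypothetical process with 16 segments; the OS has placed segment 1
# above the 4GB line, which a bare 32-bit address could never reach.
bases = [0] * 16
bases[0] = 0x00_1000_0000
bases[1] = 0x12_0000_0000

print(hex(translate(bases, 0x0000_0040)))   # 0x10000040
print(hex(translate(bases, 0x1000_0040)))   # 0x1200000040
```

The process only ever sees the 32-bit addresses on the left; where they land in real memory is entirely down to what the OS loaded into `bases`.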
It does make the OS have to do a bit more work every time you start or context switch a process (it has in some way to manipulate the segment registers - it's different in different architectures), but it's well understood what needs to be done, and it has long been a standard technique. And it is perfectly possible to write the OS itself to work in a virtual linear address space (an example was the 32-bit AIX kernel running on 64-bit RS64 and later Power processors), where the OS is in control of manipulating the segment registers for itself, as well as for all of the other processes. The 32-bit kernel could manage 64 bit processes, with more than 4GB of real memory on the system, which, when I explained it, used to puzzle people for whom the 32-bit to 64-bit migration in Windows seemed like a huge deal.
The major limitation to this is that although the system may have more memory than the size of an address, it can only be used in chunks determined by the width of an address. So for example, an individual process in an ARMv7 with 40 bit LPAE can only address 4GB of the address space, even though the architecture will support 1TB of real memory. But of course, you can have more than one process, allowing you to utilise all the available memory. And as a side effect, you have the ability to share pages across multiple processes for in-core shared libraries, shared memory segments, and memory-mapped files.
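The arithmetic behind those figures is easy to check. This short Python snippet just works out the sizes from the 32 and 40 bit widths quoted above; the variable names are mine.

```python
VIRTUAL_BITS = 32    # width of an address as seen by one process
PHYSICAL_BITS = 40   # width of a real address with LPAE

GIB = 1 << 30

per_process = 1 << VIRTUAL_BITS     # bytes one process can address
physical = 1 << PHYSICAL_BITS       # bytes of real memory the architecture can address

print(per_process // GIB)           # 4    (GB per process)
print(physical // GIB)              # 1024 (GB, i.e. 1TB of real memory)
print(physical // per_process)      # 256  (full 4GB address spaces needed to cover 1TB)
```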
This is not even a problem for the OS, because all the writers have to do is to keep at least one segment free, and then manipulate the segment register to allow the OS to see any of the real memory. Of course, it can't see all of memory at the same time, but it can get access to any of the memory.
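A rough Python sketch of that windowing trick follows. The 256MB segment size, the `SpareSegment` class and the dict standing in for real memory are all assumptions for illustration; a real kernel would be reloading a hardware register, not a Python attribute.

```python
SEGMENT_SIZE = 1 << 28   # assumed 256MB segments


class SpareSegment:
    """Models the one segment the OS keeps free for peeking at real memory."""

    def __init__(self, real_memory):
        self.real_memory = real_memory   # dict of real address -> byte, a sparse stand-in for RAM
        self.base = 0                    # real base address currently loaded in the register

    def read(self, real_addr):
        # Reload the register only when the target lies outside the current window.
        if not (self.base <= real_addr < self.base + SEGMENT_SIZE):
            self.base = real_addr & ~(SEGMENT_SIZE - 1)
        offset = real_addr - self.base   # always smaller than SEGMENT_SIZE
        return self.real_memory.get(self.base + offset, 0)


ram = {0x00_0000_1000: 0xAA, 0xF0_1234_5678: 0xBB}
window = SpareSegment(ram)

print(hex(window.read(0x00_0000_1000)), hex(window.base))   # 0xaa 0x0
print(hex(window.read(0xF0_1234_5678)), hex(window.base))   # 0xbb 0xf010000000
```

Only one window's worth of memory is visible at any moment, but by moving the base the OS can reach any real address.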
The issue of whether 64 bit addresses will add any more inefficiency over 32 bit addresses is all to do with whether half-word aligned loads and stores can be done natively. On some architectures, performing a half-word operation (for example a 32 bit load or store on a 64 bit machine) requires loading an entire 64 bit word, and then masking and shifting the required part of the word to obtain the correct half word value. This may be microcoded, but in some architectures had to be done by the program itself. This is slower, and on some architectures, the decision about whether to 'waste' 32 bits of memory versus the performance costs of half-word operations was a difficult decision.
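For anyone who has not had to do it by hand, this is roughly what the mask-and-shift dance looks like, sketched in Python. The function names are hypothetical, and which half counts as "upper" in memory depends on the endianness of the machine, which I am ignoring here.

```python
WORD_BITS = 64
HALF_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1
HALF_MASK = (1 << HALF_BITS) - 1


def load_half(word, upper):
    """Extract one 32-bit half from a full 64-bit word that has already been loaded."""
    shift = HALF_BITS if upper else 0
    return (word >> shift) & HALF_MASK


def store_half(word, upper, value):
    """Return the 64-bit word with one half replaced: a read-modify-write."""
    shift = HALF_BITS if upper else 0
    cleared = word & ~(HALF_MASK << shift) & WORD_MASK   # blank out the target half
    return cleared | ((value & HALF_MASK) << shift)      # drop the new value in


word = 0x1122334455667788
print(hex(load_half(word, upper=False)))               # 0x55667788
print(hex(load_half(word, upper=True)))                # 0x11223344
print(hex(store_half(word, True, 0xDEADBEEF)))         # 0xdeadbeef55667788
```

Note that the store needs the old word to be loaded first, which is where the extra cost comes from when the hardware cannot do it natively.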
I would have to research the ARMv7 and ARMv8 ISA to know whether this is the case, although I would welcome someone in the know to provide an answer.
Whether floating point load or store operations can be done in units other than the word-length differs from architecture to architecture. For example in Power6, it was necessary to load a floating point value through a GP register (or two in the case of a double-word FP value), and then move it to a floating point register. For Power6+ and Power7, it is possible to directly load from memory to a floating-point register, allowing you to do double-word FP loads (128 bits) in a single load operation. This decouples the FP processor from the natural word size of the CPU.