I hope the kernel patches are intelligent enough to ONLY apply the 'slowness feature' when absolutely necessary, and not simply slow EVERYONE down to be "fair" (or, on the part of devs, LAZY).
A fundamental design flaw in Intel's processor chips has forced a significant redesign of the Linux and Windows kernels to defang the chip-level security bug. Programmers are scrambling to overhaul the open-source Linux kernel's virtual memory system. Meanwhile, Microsoft is expected to publicly introduce the necessary changes …
Who knows. With Linux, of course, we can tell. With Windows, benchmarks may be our friend if MS are keeping schtum about the matter.
From a security point of view it would be better to leave things as they are if the hardware is not affected; better to be running mature code than to be running what seems like a major update put together in a big hurry.
The other issue here is what MS does regarding Windows 7. It would not surprise me in the least if they tried a clever/efficient patch for Windows 10 and a simpler (and slower) bodge-job for Windows 7/8. Still, I guess we'll find out soon enough. They'd also better make sure that the changes only apply to Intel machines. I don't want MS to arbitrarily slow down my AMD PC as a result of this - you'll note that AMD submitted a Linux patch to ensure their CPUs weren't caught up in this, will MS do the same?
“you'll note that AMD submitted a Linux patch to ensure their CPUs weren't caught up in this, will MS do the same?”
We’ll find out, I guess - the two paths MS can take being “do we release patches that only impact performance on Intel” or “are we working together so closely with Intel that the competition authorities should be involved”?
By the sound of things it looks like the bug allows non-admin users to see the guts of the OS. This means that, once published, hackers will be given details on how to take over the computer if you haven't upgraded (a security risk).
What is natural to fear is that the first-generation code (the patch) will be rushed out just to block these details. New, ad-hoc code means there may be stability issues (return of the blue screen?). If the OS developers run all their security test vectors, no old holes should open up. The question, I think, is whether there will be some new hole; I'm not sure anyone can know for sure - we can only demand that old bugs don't return.
What we can expect is for code to run slower. If I interpret the description well, the more threads you have the slower the code should run, i.e. it should impact Java code (scripts/interpreted code) more than C++ (compiled code). It'll expose individual coding styles and change coders' tactics for performance optimization.
If written intelligently, the kernel would detect the CPUID and supported instruction sets, and apply itself appropriately.
Ideally the kernel would flag all Intel chips, and once Intel fixes the problem, they'd release a new CPUID feature flag indicating the fix, which the kernel could also key off.
Kernel memory is mapped into user-mode processes to allow syscalls (requests to access hardware/kernel services) to execute without having to switch to another virtual address space. Each process runs in its own virtual address space, and it's quite expensive to switch between them, as it involves flushing the CPU's Translation Lookaside Buffer (TLB, used for quickly finding the physical location of virtual memory addresses) and a few other things.
This means that, with every single syscall, the CPU will need to switch virtual memory contexts, flushing that TLB and taking a relatively long amount of time. Access to memory pages which aren't cached in the TLB takes roughly 200 CPU cycles or so; access to a cached entry usually takes less than a single cycle.
So different tasks will suffer to different extents. If a process does most of the work itself, without requiring much from the kernel, then it won't suffer much of a performance hit. But if it makes lots of syscalls and does lots of uncached memory operations, then it's going to take a much larger hit.
That's what I make of it from understanding of it, which might not be 100% correct.
"So different tasks will suffer to different extents." But the hardware...?
"The downside to this separation is that it is relatively expensive, time wise, to keep switching between two separate address spaces for every system call and for every interrupt from the hardware."
So now we'll have a little tax charged for every interrupt - *every* *interrupt*. How much software do you run that doesn't use disk or network or any I/O?
This was not the financial micro-transaction future I was thinking of.
I think we need to return to PDP11, where you had an alternative set of memory management registers for program and supervisor (kernel) mode. When you issued the instruction to trigger the syscall, the processor switched the mode, triggering an automatic switch to the privileged-mode registers and mapping in the kernel to execute the syscall code.
This meant that it was not necessary to have part of the kernel mapped into every process.
IIRC, s370/XA and Motorola 68000 with MMU also had a similar feature. I do not know about the other UNIX reference platforms like VAX (the BSD 3.X & 4.X development platform) or WE320XX (AT&T's processor family used in the 3B family of systems - the primary UNIX development platform for AT&T UNIX for many years), but I would suspect that they had it as well.
I first came across the need to have at least one kernel page mapped into user processes on IBM Power processors back in AIX 3.1, where page 0 was reserved for this purpose. In early releases, it was possible to read the contents of page 0, but sometime around AIX 3.2.5, the page became unreadable (and actually triggered a segmentation violation if you tried to access it).
That's true. Having an address space change would have to disable speculative execution, because it would also have to try to predict which address space it would be in.
Actually, thinking about it, it still has to, because if the mapped page is protected from view, there still needs to be some mechanism to lift the protection to allow the speculative execution of the branch of code, before the decision is taken. But in theory, the results of the branch-not-taken should be discarded as soon as the decision is made, so that the information gathered could not be used. Maybe there is something in the combination of speculative execution and instruction re-ordering (not mentioned yet) which allows data to be extracted from later in the pipeline.
Maybe this is the problem, and if it is, it's probably a design flaw rather than a bug. Interesting.
The leak may not even be data, it could be as small as a timing change from the spec-ex path taking longer to fault, or not, allowing the attacker to probe the kernel space for not/valid pages - so defeating to an extent the kernel memory-map randomisation? Worse might be a spec-ex branch on spec-read secret data which affects timing similarly, without directly exposing the data itself.
Now that would be embarrassing to the Rust evangelists, who think their one trick pony solves every possible security issue! They'll suppress that idea with their usual fervor I bet. To me it's the height of vanity to think you can pre-define and therefore pre-solve every possible side channel and backdoor attack. The endless game of thief vs locksmith is...endless...
"the results of the branch-not-taken should be discarded"
From the AMD hint, the problem isn't speculative execution as such; it's fetch hardware that reads memory before checking permission, presumably changing cache state irreversibly. If speculative execution suppresses the privilege violation because the code path is discarded, there's no way to detect the event or take any remedial action like invalidating the TLB. However, invalidating the TLB would leak address layout information anyway!
The correct thing is to block the fetch ops completely, while still potentially raising an exception if that branch is taken. Better, raise an exception anyway - which appears to be the AMD approach. Intel look like they saved some transistors and maybe gained a tiny speed advantage without thinking it through.
I think we need to return to PDP11
While the elegance of the PDP-11 design is beyond dispute, it wouldn't scale to the performance of current systems. Notably, its memory management was very simple: 8 base and length registers for each execution mode. As soon as you go much beyond a 64k address space it becomes infeasible to have a full set of mapping registers on the CPU and you're forced down the TLB road.
What might well make sense is if it were possible to run the (key parts of the) kernel in a "protected real" mode (i.e. with no virtual address translation at all, but with some form of memory protection using keys or limit registers). If you don't have enough physical memory to contain a kernel, you're not going to make much progress anyway. And it's only one of many areas in which improving performance with caching of one kind or another leads to potential anomalies with specific execution patterns.
Not that any speculation (sorry!) of that kind helps with the current debacle - but it does illustrate how the growing complexity and unauditability of processor designs has largely been overlooked until recently, while we've been principally worried about software security.
"What might well make sense is if it were possible to run the (key parts of the) kernel in a "protected real" mode"
If the kernel is in the same address space as user code and using the same pipe then nothing changes. Either a separate CPU/cache for kernel code and without speculative execution, or scrap the whole model.
Basically no kernel code should be run before security has been validated, and that means stalls in kernel code execution.
"If the kernel is in the same address space as user code and using the same pipe then nothing changes. Either a separate CPU/cache for kernel code and without speculative execution or scrap the whole model."
Even in user space page permissions exist. There's nothing intrinsic to a flat, shared address space that stops a CPU enforcing all permissions at the thread/page/exe page level, all the way down to prefetch and cache access. Separate cache/memory systems per ring is a high price to pay to replace access control logic in the memory system. Dumping cache state is an even higher price to pay for not having that logic.
Not really. "key parts of the" kernel are interrupt service functions... and they need to be able to address the entire memory.
Real mode sucks.
What is really needed is better architecture.
One set of registers for each of interrupt, kernel, supervisor, and user modes, at a MINIMUM. Then add separate cache for each level - though not necessarily all being the same size.
The problem with the PDP11 method is that the kernel code has to do fiddly read-userland-from-kernel operations to get the call's parameters and data. If the kernel is paged in and running in kernel memory at &o004010 and the caller has its parameters in userland at &o004010, the kernel can't read them by just reading from &o004010, as that would read kernel memory; it has to use Move From Previous Instruction space (MFPI) instructions to copy the parameters to local (kernel) space, and Move To Previous Instruction space (MTPI) instructions to copy stuff back to userland, such as loaded data.
That is true, but for volume data moves was mitigated by DMA from disk directly into memory-mapped buffers in the process address space, using the UNIBUS address mapping registers, which allowed raw DMA transfers to addresses outside of the kernel address space.
Of course, not all PDP11 models had the UNIBUS (or, I presume a similar QBUS) feature, but pretty much everything after an 11/34 would have. I had an unusual 11/34e that also had 22-bit addressing, which made it much more useful.
Only for the parameters.
Once past that, the kernel could use a single page table entry to map to whatever user memory was needed.
For the Intel processors... this bug just about kills any microkernel: already slow, it now becomes 20% slower.
Microsoft's hybrid microkernel is going to have fits with this.
Back in the 1970s I was asked to make a PDP11 system go faster. It was running DEC's real time OS, viz RSX11. The critical part of the program needed to make a lot of OS calls. I arranged for that region of the program to be mapped into kernel space (using a 'connect to interrupt' facility) and got my speedup. The cost was a section of high-risk code.
There is a reason why interrupts used to return control to the kernel. It may be that a disk transfer has finished, thus allowing some more important program to resume. In more general terms, the context has changed and the system should adjust, particularly if several programs are running. The PDP11 could respond to interrupts within microseconds, perhaps to capture volatile data from fleeting registers on special hardware; but it could be a long time returning to the point of interrupt.
The reason why PDP-11 could respond so quickly to interrupts was this facility to switch address spaces without having to save the register context. On other architectures, in order to take an interrupt, the first thing that you need to do is save at least some of the address registers, and then restore them after you've handled the interrupt.
IIRC, the PDP-11 not only had duplicate address mapping registers, but also had a duplicate set of some of the GP registers, so you had to do pretty much nothing to preserve the process context that's just been interrupted. This is what made the interrupt handling very fast.
The time to return from the interrupt was entirely down to the path length of the code handling the interrupt. The actual return mechanism was as quick as the calling mechanism. There were unofficial guidelines about how long your interrupt code should take, which I believe were conditioned by the tick length for the OS. If you took too long, you would miss a clock-tick, which would result in the system clock running slow.
In addition, I also believe that there were a small number of zero page vectors that were left unused by either UNIX or RSX11/m (the version of RSX I was most familiar with) that allowed you to add your own interrupt handlers for certain events.
"I think we need to return to PDP11, where you had an alternative set of memory management registers for program and supervisor (kernel) mode."
There are already plenty of non x86 derivatives out there that don't have this bug, all that's required is folks to make the move. :)
Would be nice if vendors updated their benchmark results in the light of a 30% performance hit, so we can get an apples-apples comparison against processors that don't suffer from this particular fault.
I think that you should say were plenty of non-x86 processors out there. There really aren't any more, with just AMD (which is an x86 derivative, but may not be affected), Power, zSeries, ARM, and the tail end of SPARC, and I suppose Itanium (just) being around.
You could also say, I suppose, that there is a MIPS processor around still, but you'd have to buy it from the Chinese.
A lot of other architectures never made the switch to 64 bit (although Alpha was always 64 bit). Architectures we've lost include Alpha, PA-RISC, VAX, all Mainframe apart from zSeries, DG Eclipse, Motorola 68000 and 88000, Nat. Semi. 32XXX, Western Electric 32XXX, various Honeywell, Burroughs and Prime systems, and various Japanese processors from NEC, Hitachi and Fujitsu.
This is largely the cost of wanting cheap, commoditized hardware. You end up with one dominant supplier, and suffer if they get something (or even a lot of things) wrong.
"I think that you should say were plenty of non-x86 processors out there."
There are still plenty out there, not all of them will be a viable alternative for your application...
"There really aren't any more, with just AMD (which is an x86 derivative, but may not be affected),"
In my view AMD share the same problem as Intel: the x86 ISA (64 bits, extensions, warts and all) is simply too complex to test properly. It's a scalability limit in the design space - and this isn't a new problem - it goes back decades. We are seeing bugs span multiple steppings AND generations of product line as a matter of routine. The x86 vendors are physically unable to sell us a fully functional chip even if we pay top dollar for it.
As I see it, as customers, we have no alternative but to go to other ISAs over the long run - simply to get a working chip without the "feature-itis" imposed by 30+ years worth of workarounds.
It seems to be an architectural bug from what I'm seeing - probably related to the need for speed being the highest priority for marketing while security takes a back seat. We never worried about "security" in the old days of processor design, we were far more worried about incorrect access causing a crash and that took priority - with the result that modern security issues were mostly nonexistent.
Biting the hand that feeds IT © 1998–2019