So the Sale of Goods Act in England and Scotland states that if you can prove there is a manufacturing flaw in a purchased item, you have 4 or 5 years to claim a fix.
Should be easy enough to prove in this case....
It's not just Intel facing a legal firestorm over its handling of the Spectre and Meltdown CPU design flaws – AMD is also staring at a growing stack of class-action complaints related to the chip vulnerabilities. At least four separate lawsuits have now been filed against the California-based processor slinger, alleging …
However, they are still functional as CPUs, so unless you have (or can demonstrate you had) a particular security need, that argument may not fly.
Although you cannot mitigate the flaw itself, how you use your machines will influence how likely an effective exploit is.
I am not excusing AMD if they did indeed know about the flaw, but I too smell a lot of people looking for "free money"
Indeed, that seems to be all it is.
I wonder how many of the legal parasites covering the suits get irritated when their porn viewing glitches due to a processor not quite being able to handle the load. I bet if someone told them that soon they'll have to be watching it at 5fps so their processor is secure from some incredibly hard to exploit hardware fault they might not be so self righteous about it.
This whole issue has arisen through having to play catch-up with bloated OSes and libraries, VMs running on top of VMs, bloated web-stack nonsense, and a host of other cycle-sapping technologies that placed convenience above efficiency in the assumption that hardware would keep pace. Well, it did, at a price. In a way this might in the long run be a good thing: slower hardware means the dead-wood cut-n-paste-from-stackoverflow coders will get the boot and people who know what they're doing will be hired instead, plus MS and Apple will have to redirect their efforts to things more important than more childish emojis and piss-poor GUI redesigns.
Intel and AMD chips, and even the vastly dissimilar ARM chips, have the same sort of flaw due to similarly crappy implementations of branch prediction and the like.
It's almost like people moved between companies and took their knowledge with them.
Or maybe their engineers all learnt from the same masters. It'd be sort of funny if the root cause was really some class at MIT or elsewhere.
For a lot of it the fix is "find some way to stop the CPU predicting past this point", though. That doesn't resolve the underlying issue; it just works around it by selectively disabling the feature, at an often-severe cost to performance.
Meltdown and Spectre reveal information through a side effect: how fast an access fails depends on whether the data has been brought into cache by a speculative fetch. It is not branching that is the issue so much as the RAM-to-cache fetches, which are a large part of the point of speculative execution.
The variable speed of memory operations, depending on whether or not data is in the cache, was not considered to be a security issue, just a performance issue. That core assumption is the biggest problem.
I wish I understood Spectre better so I could make a snarky comment about it, but from what I can tell, it's hard to see how branch prediction COULD have been leveraged "that way" in the first place, ya know?
So maybe it was just "great minds thought alike" only they all missed something, too...
I'm actually surprised that it took this long. A hit of 0.99% on AMD stock is like what...20 cents? They were holding the info so the software vendors could fix the flaws.
Now let's get technical. The branch predictor is a 2-bit state machine, which holds values 0 to 3. If the branch for a particular instruction is taken, the counter counts up; at 3 it stays at 3. If the predictor misses a prediction, it counts down to 2. If it misses again, it counts down to 1 and starts predicting that the branch is not taken. There is one for each conditional jump instruction. For i386/amd64, that's quite a few instructions. Research has shown that the 2-bit branch predictor is the best compromise between reaction time and accuracy.
These newer CPUs have fine-grained multithreading with resource allocation systems where each instruction has a profile of what hardware resources it needs to execute. The different areas of the CPU operate more or less independently of each other, like an assembly line. Each instruction is allocated the hardware resources it needs to execute. When it reaches the end of the pipeline, the result is committed. If the result is needed by the next instruction, that instruction has to wait until the result is available or is forwarded back to earlier stages of the pipeline.
I can see how this could have happened. A toy 16-bit RISC pipelined CPU simulated in Verilog took two people over a month to design from the ground up. Yes, I was in that project myself; it was for an advanced computer architecture and organization class. And yes, it also had a 2-bit branch predictor. With billions of transistors, you have many people working on the design in different groups who do not necessarily talk to each other.
And yes, they do teach this stuff in the classroom.
All very good, but not the nub.
The problem is that speculative execution can load from forbidden memory locations, and a violation exception is only raised if the results are committed. Loading, processing, and then throwing away the data still leaks its contents through clever timing analysis.
Fixing it is hard. Lessons need to be learnt from the cipher/encryption community, that complete separation between userland and kernel is required for true security.
"Fixing it is hard. Lessons need to be learnt from the cipher/encryption community, that complete separation between userland and kernel is required for true security."
Short of running kernel and userland code on completely separate CPUs (not even separate cores, since those share cache), feel free to tell us how you'd have complete separation on the same physical hardware with multiple concurrent code branches in the pipelines.
It's been a while since I had this level of detail, but you're slightly off in a couple of points in an otherwise very informative post.
1) The two-bit predictor is generally the level 1 predictor. At least some AMD products have had a more sophisticated predictor that kicks in after too many misses.
2) The prediction tables are not one-to-one with the addresses, but use part of it (and may hash). Aliasing is a thing.
3) The "speculative" part of speculative execution refers to the fact that results are sent back into the system before they are finally committed. If the execution units only worked on committed results, we would have no speculation and no Spectre...
For those of you who want to understand the full range of Spectre-class faults: they come in two basic classes.
1) Exploits that use various techniques, including speculative execution with branch-cache seeding, to determine the contents of in-process memory which the process is allowed to access. Note that this access is permitted by the processor's design model, so AMD have a strong defence. The exploit is basically used to break the security model of sandboxes, which protect the scripts inside them from each other by preventing the execution of instructions that access each other's memory. Speculative execution breaks this.
2) Exploits that use various techniques to exploit the basically shared nature of cached information. This applies to the in-processor cached state (branch predictor, return cache, level 1 cache) and the across-chip cached state (higher levels of cache). This cached information leaks, and that is not preventable without making all cached information private to the process that owns it.
The second class is a serious, unfixable problem, though it is not obvious how it can be exploited. Just think about it: the only real fix is no shared cache at all, where every cache line is private to the process that fetched it, and the fact that it is present is private as well. That basically means a private cache for every core, no dual-thread cores unless both threads of execution belong to the same process, and a full cache flush on every context switch.
Do you want a Spectre-vulnerable CPU, or the performance of a machine from 20 years ago?
There are actually ways around this. If you put a prebuffer on your caches and only move the data into the cache proper when the underlying instructions commit, it would work, assuming the prebuffer access times are identical to the main cache's and it is only accessible to the same speculative branch. Expensive. Very expensive. But not as bad as flushing the caches on every process switch.
I am posting this here so it can't be patented later by patent trolls.
Other ways around spectre/cache attacks:
1. Partition the cache into a set of equally large subcaches. Fetches that can only be satisfied from an inaccessible (security-wise) cache are artificially slowed down to the speed of a miss even though it is fed from another fast subcache (to ensure consistency). Do this separately from the memory, branch and other prediction caches (reusing the subcache selection index may be done or not at the discretion of the chip designer). Notice that a privileged context (such as a kernel) can fully access the cached data of its subordinates and at full speed, but must not leak or get influenced by those caches (for example, cached branch predictions are not shared, but privileged access to subordinate memory will use the subcache of that subordinate context).
2. Assign a separate subcache to each combination of protection "ring" (x86 has about 7 rings) and process/vm at that level. For example, the ring 0 kernel of each VM is a combination. Each ring 3 process of each vm is a combination. Some kernels (such as the Windows kernel) have subdomains (called sessions) within the kernel. Document which CPU registers combine to form a "long but exact" identifier of each such combination. Use standard cache logic to choose which recently active combination identifiers get an active subcache and which ones will need to throw out another combination to run.
3. Keep the access permission bits of the architectural ("visible") memory manager data cached along with the address offsets, and do not speculatively execute memory accesses that are banned by the then-known permission bits. (Not doing this is the Meltdown flaw in Intel and some ARM CPUs.) Instead, stall the speculation or speculate that a fault is taken.
4. As someone else mentioned, speculative fetches get stored in a microcache/queue from which they are committed to the real cache (with eviction of previous line contents) only when the fetch instruction is committed by the core. This implies additional signalling between core and caches. It needs to permeate all the way down the cache hierarchy, but is asynchronous, with cache commit trailing core commit by several clock cycles.
5. Shared execution units (hyperthreading, core pairs etc.) are shared between different context combinations on a strictly even schedule, so neither process can sense how much the other combination is using a particular unit (such as the floating point barrel multiplier). Only when two hyperthreads (etc.) are running in the same context combination can the resources be divided based on actual need/activity. Again, the chip can optionally take advantage of the short combination identifier that chooses a subcache of one/all of the caches.
6. Seriously consider going back to predictable execution pipelines, like in old-school RISC or the original Pentium. Rules like: if a simple arithmetic operation writes to register X, any instruction reading the result will not start running until 4 clocks later, but instructions that don't look at X can run in the meantime and can be automatically promoted to the head of the queue from within the next 8 instructions in a straight line. Or: "conditional branches are predicted taken if they go to an earlier address (loops), otherwise not, except if you use conditional branch opcode Y, which predicts the opposite, or if the direction is predicted from committed register values." Or: "indirect branches are only predicted if loaded from a committed register value and a memory address marked read-only in the cached memory descriptors." Such rules require fewer transistors and should still offer relatively high performance, though not as high as the sophisticated predictions that caused this mess.
7. As a stop-gap patch/modification of existing microarchitectures, only run the speculative speedups in the least privileged mode (ring 3, user), and flush the state when returning to a different user-mode process than the last one. This takes advantage of many modern systems having already moved most of their CPU load to ring 3 processes, and of the fact that most kernel-to-user transitions will be returns to the most recently running process.
8. Encourage OSes that don't already have it to introduce the ability to set up slave processes without access to regular system calls (like Linux seccomp). This will allow browsers etc. to use those CPU-enforced sandboxes instead of manual sandboxing via interpreters.
Biting the hand that feeds IT © 1998–2019