Re: Most Surprised
Spectacularly misinformed post...
The vast majority of high-performance ARM processors - including Apple's - use all the features you're bitching about. Branch prediction is basically a necessity for any high-performance design: high clocks require a long pipeline, and without a branch predictor the pipeline stalls on every branch until it resolves, inserting a bubble. That's a major performance problem, and one that a high-accuracy branch predictor solves. As for your comment about real-time applications, a worst-case time is not impossible to predict; microarchitectures have documented branch mispredict recovery times, usually on the order of 10-20 cycles. Branch prediction, by the way, is basically no less deterministic than the caches you seem to have no problem advocating for: if a load hits cache, it might take 5 cycles to complete; if it misses cache and goes to main memory, it might take 150.
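To make the "worst case is still predictable" point concrete, here's a back-of-the-envelope sketch using the ballpark latencies above. All the numbers and names are illustrative assumptions, not figures for any real core:

```python
# Illustrative per-iteration timing bound for a loop body containing
# one branch and one load. All latencies are made-up ballpark figures.
BASE_CYCLES = 10          # straight-line work in the loop body (assumed)
MISPREDICT_PENALTY = 15   # branch mispredict recovery, ~10-20 cyc
L1_HIT = 5                # load that hits cache
MEM_MISS = 150            # load that goes all the way to main memory

def iteration_cycles(branch_mispredicted: bool, load_misses: bool) -> int:
    """Cycles for one iteration, given which bad events occurred."""
    cycles = BASE_CYCLES
    if branch_mispredicted:
        cycles += MISPREDICT_PENALTY
    cycles += MEM_MISS if load_misses else L1_HIT
    return cycles

best = iteration_cycles(False, False)   # everything behaves: 15 cycles
worst = iteration_cycles(True, True)    # mispredict + miss: 175 cycles
```

Note that the cache miss widens the best-to-worst spread far more than the mispredict penalty does - which is the point: if cached cores are acceptable for your real-time budget, a branch predictor's bounded recovery time certainly is.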
Decode/microcode: Decode doesn't mean what you think it means; it's an essential part of any CPU design, RISC or CISC, since decode answers questions like "which functional unit does this op go to?" and "which operands does it use?" Microcode was mentioned nowhere. I suspect you're confusing the use of micro-ops - i.e., internal basic operations in a long fixed-length format - with microcode, i.e., lookup of certain complex operations in a microcode ROM at decode time. The first does not imply the second. Most fast processors have a complex decoder for operations that are more efficient to break into 2-3 uops, and that path never touches microcode. The M1 core may or may not have microcode - since the decode slides don't mention a ucode engine, and it didn't come up in the presentation (I was there), I suspect it does not. Even in ARM there are ops that are beneficial to crack into multiple uops - reg+reg addressing, for instance (one uop for the address calculation, one for the load/store). There are even more examples in other RISC ISAs - take a look at the manual for a modern PowerPC core and check out the number of ops that are cracked or microcoded!
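A toy sketch of the cracking-without-microcode idea, using the reg+reg load example above. The mnemonics, uop names, and the `utmp` scratch register are all made up for illustration - the point is just that the split is done by ordinary decode logic, not by a ROM lookup:

```python
# Toy decoder: a reg+reg-addressed load is cracked into two simple uops
# (address generation, then the load itself). No microcode ROM involved.
# Instruction syntax and uop tuples are invented for this example.
def decode(insn: str) -> list:
    op, _, rest = insn.partition(" ")
    args = [a.strip(" []") for a in rest.split(",")]
    if op == "LDR" and len(args) == 3:           # LDR xD, [xN, xM]
        dest, base, index = args
        return [
            ("ADD", "utmp", base, index),        # uop 1: compute address
            ("LOAD", dest, "utmp"),              # uop 2: plain load
        ]
    return [(op, *args)]                         # most ops map 1:1

uops = decode("LDR x0, [x1, x2]")                # cracks into 2 uops
```

Microcode would kick in only for genuinely complex operations whose uop sequences are looked up from a ROM; a fixed 2-3 uop split like this is handled directly in the decoder.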
As for out-of-order execution, it's an extremely effective technique for exposing memory-level parallelism (by, for instance, continuing to run code during a cache miss) at surprisingly little additional overhead. It also takes the number of architectural registers out of the equation by renaming them onto a larger set of physical registers - as a result, in an OoO machine, architectural register count is almost never a hard limit: false dependencies are eliminated, and instructions run when their operands become available, not when some unrelated earlier instruction finishes so its scratch register can be reused. This can even improve power efficiency at a given performance target, because an in-order machine generally has to clock higher to reach the same performance.
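Register renaming is easy to sketch. In this minimal (and entirely illustrative) model, every write to an architectural register is assigned a fresh physical register, so reusing a scratch register no longer serializes independent instruction chains:

```python
# Minimal register-rename sketch. Instructions are (dest, src1, src2)
# tuples of architectural register names; all names are illustrative.
from itertools import count

def rename(instructions):
    phys = count()          # endless supply of fresh physical registers
    table = {}              # architectural reg -> current physical reg
    out = []
    for dest, *srcs in instructions:
        srcs = [table.get(s, s) for s in srcs]   # read current mappings
        table[dest] = f"p{next(phys)}"           # fresh phys reg per write
        out.append((table[dest], *srcs))
    return out

# r1 is reused as a scratch register by two independent chains:
prog = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
        ("r1", "r6", "r7"), ("r8", "r1", "r9")]
renamed = rename(prog)
```

After renaming, the two writes to r1 land in different physical registers, so the second chain no longer has to wait for the first - the write-after-write and write-after-read hazards on r1 simply disappear.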
Again, Apple does these things too - they have an aggressively out-of-order machine with branch prediction and register renaming (in fact, more aggressively out-of-order than the M1 in the article!) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4 has a nice summary of Apple's current uarch.
Please do more research before making this kind of post...