Re: Wrong way round
As a very informal rule of thumb (but like all rules of thumb, still useful - and I only put this here partly to start a pissing contest with some of the tools here :P):
x86-64 carries roughly a "40% crap tax" - that is, you pay about 40% of some measure to account for x86-specific baggage. For example, ~40% extra power for the nasty decoding problems, or ~40% lower throughput if you skimp on them (which is why the Atoms sucked so badly: they still needed active cooling and were easily out-performed by a 2012 phone - though that's a long time ago, I don't remember when I got the netbook and joined the revolution of "so much battery life, so portable - the tiny keyboard is unusable, and it's unusably crap").
Baked into that trade-off is roughly 40% more power usage.
I stress it's a rule of thumb. Sandy Bridge (although the idea existed before as a trace cache) and possibly Nehalem (the one before) use an "LSD" - loop stream detector (it was botched on Skylake, discovered by the OCaml people (naturally) and de-activated in a microcode patch). The idea is that it stores small-enough loops in their entirety, allowing you to switch off the decoders (a huge saving!), so tight loops run near perfectly. That sounds pretty weird, right? If the decoders could keep it fed, why bother? Power savings.
Lastly, why this tax?
x86-64 instructions can be anywhere from 1 to 15 bytes long. So if I hand you some chunk of bytes, you cannot mark out where instructions begin and end without decoding at least their lengths - and the use of prefix bytes and all kinds of other crap means this "pre-decoding", if you will, is basically a full decode. Once you've done that step the register fields are there to be syphoned off and so on; it's a really, really big penalty.
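To see why variable lengths hurt, here's a toy sketch (my own made-up encoding, NOT real x86 - the opcode table is entirely hypothetical): the only way to find instruction boundaries is to decode serially from a known start, because each length depends on the bytes you just decoded.

```python
# Toy variable-length ISA: instruction length depends on the leading byte.
# These opcodes and lengths are invented for illustration only.
LENGTHS = {0x01: 1, 0x02: 3, 0x03: 2, 0x0F: 5}

def instruction_starts(code: bytes) -> list[int]:
    """Walk from byte 0; each step's length is only known after decoding."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTHS[code[pc]]  # can't know this without inspecting the byte
    return starts

code = bytes([0x02, 0, 0, 0x01, 0x03, 0, 0x0F, 0, 0, 0, 0])
print(instruction_starts(code))  # [0, 3, 4, 6] - found only by serial decode
```

Note there's no way to jump into the middle of `code` and know whether you've landed on an instruction boundary - you'd have to decode everything before it. Hardware decoders face the same serial dependence, which is exactly the tax.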
RISC architectures traditionally (you kinda have to, to count as RISC, to be honest) have nice uniform instruction lengths, e.g. all 4 bytes. VLIW and EPIC are offshoots of the RISC idea with much longer bundles (32 bytes is at the lower end for VLIW) - and alignment requirements, i.e. the address of the first byte of the instruction must be divisible by, say, 4 in this example.
So given any chunk of bytes, I can say "if an instruction starts here, its address ends in 00 (in binary)" - with 8-byte instructions it would end in 000 - job done.
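That boundary check is literally a one-liner - a quick sketch of the "address ends in 00" test from above:

```python
# Fixed-width instructions: any address tells you instantly whether an
# instruction could start there - no decoding needed at all.
def is_instruction_boundary(addr: int, width: int = 4) -> bool:
    return addr % width == 0  # for width 4, binary address ends in 00

print(is_instruction_boundary(0x1000))           # True
print(is_instruction_boundary(0x1002))           # False - mid-instruction
print(is_instruction_boundary(0x2008, width=8))  # True - ends in 000
```

Compare that with the serial walk a variable-length ISA forces on you: this version is a single AND gate's worth of logic per address, computable in parallel for every byte of a fetch block.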
I *believe*, though it's not my area (see above), that ARM chips can switch modes: they have a short 2-byte instruction form (Thumb) that covers loads of common cases, and the chip must be switched between modes. Another scheme I've heard of requires fetches to be 4-byte aligned, holding either 2x 2-byte short instructions or one full-size 4-byte one (and I've heard of something similar with 8-byte bundles accepting 2x 4-byte for common cases).
But you get the gist: very easy. For x86(-64) this affects everything. Branching, for example - "where's the start of the next instruction?" - nope, you can't just add 4. This needs to be known for branch history too. Decoding is also an absolute nightmare. This is why RISC emulators run reasonably well (for pure emulation, now) compared to emulating x86-64 - yes, brute force lets us run some stuff like this practically, but x86-64 is way, way, way more difficult.
What happened to RISC, you might ask, if it's so good? Well, there were like 4 or 5 RISC arches and suddenly there were zero; they all thought Itanium would be a good idea (enjoy looking that one up). No one said "but hey guys, doesn't that force an NP-hard scheduling problem onto the compiler?" or "doesn't that mean you can't be sure code you wrote ages ago will run even reasonably well on later chips, because of architecture changes?"
But anyway... it was a bit before my time, but only PowerPC was left standing, sort of.
Itanium was supposed to be Intel's 64-bit thing - that's why it's called IA-64 (Intel Architecture, 64-bit); IA-32 is x86, and AMD64 is what we sometimes call "x86-64", because AMD realised "hey, backwards compatibility FTW!"
Now, you asked about power - speaking purely of the CPU, and not the Larrabee-derived Xeon Phi accelerators (they're like... gimped, Atom-esque cores with AVX-512 bolted on: crap CPUs but decent vectorisers. They sit in a rare niche where even today a GPU is too not-general-purpose for the job, so it needs the CPU parts).
40% savings on power - or you could have ~40% extra transistors to spend on actual core logic instead of paying the tax, you get the idea. That 40% doesn't include the cache, BTW - purely the core logic. (And strictly speaking "uncore" means the opposite: the parts of the die that *aren't* the core, like the memory controller and interconnect.)
That's a big deal.
Furthermore, the era of "wait 6 months and it'll be faster" (hardware getting faster on its own) is long over. We're now deep into "scaling out", and some workloads are what I'll loosely call probabilistic (bad term on my part - NOT "probabilistic algorithms", something else, see the next paragraph). Yes, there's a lot of work not geared for this, but there's a lot that is! 40% is nearly half - you could almost run another core with that.
By "algorithms that are probabilistic" I meant, for example, that a certain search engine beginning with "G" (at least I imagine it's common - it's easy) actually sends out 3 copies of every web query it gets, shows the results from the first one to come back, and ignores the rest. This hugely cuts down on tail latency.
It's a hell of a saving and as I whined about above, I've wanted to see it for a long time.