About the RISC stuff
Each instruction usually takes 4 cycles end to end, but this is the "classic RISC pipeline", where 4 instructions are on the go at once, so really it's "1 per cycle" with few exceptions.
You can go a bit further if you interleave load/store with reg-to-reg instructions (or the compiler provides them interleaved) and do 2 at a time. Unfortunately (and I'd love to, believe me) I have no experience optimising for such hardware or looking at disassembly for it.
But yeah, "classic RISC", doubled up with the load/store interleaving, can sustain a throughput of 2 instructions per cycle.
It might be a 5-stage pipeline, but the throughput stays the same. And this includes the workarounds you can do for hazards and bubbles.
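To put rough numbers on that, here's a minimal sketch of an idealised in-order pipeline. The names (pipeline_cycles, ipc, stall_cycles) are mine, not from any textbook or tool, and it assumes every stage takes exactly one cycle and that you know the total bubble count up front. It's a back-of-envelope model, not a simulator.

```python
import math


def pipeline_cycles(n_instructions, stages=5, issue_width=1, stall_cycles=0):
    """Cycles for an idealised in-order pipeline to finish n_instructions.

    Hypothetical model: every stage takes one cycle, and stall_cycles is the
    total number of bubbles you failed to hide (0 = perfect scheduling).
    """
    if n_instructions <= 0:
        return 0
    # Instructions enter in groups of issue_width, one group per cycle.
    groups = math.ceil(n_instructions / issue_width)
    # The first group needs `stages` cycles to drain; each later group
    # finishes one cycle after the previous one, plus any bubbles.
    return stages + (groups - 1) + stall_cycles


def ipc(n_instructions, **kwargs):
    """Average instructions per cycle under the same model."""
    return n_instructions / pipeline_cycles(n_instructions, **kwargs)


if __name__ == "__main__":
    for width in (1, 2):
        print(width, pipeline_cycles(1000, issue_width=width),
              round(ipc(1000, issue_width=width), 3))
```

For 1000 instructions on 5 stages that works out to 1004 cycles single-issue and 504 cycles dual-issue, so the pipeline depth only shows up as a small fill cost and the throughput sits just under 1 or 2 per cycle.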
I think this example might be MIPS; it's the one without the branch delay slot. It's from the classic undergrad textbook "Computer Organization and Design: The Hardware/Software Interface" (there are so many); it's on my shelf.
-----ADDENDUM with stuff I *do* deal with------
Check out this filthy porn site: https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Individual_Core
The best you can hope for is to keep every port busy each cycle; Skylake has 8, I think. I mostly deal with Sandy Bridge, though.
Don't get me started on how they've fucked up AVX.
But yeah, the 6 ports are your bottleneck for issuing stuff (probably the widest part, so more of an anti-bottleneck). I'd love to show off, but I don't have time, so briefly:
Sandy Bridge can have a surprisingly high number of instructions in flight, but it has to retire/commit them in order; the best you can hope for is 6 per cycle (it's 4 uops, but they can be fused).
Decoding is the biggest issue with x86-64; as a rule of thumb it costs you 40% in something, be it power, die area (of the actual core of the core), or performance if you neglect it, and that's why Intel never stood a chance in phones/tablets.
I'd really love to show off, because "good with computers" sounds like I can set up a printer, but rushing it makes me look like a noob quoting from a presentation, so read the page. [removed "joke" about what "this stuff" is to me]
BTW the total number of in-flight instructions is between 100 and 200 as a "rule of thumb"; this works well because of the variability in what even your "common" compiler-generated instructions do WRT "uops" and so on.
You have to dig deep into the manuals, have very specific bits of code to talk about, and write a kernel module to poke around with and set up special performance registers; it's a mess. So it's not really worth looking for more than this rule.
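If you want to play with those rules of thumb, here's a rough sketch. The function names are made up for illustration, the 6 ports / 4 retired uops / 100 to 200 in flight are just the figures quoted above, and the second function is plain Little's law (in flight = throughput x latency), which is my framing rather than anything out of the Intel manuals.

```python
def sustained_uops_per_cycle(issue_ports, retire_width):
    """Upper bound on sustained throughput: the narrower of the two wins."""
    return min(issue_ports, retire_width)


def latency_covered(window_entries, uops_per_cycle):
    """Little's law rearranged: how many cycles of latency the in-flight
    window can cover while still sustaining uops_per_cycle."""
    return window_entries / uops_per_cycle


if __name__ == "__main__":
    # Figures from the post: 6 ports, 4 uops retired per cycle (assumed).
    rate = sustained_uops_per_cycle(issue_ports=6, retire_width=4)
    for window in (100, 200):
        print(window, rate, latency_covered(window, rate))
```

With those numbers, retirement (4) rather than the ports (6) is the sustained limit, and a window of 100 to 200 covers roughly 25 to 50 cycles of latency at full rate. Anything slower than that, like a miss that goes all the way to DRAM, will stall you regardless.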
LOL, no wonder Spectre happened!