Re: How ?
Simplest way of thinking about this is:
You want to execute one instruction. That instruction can be divided into several tasks - fetch, decode, execute, write. Suppose executing the whole instruction takes N seconds, always.
If the processor is "single stage", it will perform one instruction every N seconds, with most of the circuitry idle as it waits for something to do:
--->time
F>D>E>W
.............F>D>E>W
If you manage to design the processor (and the program) so that each of the instruction tasks listed above can be done independently, you have a pipeline 4 stages deep. In that case, you can issue a new instruction every N/4 seconds, and more of the circuitry is active at any given time. Big win. In reality, instruction interdependence and jumps may force the processor to "flush the pipeline", i.e. discard the partially executed instructions, which obviously hurts throughput. See in particular "vector processing".
--->time
F1>D1>E1>W1
..>F2>D2>E2>W2
.....>F3>D3>E3>W3
........>F4>D4>E4>W4
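To put rough numbers on the diagrams above, here is a toy timing model in Python. It only encodes the arithmetic from the text (an instruction takes N seconds in total, and an S-stage pipeline issues one instruction every N/S seconds); the names and values are illustrative, not from any real CPU:

```python
# Toy model: total run time for k instructions, each taking N seconds
# of total work, with and without a pipeline.

def single_stage_time(k, N):
    """Single-stage processor: one instruction at a time, N seconds each."""
    return k * N

def pipelined_time(k, N, stages):
    """Pipelined processor: the first instruction takes the full N seconds
    to fill the pipeline; after that, one instruction completes every
    N/stages seconds."""
    return N + (k - 1) * N / stages

N = 4.0  # seconds per instruction (illustrative)
print(single_stage_time(100, N))         # 400.0
print(pipelined_time(100, N, stages=4))  # 103.0
```

So for a long run of independent instructions, the 4-stage pipeline approaches the ideal 4x speedup: the one-time cost of filling the pipeline is amortized away.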
You can now deepen the pipeline by dividing the tasks into subtasks to issue even more instructions per N seconds. Depending on your expected workload, this may or may not make sense.
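The "may or may not make sense" part can be sketched by extending the toy model with two assumed costs: a fixed per-stage overhead (latching between stages), and a fraction of instructions that flush the pipeline, each flush wasting roughly one issue slot per discarded stage. All the numbers are made-up illustrations, not measurements:

```python
# Toy model: deepening the pipeline helps steady-state throughput but
# raises the cost of each flush and adds per-stage overhead. Numbers
# are illustrative assumptions only.

def run_time(k, N, stages, overhead, flush_rate):
    issue = N / stages + overhead        # time per issue slot
    fill = stages * issue                # filling the pipeline once
    steady = (k - 1) * issue             # one completion per slot after that
    flush_cost = k * flush_rate * (stages - 1) * issue  # refills after flushes
    return fill + steady + flush_cost

for s in (4, 16, 64, 256):
    t = run_time(1000, 4.0, s, overhead=0.05, flush_rate=0.05)
    print(f"{s:3d} stages: {t:.1f} s")
```

With these particular assumptions, run time falls as the pipeline deepens, bottoms out, then rises again as flush and overhead costs dominate: there is a sweet spot, and where it sits depends entirely on the workload.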
In the limit, you would get a processor that works asynchronously, without a central clock, where each logic gate does its work as soon as all its inputs have been set.
This has nothing to do with overall clock speed, though as frequency increases you cannot reliably deliver a good clock signal to all of the chip area "at the same time", so you are forced to compartmentalize anyway.