The most straightforward way to get more performance out of a processing unit is to speed up the clock (setting aside, for the moment, fully asynchronous designs, which one doesn't find in this space for a number of reasons). Some very early computers even had a knob to continuously adjust the clock rate to match the program being run.
But there are, of course, physical limitations on the rate at which operations can be performed. The act of fetching, decoding, and executing instructions is rather complex, even for a deliberately simplified instruction set, and there is a lot of sequentiality. There will be some minimum number of sequential gates, and thus, for a given gate delay, a minimum execution time, T(emin). By saving intermediate results of substages of execution in latches, and clocking those latches as well as the CPU inputs/outputs, execution of multiple instructions can be overlapped. Total time for the execution of a single instruction is no less, and in fact will tend to be greater, than T(emin). But the rate of instruction execution, or issue rate, can be increased by a factor proportional to the number of pipe stages.
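The arithmetic behind that claim can be made concrete with a toy timing model. This is only a sketch: the four stages and the assumption of one instruction entering the pipe per cycle (no stalls or hazards) are illustrative, not a description of any real machine.

```python
# Toy timing model of an idealized in-order pipeline (hypothetical
# 4-stage design: fetch, decode, execute, write-back; no stalls).

def unpipelined_cycles(n_instructions, n_stages):
    # Without latching intermediate results, each instruction must pass
    # through every stage before the next one may begin.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # With latches between stages, execution overlaps: after the pipe
    # fills (n_stages cycles), one instruction completes per cycle.
    return n_stages + (n_instructions - 1)

N, STAGES = 1000, 4
print(unpipelined_cycles(N, STAGES))  # 4000
print(pipelined_cycles(N, STAGES))    # 1003
print(pipelined_cycles(1, STAGES))    # 4 -- single-instruction latency is unchanged
```

Note that the time through the pipe for any one instruction is still four cycles, consistent with the observation above that per-instruction latency does not shrink; only the issue rate improves.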
The technique became practical in the early 1960s. The Manchester Atlas and the IBM Stretch project were two of the first functioning pipelined processors. From the IBM 360/91 onward, all state-of-the-art scientific computers have been pipelined.
Not every instruction requires all of the resources of a CPU. In "classical" computers, instructions tend to fall into categories: those which perform memory operations, those which perform integer computations, those which operate on floating-point values, etc. It is thus not too difficult for the processor pipeline to be further broken down "horizontally" into pipelined functional units, executing independently of one another. Fetch and decode are common to the execution of all instructions, however, and quickly become a bottleneck.
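The "horizontal" split can be sketched as a shared front end sorting decoded instructions into per-unit queues. The instruction categories and unit names below are illustrative inventions, not drawn from any real instruction set.

```python
# Sketch of horizontal partitioning: a single fetch/decode front end
# dispatches each instruction to an independent functional unit by
# category. Opcodes and unit names here are hypothetical.

from collections import defaultdict

UNIT_FOR = {
    "load": "memory", "store": "memory",
    "add": "integer", "sub": "integer",
    "fadd": "float",  "fmul": "float",
}

def dispatch(program):
    # Fetch and decode are common to all instructions (the shared
    # bottleneck noted above); each unit then drains its own queue
    # independently, and may itself be internally pipelined.
    queues = defaultdict(list)
    for op in program:
        queues[UNIT_FOR[op]].append(op)
    return dict(queues)

print(dispatch(["load", "fmul", "add", "store", "fadd"]))
# {'memory': ['load', 'store'], 'float': ['fmul', 'fadd'], 'integer': ['add']}
```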
Once the operation of a CPU is pipelined, it is fairly easy for the clock rate of the CPU to vastly exceed the cycle rate of memory, starving the decode logic of instructions. Advanced main memory designs can ameliorate the problem, but there are always technological limits. One simple mechanism to leverage instruction bandwidth across a larger number of pipelines is SIMD (Single Instruction/Multiple Data) processing, wherein the same operation is performed across ordered collections of data. Vector processing is the SIMD paradigm that has seen the most visible success in high-performance computing, but the scalability of the model has also made it appealing for massively parallel designs.
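The SIMD idea can be illustrated in a few lines, with plain Python standing in for hardware lanes. The point is only that one fetched and decoded instruction drives many data operations, amortizing the front-end cost.

```python
# Illustration of SIMD: a single "vector add" instruction is specified
# once, then applied element-wise across an ordered collection of data.
# This is a software stand-in for hardware lanes, not a real vector ISA.

def simd_add(a, b):
    # One instruction's worth of fetch/decode; len(a) additions.
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

# One instruction drives eight additions:
print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18]
```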
Another way to ameliorate the memory latency effects on instruction issue is to stage instructions in a temporary store closer to the processor's decode logic. Instruction buffers are one such structure, filled from instruction memory in advance of their being needed. An instruction cache is a larger and more persistent store, capable of holding a significant portion of a program across multiple iterations. With effective instruction cache technology, instruction fetch bandwidth has become much less of a limiting factor in CPU performance. This has pushed the bottleneck forward into the CPU logic common to all instructions: decode and issue. Superscalar design and VLIW architectures are the principal techniques in use today (1998) to attack that problem.
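Why an instruction cache is so effective for looping code can be seen in a toy model. The direct-mapped organization and the parameters below (line size, line count) are arbitrary choices for illustration.

```python
# Toy direct-mapped instruction cache: the first pass over a loop body
# misses and fills lines from memory; every later iteration hits.
# Sizes are arbitrary illustrative parameters.

class ICache:
    def __init__(self, n_lines=64, line_size=4):
        self.line_size = line_size
        self.n_lines = n_lines
        self.tags = [None] * n_lines  # which line each slot currently holds
        self.hits = self.misses = 0

    def fetch(self, addr):
        line = addr // self.line_size
        index = line % self.n_lines
        if self.tags[index] == line:
            self.hits += 1            # instruction served from the cache
        else:
            self.misses += 1          # go to (slow) instruction memory
            self.tags[index] = line   # fill the line for future fetches

cache = ICache()
# A 32-instruction loop body executed 100 times:
for _ in range(100):
    for pc in range(32):
        cache.fetch(pc)
print(cache.misses, cache.hits)  # 8 3192
```

Of the 3200 instruction fetches, only the 8 line fills on the first iteration go to memory; the loop body persists in the cache across all later iterations, which is exactly the behavior that removed instruction fetch as the dominant bottleneck.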