The goal of multiple-issue processors is to allow multiple instructions to issue in a single clock cycle. Multiple-issue processors come in three major flavors:
- Very Long Instruction Word (VLIW) processors
- Statically scheduled superscalar processors. Although statically scheduled superscalars issue a varying rather than a fixed number of instructions per clock, they are actually closer in concept to VLIWs, since both approaches rely on the compiler to schedule code for the processor.
- Dynamically scheduled superscalar processors
Static Multiple-Issue/VLIW
An issue packet is the set of instructions that issue together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor. In a static multiple-issue processor, the issue packet can be visualized as one large instruction containing multiple operations. This view led to the original name for the approach, Very Long Instruction Word (VLIW): a style of instruction set architecture that launches many operations, defined to be independent, in a single wide instruction, typically with many separate opcode fields.
Most static issue processors also rely on the compiler to take on some responsibility for handling data and control hazards. The compiler’s responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards.
Rather than attempting to issue multiple independent instructions to the functional units, a VLIW packages the multiple operations into one very long instruction. Because the advantage of a VLIW grows with the maximum issue rate, the discussion here assumes a wide-issue processor.
To keep the functional units busy, there must be enough parallelism in a code sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body.
Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, which increases code size. Whenever an instruction word is not full, the unused functional-unit slots translate to wasted bits in the instruction encoding.
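The effect of unrolling on slot utilization can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Op` record, the slot mix in `SLOTS` (two memory, two floating-point, one integer slot per word), the greedy packing policy, and a uniform one-cycle latency. It is not a description of any real VLIW compiler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str          # label for the operation
    unit: str          # functional-unit class it needs: "mem", "fp" or "int"
    deps: tuple = ()   # names of operations whose results it consumes

# Hypothetical machine: operation slots available in each long instruction word.
SLOTS = {"mem": 2, "fp": 2, "int": 1}

def pack(ops, slots=SLOTS):
    """Greedily pack ops into issue packets (assumes every op takes one cycle)."""
    done = set()            # names of ops issued in earlier packets
    pending = list(ops)
    packets = []
    while pending:
        free = dict(slots)  # slots still empty in the packet being built
        packet = []
        for op in pending:
            # An op may only use results produced by EARLIER packets, so all
            # operations inside one packet are independent of each other.
            ready = all(d in done for d in op.deps)
            if ready and free.get(op.unit, 0) > 0:
                free[op.unit] -= 1
                packet.append(op)
        if not packet:
            raise ValueError("unsatisfiable dependence or unknown unit class")
        packets.append(packet)
        done.update(op.name for op in packet)
        pending = [op for op in pending if op.name not in done]
    return packets

def wasted_slots(packets, slots=SLOTS):
    """Count empty operation slots across all packets."""
    width = sum(slots.values())
    return sum(width - len(p) for p in packets)

# One loop iteration: load, add, store, each depending on the previous op.
one_iter = [Op("ld0", "mem"), Op("add0", "fp", ("ld0",)), Op("st0", "mem", ("add0",))]

# The same loop unrolled twice: the two iterations are independent of each other.
unrolled = one_iter + [Op("ld1", "mem"), Op("add1", "fp", ("ld1",)), Op("st1", "mem", ("add1",))]

single_packets = pack(one_iter)
unrolled_packets = pack(unrolled)
```

Under these assumptions both schedules need three instruction words, but the unrolled version completes two iterations in that time and leaves fewer slots empty. The remaining empty slots show why even more aggressive unrolling is usually needed to fill a wide word.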
Different numbers of functional units and unit latencies require different versions of the code. This requirement makes migrating between successive implementations, or between implementations with different issue widths, more difficult than it is for a superscalar design.
Dynamic Multiple-Issue/Superscalar
Dynamic multiple-issue processors are also known as superscalar processors. Superscalar is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.
In superscalars, dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls. The pipeline is divided into three major units: an instruction fetch and issue unit, multiple functional units, and a commit unit.
- The first unit fetches instructions, decodes them, and sends each instruction to a corresponding functional unit for execution.
- Each functional unit has buffers, called reservation stations, which hold the operands and the operation.
- As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated.
- When the result is completed, it is sent to any reservation stations waiting for this particular result as well as to the commit unit, which buffers the result until it is safe to release the result of an operation to programmer-visible registers and memory.
- The buffer that holds the results is called the reorder buffer. Stores to memory must be buffered until commit time, either in a store buffer or in the reorder buffer.
- Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.
- The commit unit allows the store to write to memory from the buffer when the buffer has a valid address and valid data, and when the store is no longer dependent on predicted branches.
The reorder buffer and the reservation stations effectively implement register renaming.
When an instruction issues, it is copied to a reservation station for the appropriate functional unit. Any operands that are available in the register file or reorder buffer are also immediately copied into the reservation station. The instruction is buffered in the reservation station until all the operands and the functional unit are available. Once an operand has been copied, the issuing instruction no longer needs the register copy, so a later write to that register can safely overwrite the value.
If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit. The name of the functional unit that will produce the result is tracked. When that unit eventually produces the result, it is copied directly from the functional unit into the waiting reservation station, bypassing the registers.
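The operand capture and result forwarding described above can be sketched as follows. This is a simplified illustration, not a model of any particular processor: the class names `ReservationStation` and `Renamer`, the tag strings, and the register names are hypothetical, and the reorder buffer is deliberately left out, so a completed result is written straight to the register file.

```python
from dataclasses import dataclass, field

@dataclass
class ReservationStation:
    op: str
    values: dict = field(default_factory=dict)   # operand slot -> captured value
    waiting: dict = field(default_factory=dict)  # operand slot -> producer tag

    def ready(self):
        """The operation can execute once no operand is still outstanding."""
        return not self.waiting

class Renamer:
    def __init__(self, regs):
        self.regs = dict(regs)   # register file (simplification: no reorder buffer)
        self.producer = {}       # register -> tag of the in-flight op that will write it
        self.stations = {}       # tag -> reservation station

    def issue(self, tag, op, dest, srcs):
        rs = ReservationStation(op)
        for slot, reg in enumerate(srcs):
            if reg in self.producer:
                # Value not produced yet: remember who will produce it.
                rs.waiting[slot] = self.producer[reg]
            else:
                # Value available: copy it now, so later writes to reg are harmless.
                rs.values[slot] = self.regs[reg]
        self.producer[dest] = tag
        self.stations[tag] = rs
        return rs

    def broadcast(self, tag, value):
        """A functional unit finished: forward the result to every waiting station."""
        for rs in self.stations.values():
            for slot, producer_tag in list(rs.waiting.items()):
                if producer_tag == tag:
                    rs.values[slot] = value
                    del rs.waiting[slot]
        # Update the register only if this op is still its most recent producer.
        for reg, producer_tag in list(self.producer.items()):
            if producer_tag == tag:
                self.regs[reg] = value
                del self.producer[reg]
        del self.stations[tag]

r = Renamer({"x1": 10, "x2": 20, "x3": 0})
a = r.issue("A", "add", dest="x3", srcs=["x1", "x2"])  # both operands captured at issue
b = r.issue("B", "sub", dest="x1", srcs=["x3", "x2"])  # x3 is still being produced by A
waiting_before = dict(b.waiting)
r.broadcast("A", 30)                                   # A's result goes straight to B
```

In this sketch B overwrites `x1`, which A reads, yet A is unaffected because it copied the old value when it issued. That is the sense in which the buffers implement register renaming.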
Because instructions can be executed in a different order than they were fetched, this style of execution is called out-of-order execution. The results, however, are written to the programmer-visible state in the same order in which the instructions were fetched. This conservative mode is called in-order commit. If an exception occurs, the computer can point to the last instruction executed, and the only registers updated will be those written by instructions before the instruction causing the exception.
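In-order commit can be sketched with a queue that retires only from its head. The `ReorderBuffer` class, its method names, and the register names below are hypothetical and chosen for illustration; a real reorder buffer also tracks stores, exceptions, and branch state.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # fetch order, oldest entry at the left

    def allocate(self, dest):
        """Reserve an entry at issue time, in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result; this may happen in any order."""
        entry["value"] = value
        entry["done"] = True

    def commit(self, regs):
        """Retire finished entries from the head only; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            regs[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed

regs = {}
rob = ReorderBuffer()
e1, e2, e3 = rob.allocate("x1"), rob.allocate("x2"), rob.allocate("x3")

rob.complete(e3, 30)        # the youngest instruction finishes first
first = rob.commit(regs)    # nothing retires: x1 is still outstanding
rob.complete(e1, 10)
second = rob.commit(regs)   # x1 retires; x2 blocks x3
rob.complete(e2, 20)
third = rob.commit(regs)    # x2 and x3 retire together, in order
```

Although the third instruction finished first, its result stays in the buffer until both older instructions have committed, so the visible register state always matches some point in program order.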
By predicting the direction of a branch, a dynamically scheduled processor can continue to fetch and execute instructions along the predicted path. Because the instructions are committed in order, we know whether or not the branch was correctly predicted before any instructions from the predicted path are committed.
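A minimal sketch of why in-order commit makes branch speculation safe is shown below. The entry layout (`kind`, `predicted`, `actual`) and the function name are assumptions made for illustration, and every entry is assumed to have finished executing before commit is attempted.

```python
def commit_with_speculation(rob, regs):
    """Commit entries in fetch order; squash everything younger than a mispredicted branch.

    rob is a list of dicts in fetch order (hypothetical layout):
      {"kind": "alu", "dest": reg, "value": v}
      {"kind": "branch", "predicted": bool, "actual": bool}
    Returns the list of squashed (discarded) entries.
    """
    for i, entry in enumerate(rob):
        if entry["kind"] == "branch":
            if entry["predicted"] != entry["actual"]:
                # Younger entries came from the wrong path and never reach regs.
                return rob[i + 1:]
        else:
            regs[entry["dest"]] = entry["value"]
    return []

regs = {"x1": 0, "x2": 0}
rob = [
    {"kind": "alu", "dest": "x1", "value": 1},              # before the branch
    {"kind": "branch", "predicted": True, "actual": False}, # mispredicted
    {"kind": "alu", "dest": "x2", "value": 99},             # fetched on the wrong path
]
squashed = commit_with_speculation(rob, regs)
```

The wrong-path instruction executed and produced a value, but because the branch commits first, the misprediction is detected before that value can reach the register file.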
A speculative, dynamically scheduled pipeline can also support speculation on load addresses, allowing load-store reordering, with the commit unit ensuring that incorrectly speculated results never become visible.
Dynamic scheduling can handle some cases better than compiler code scheduling, at the cost of a significant increase in hardware complexity:
- Cache misses in the memory hierarchy cause unpredictable stalls. Dynamic scheduling allows the processor to hide some of those stalls by continuing to execute instructions while waiting for the stall to end.
- If the processor speculates on branch outcomes using dynamic branch prediction, the exact order of instructions cannot be known at compile time, since it depends on the predicted and actual behavior of branches. Dynamic scheduling makes it possible to handle some cases in which dependences are unknown at compile time.
- As the pipeline latency and issue width change from one implementation to another, the best way to compile a code sequence also changes. Dynamic scheduling allows the hardware to hide most of these details. Thus, users and software distributors do not need to worry about having multiple versions of a program for different implementations of the same instruction set.