Loop Unrolling is a technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are overlapped. During the unrolling process, the compiler introduces additional registers to eliminate dependences that are not true data dependences. This renaming of registers by the compiler or hardware is called register renaming. The renaming process eliminates the name dependences, while preserving the true dependences, thus allowing the compiler to better schedule independent instructions.
Math
Consider a piece of code (body) that needs to be looped times. The body can be unrolled times and the the loop can be run on the unrolled body times. If isn’t divisible by , the extra needs to accounted for.
Example
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer
;8 bytes (per DW)
BNE R1,R2,Loop ;branch R1!=R2
Instead of decrementing the pointer R1 by 8, extra registers can be used to unroll the loop 4 times followed by a single decrement of 32.
Loop: L.D F0,0(R1) ;load M[R1] in F0
ADD.D F4,F0,F2
S.D F4,0(R1)
L.D F6,-8(R1) ;load M[R1-8] in F6
ADD.D F8,F6,F2
S.D F8,-8(R1)
L.D F10,-16(R1) ;load M[R1-16] in F12
ADD.D F12,F10,F2
S.D F12,-16(R1)
L.D F14,-24(R1) ;load M[R1-24] in F16
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,-32 ;decerment by 32
BNE R1,R2,Loop
Decisions/transformations
To obtain the final unrolled code we had to make the following decisions and transformations:
- Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations (e.g., name dependences).
- Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
Limitations
The following effects limit the gains from loop unrolling
- A decrease in the amount of overhead amortized with each unroll.
- Code size limitations. For larger loops, the code size growth may be a concern particularly if it causes an increase in the instruction cache miss rate.
- Compiler limitations. Aggressive instruction scheduling can cause register pressure.