Loop Unrolling - The Beard Sage

Loop Unrolling is a technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are overlapped. During the unrolling process, the compiler introduces additional registers to eliminate dependences that are not true data dependences. This renaming of registers by the compiler or hardware is called register renaming. The renaming process eliminates the name dependences, while preserving the true dependences, thus allowing the compiler to better schedule independent instructions.

Math

Consider a piece of code (body) that needs to be looped $n$ times. The body can be unrolled $k$ times and the the loop can be run on the unrolled body $n/k$ times. If $n$ isn’t divisible by $k$ , the extra $n mod k$ needs to accounted for.

Example

Loop:   L.D       F0,0(R1)     ;F0=array element
        ADD.D     F4,F0,F2     ;add scalar in F2
        S.D       F4,0(R1)     ;store result
        DADDUI    R1,R1,#-8    ;decrement pointer
                               ;8 bytes (per DW)
        BNE       R1,R2,Loop   ;branch R1!=R2

Instead of decrementing the pointer R1 by 8, extra registers can be used to unroll the loop 4 times followed by a single decrement of 32.

Loop:	L.D 	F0,0(R1)	;load M[R1] in F0
	ADD.D 	F4,F0,F2
	S.D 	F4,0(R1)
	L.D 	F6,-8(R1)	;load M[R1-8] in F6
	ADD.D 	F8,F6,F2
	S.D 	F8,-8(R1)
	L.D 	F10,-16(R1)	;load M[R1-16] in F12
	ADD.D 	F12,F10,F2
	S.D 	F12,-16(R1)
	L.D 	F14,-24(R1)	;load M[R1-24] in F16
	ADD.D 	F16,F14,F2
	S.D 	F16,-24(R1)
	DADDUI 	R1,R1,-32	;decerment by 32
	BNE 	R1,R2,Loop

Decisions/transformations

To obtain the final unrolled code we had to make the following decisions and transformations:

Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations (e.g., name dependences).
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
Schedule the code, preserving any dependences needed to yield the same result as the original code.

Limitations

The following effects limit the gains from loop unrolling

A decrease in the amount of overhead amortized with each unroll.
Code size limitations. For larger loops, the code size growth may be a concern particularly if it causes an increase in the instruction cache miss rate.
Compiler limitations. Aggressive instruction scheduling can cause register pressure.

Math

Example

Decisions/transformations

Limitations

Leave a Reply Cancel reply