One of the reasons for low levels of vectorization is the presence of conditionals (IF statements) inside loops. IF statements introduce control dependencies into a loop.
for (i = 0; i < 64; i = i + 1) {
    if (X[i] != 0) {
        X[i] = X[i] * 2;  /* executed only for nonzero elements */
    }
}
This loop cannot normally be vectorized because of the conditional execution of the body; however, if the loop body could be run only for the iterations for which X[i] != 0, then the multiplication could be vectorized.
Mask registers essentially provide conditional execution of each element operation in a vector instruction. The vector-mask control uses a Boolean vector to control the execution of a vector instruction. When the vector-mask register is enabled, any vector instructions executed operate only on the vector elements whose corresponding entries in the vector-mask register are one. The entries in the destination vector register that correspond to a zero in the mask register are unaffected by the vector operation. Clearing the vector-mask register sets it to all ones, making subsequent vector instructions operate on all vector elements.
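The behavior described above can be emulated in plain C for the first loop. This is only a sketch: the element type (double), the one-byte-per-element mask array, and the function names set_mask_nonzero and masked_mul_scalar are assumptions made for illustration, not the instructions of any particular vector architecture. Each function stands in for one vector instruction that walks all 64 elements.

```c
#define VLEN 64  /* vector length, matching the 64-iteration loop */

/* Emulates a vector compare: mask[i] = 1 where x[i] != 0, else 0.
   (Hypothetical helper; real hardware keeps one mask bit per element.) */
void set_mask_nonzero(const double *x, unsigned char *mask) {
    for (int i = 0; i < VLEN; i++) {
        mask[i] = (x[i] != 0.0);
    }
}

/* Emulates a masked vector-scalar multiply: only elements whose mask
   entry is 1 are updated; the others are left unchanged. */
void masked_mul_scalar(double *x, double s, const unsigned char *mask) {
    for (int i = 0; i < VLEN; i++) {
        if (mask[i]) {
            x[i] = x[i] * s;
        }
    }
}
```

Calling set_mask_nonzero(X, mask) followed by masked_mul_scalar(X, 2.0, mask) gives the same result as the original loop, with the IF statement replaced by the mask.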
Consider the following snippet of code.
for (i = 0; i < 64; i = i + 1) {
    if (a[i] >= b[i]) {
        c[i] = a[i];
    } else {
        c[i] = b[i];
    }
}
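The above code populates c through the following masking steps: a vector compare sets the mask to one wherever a[i] >= b[i]; a masked copy writes a[i] into those elements of c; the mask is inverted; and a second masked copy writes b[i] into the remaining elements.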
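The masking steps for this loop can be sketched in C as a sequence of whole-vector operations. The int element type, the byte-array mask, and the helper names (set_mask_ge, masked_copy, invert_mask, vector_max) are assumptions chosen for illustration; a real vector instruction set would provide its own compare, mask, and move instructions.

```c
#define VLEN 64  /* vector length, matching the 64-iteration loop */

/* Step 1: vector compare. mask[i] = 1 where a[i] >= b[i], else 0. */
void set_mask_ge(const int *a, const int *b, unsigned char *mask) {
    for (int i = 0; i < VLEN; i++) {
        mask[i] = (a[i] >= b[i]);
    }
}

/* Steps 2 and 4: masked copy. dst[i] = src[i] only where mask[i] is 1. */
void masked_copy(int *dst, const int *src, const unsigned char *mask) {
    for (int i = 0; i < VLEN; i++) {
        if (mask[i]) {
            dst[i] = src[i];
        }
    }
}

/* Step 3: invert the mask so the "else" elements become active. */
void invert_mask(unsigned char *mask) {
    for (int i = 0; i < VLEN; i++) {
        mask[i] = !mask[i];
    }
}

/* The if/else loop rewritten as straight-line vector steps. */
void vector_max(int *c, const int *a, const int *b) {
    unsigned char mask[VLEN];
    set_mask_ge(a, b, mask);   /* mask = (a >= b)       */
    masked_copy(c, a, mask);   /* c = a where mask is 1 */
    invert_mask(mask);         /* mask = !(a >= b)      */
    masked_copy(c, b, mask);   /* c = b where a < b     */
}
```

Every element of c is written exactly once, because each index is active in exactly one of the two masked copies.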
The transformation to change an IF statement to a straight-line code sequence using conditional execution is called if conversion.
Masking introduces an overhead: conditionally executed instructions still require execution time when the condition is not satisfied, so a vector instruction executed under a mask takes the same time even for the elements where the mask is zero. Nonetheless, eliminating the branch and its associated control dependences can make the masked vector code faster than scalar execution, even though it sometimes does useless work.
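To make the overhead concrete, the fraction of element slots that do useful work can be computed from the mask. This is a simplified model rather than a measurement: it assumes, as stated above, that a masked instruction occupies all 64 element slots regardless of the mask, and the names active_elements and mask_utilization are hypothetical.

```c
#define VLEN 64  /* vector length, matching the 64-iteration loop */

/* Count the elements whose mask entry is 1, i.e. the useful work. */
int active_elements(const unsigned char *mask) {
    int n = 0;
    for (int i = 0; i < VLEN; i++) {
        n += (mask[i] != 0);
    }
    return n;
}

/* Fraction of element slots doing useful work. The denominator is always
   VLEN because masked-off elements still take execution time (assumption
   of this model). */
double mask_utilization(const unsigned char *mask) {
    return (double)active_elements(mask) / VLEN;
}
```

For example, if only 16 of the 64 mask entries are one, the utilization is 0.25: three quarters of the element slots are spent on work that is discarded.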
Vector processors make the mask registers part of the architectural state and rely on compilers to manipulate mask registers explicitly. GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software. In both cases, the hardware spends the time to execute a vector element whether the mask is zero or one.