Vector Optimization – Stride

This distance between memory locations that separates the elements to be gathered into a single register is called the stride. A stride of one unit is called a unit-stride. This is equivalent to sequential memory access.

The position in memory of adjacent elements in a vector may not be sequential. A vector processor can handle strides greater than one, called non-unit strides, using only vector load and vector store operations with stride capability. Thus vectors are able to access nonsequential memory locations and reshape them to a dense structure.

Caches inherently deal with unit stride data; increasing block size can help reduce miss rates for data with unit stride, but increasing block size can even have a negative effect for data that are accessed with non-unit strides. While blocking techniques can solve some of these problems, the ability to access data efficiently that is not contiguous remains an advantage for vector processors.

The stride value must be computed dynamically, since the size of the matrix may not be known at compile time or may change for different executions of the same statement. The vector stride can be put in a general-purpose register. LVWS fetches the vector with stride into a vector register. SVWS stores the vector back to memory.

Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently. When multiple accesses contend for a bank, a memory bank conflict occurs, thereby stalling one access. As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover bank access latency, consecutive accesses proceed in parallel.

In the following example, there are 4 memory banks, each of size 8. The vertical dimension indicates time.

Using a stride (= 8) equal to the bank size will lead to repeated access to the same location. This results in a stall. Note that memory access is costly and the processor has to stall until the data is fetched before moving on to the next stride location.

Using a stride (= 4) which is a factor of the bank size will also lead to repeated access to the same location. This also results in a stall. However, this is a tad more efficient than the previous configuration.

Using a stride (= 5) which is a relatively prime to the bank size spreads the memory access. By the time the same memory location is accessed again, the previous instruction is completed by the processor. This avoids any stalls.

A stall will occur if

    \begin{equation*}      \frac{\text{Number of banks}}{GCD(\text{Stride, Number of banks})} < \text{Bank busy time}  \end{equation*}

Leave a Reply

Your email address will not be published. Required fields are marked *