Vector Architecture - The Beard Sage

Vector architecture is a variant of SIMD (single instruction multiple data), a single instruction can launch many data operations. The programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.

Vector architectures grab sets of data elements scattered about memory, place them into large, sequential register files, operate on data in those register files, and then disperse the results back into memory. A single instruction operates on vectors of data, which results in dozens of register–register operations on independent data elements.

Since vector loads and stores are deeply pipelined, the program pays the long memory latency only once per vector load or store versus once per element, thus amortizing the latency over, say, 64 elements. Vector programs strive to keep memory busy.

Memory System

A microprocessor has a set of vector registers. Each vector register is a fixed-length bank holding a single vector. In the figure below, a vector register points to a vector of length $16$ . These $16$ memory locations are accessed through an address generator.

Instructions

Vector instructions operate on many elements simultaneously. Vector designs use slow but wide execution units to achieve high performance at low power. The independence of elements within a vector instruction set allows scaling of functional units without performing additional costly dependency checks, as superscalar processors require.

Vectors naturally accommodate varying data sizes (hardware multiplicity). One view of a vector register size is 64 64-bit data elements, but 128 32-bit elements, 256 16-bit elements, and even 512 8-bit elements are equally valid views.

Vector instructions are used to increase performance of simple in-order scalar processors without greatly increasing energy demands and design complexity. Programs that run well on complex out-of-order designs can be expressed more efficiently as data-level parallelism in the form of vector instructions.

Execution

When the compiler is able to produce vector instructions for a block of code and the resulting code spends much of its time running in vector mode, the code is said to be vectorizable. Loops can be vectorized when they do not have dependences between iterations of a loop, which are called loop-carried dependencies.

Consider the followings sequence of instructions

LD    VR <- A[3:0]
ADD   VR <- VR, 1
MUL   VR <- VR, 2
ST    A[3:0] <- VR

In an array processor, every ADD must wait for a LD, and every MUL must wait for the ADD. On the vector processor, each vector instruction will only stall for the first element in each vector, and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are required only once per vector instruction, rather than once per vector element. This forwarding of element-dependent operations is called chaining.

An array processor executes same operation at the same time while a vector processor executes different operations at the same time. An array processor has different operations in the same space while a vector processor has the same operation in the same space.

Chaining

Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available: The results from the first functional unit in the chain are forwarded to the second functional unit. Modern vector architectures use flexible chaining, which allows a vector instruction to chain to essentially any other active vector instruction, assuming that we don’t generate a structural hazard.

Chaining is just data forwarding from one vector functional unit to another.

Time

The execution time of a sequence of vector operations primarily depends on

Length of the operand vectors
Structural hazards among the operations
Data dependencies

A convoy is the set of vector instructions that could potentially execute together. The instructions in a convoy must not contain any structural hazards; if such hazards were present, the instructions would need to be serialized and initiated in different convoys. The performance of a section of code can be estimated by counting the number of convoys. Assume that a convoy of instructions must complete execution before any other instructions (scalar or vector) can begin execution.

A chime is the unit of time taken to execute one convoy. Thus, a vector sequence that consists of $m$ convoys executes in $m$ chimes. For a vector length of $n$ , this is approximately $m * n$ clock cycles. Measuring time in chimes gives a relative estimate of what fraction of the code is vectorizable.

Advantages

No dependencies within a vector. Pipelining, parallelization work well. Can have very deep pipelines
Each instruction generates a lot of work. This reduces instruction fetch bandwidth.
Highly regular memory access pattern. Can interleave multiple memory banks for higher memory bandwidth. Prefetching.
No need to explicitly code loops. Fewer branch instructions in the instruction sequence.

Disadvantages

Works (only) if parallelism is regular (data/SIMD parallelism). Very inefficient if parallelism is irregular.
Memory bandwidth can easily become a bottleneck if
- compute/memory operation balance is not maintained.
- data is not mapped appropriately to memory banks.