CUDA – Dimensions, Mapping and Indexing

Kernel Invocation

The host calls a kernel using a triple chevron \lll \ggg. In the chevrons we place the number of blocks and the number of threads per block.

\texttt{SomeKernel} \lll100, 256\ggg would launch 100 blocks of 256 threads each (total of 25600 threads).

\texttt{SomeOtherKernel} \lll50, 1024\ggg would launch 50 blocks of 1024 threads each (51200 threads in total).

Dimensions

As many parallel applications involve multidimensional data, it is convenient to organize thread blocks into 1D, 2D or 3D arrays of threads. The blocks in a grid must be able to be executed independently, as communication or cooperation between blocks in a grid is not possible. When a kernel is launched the number of threads per thread block, and the number of thread blocks is specified, this, in turn, defines the total number of CUDA threads launched. For example if the maximum x, y and z dimensions of a block are 512, 512 and 64, it should be allocated such that x \times y \times z \leq 512, which is the maximum number of threads per block. The limitation on the number of threads in a block is actually imposed because the number of registers that can be allocated across all threads is limited. Blocks can be organized into one- or two-dimensional grids (say up to 65,535 blocks) in each dimension.

dim3 is a 3d structure or vector type with three integers, x, y and z. One can initialise as many of the three coordinates as they like

   dim3 threads(256);           // Initialise with x as 256, y and z will both be 1
   dim3 blocks(100, 100);       // Initialise x and y, z will be 1
   dim3 anotherOne(10, 54, 32); // Initialises all three values, x will be 10, y gets 54 and z will be 32.

Mapping

Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array.

NameDescriptionxyz
threadIdxThread index within the block (zero-based)threadIdx.xthreadIdx.ythreadIdx.z
blockIdxBlock index within the grid (zero-based)blockIdx.xblockIdx.yblockIdx.z
blockDimBlock dimensions in threadsblockDim.xblockDim.yblockDim.z
gridDimGrid dimensions in blocksgridDim.xgridDim.ygridDim.z

Each of the above are dim3 structures and can be read in the kernel to assign particular workloads to any thread.

Indexing

Here is an example indexing scheme based on the mapping defined above.

The grid (on the left) has size (5,4,1), that is, it has 5 blocks in the x direction, 4 blocks in the y direction, and 1 block in the z direction.

Each block (on the right) is of size (5,5,1) with 5 threads along the x and y directions, and 1 thread along the z direction.

At the grid level (on the left), the tuple for each block is the 3D index, e.g. (0,0,0), and below the 3D index is the 1D index of each block, e.g. [0].

At the block level (on the right), a similar indexing scheme applies, where the tuple is the 3D index of the thread within the block and the number in the square bracket is the 1D index of the thread within the block.

During execution, the CUDA threads are mapped to the problem in an undefined manner. Randomly completed threads and blocks are shown as green to highlight the fact that the order of execution for threads is undefined.

There are a total of 20 blocks in a grid. Each block has 25 threads for a total of 500 threads. Here are the steps to find the indices for a particular thread, say thread 343. This number has to be expressed in terms of the block size.

    \begin{equation*} 343 = 25 \times 13 + 18 \end{equation*}

With respect to 0-indexing, the 17th thread of the 13th block is thread 343.

From the figure, the 13th block maps to the coordinates (3, 2, 0) and the 17th thread maps to the coordinates (2, 3, 0). Thus thread 343 is indexed by

    \begin{align*} blockIdx.x &= 3 \\ blockIdx.y &= 2 \\ blockIdx.z &= 0 \\ threadIdx.x &= 2 \\ threadIdx.y &= 3 \\ threadIdx.z &= 0 \\ \end{align*}

Leave a Reply

Your email address will not be published. Required fields are marked *