The index space supported by OpenCL is called an NDRange. An NDRange is an N-dimensional index space, where N is one, two, or three. The NDRange is decomposed into work-groups, which form blocks that cover the index space. Each work-group consists of work-items, which are conceptually similar to threads.
An NDRange is defined by two parameters:
- The global size in each dimension
- The local size in each dimension
Each work-item is assigned to a work-group and given a local ID that represents its position within the work-group. A work-item’s local ID is an N-dimensional tuple with components in the range from zero to the size of the work-group in that dimension minus one. This is shown in the figure on the right.
A work-item can also be referenced directly using global indices. Each work-item’s global ID is an N-dimensional tuple whose components range from zero to the number of elements in that dimension minus one. This is shown in the figure on the left.
Work-groups are assigned IDs similarly. The number of work-groups in each dimension is not directly defined but is inferred from the local and global NDRanges provided. The number of work-groups in a dimension is the ceiling of the global size in that dimension divided by the local size in the same dimension.
A work-group’s ID is an N-dimensional tuple with components in the range 0 to the number of work-groups minus one.
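The ceiling division described above can be sketched in a few lines of plain C. This is only an illustration of the arithmetic for a single dimension, not OpenCL API code; the function name `num_work_groups` is a hypothetical helper introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension,
   i.e. ceil(global_size / local_size) using integer arithmetic only. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0); /* a work-group holds at least one work-item */
    return (global_size + local_size - 1) / local_size;
}
```

For instance, `num_work_groups(1024, 64)` gives 16, so valid work-group IDs in that dimension run from 0 to 15. Inside a kernel, the built-in `get_num_groups()` reports the corresponding value for a given dimension.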
The combination of a work-group ID and the local ID within that work-group uniquely defines a work-item. Each work-item is therefore identifiable in two ways: by its global index, or by its work-group index plus its local index within the work-group.
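The two ways of identifying a work-item can be converted into each other. The sketch below shows the conversion for one dimension in plain C, assuming a global offset of zero and work-groups of uniform size; the type `work_item_pos` and both function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair: position of a work-item expressed as
   (work-group ID, local ID) along one dimension. */
typedef struct {
    size_t group_id;
    size_t local_id;
} work_item_pos;

/* Global ID -> (work-group ID, local ID), assuming zero global offset. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* (work-group ID, local ID) -> global ID, assuming zero global offset. */
size_t join_global_id(work_item_pos pos, size_t local_size)
{
    assert(pos.local_id < local_size); /* local ID must lie inside the group */
    return pos.group_id * local_size + pos.local_id;
}
```

With a local size of 16, global ID 70 corresponds to work-group 4 and local ID 6. In a kernel the same relationship holds between the built-ins: under the assumptions above, `get_global_id(d)` equals `get_group_id(d) * get_local_size(d) + get_local_id(d)`.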
During kernel execution, each dimension is executed in parallel. A work-item (thread) is executed for every point in the global dimensions.
| Global Dimensions | # Work Items |
|---|---|
| 1920 * 1080 | 2M |
| 256 * 256 * 256 | 16M |
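The work-item counts in the table are simply the product of the global sizes. A minimal C sketch of that calculation follows; `total_work_items` is a hypothetical helper name, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total number of work-items launched for an NDRange,
   i.e. the product of the global size in each of the work_dim dimensions. */
size_t total_work_items(const size_t *global_size, size_t work_dim)
{
    assert(work_dim >= 1 && work_dim <= 3); /* an NDRange has 1 to 3 dimensions */
    size_t total = 1;
    for (size_t d = 0; d < work_dim; ++d)
        total *= global_size[d];
    return total;
}
```

For global dimensions of 1920 * 1080 this yields 2,073,600 work-items (roughly 2M), and for 256 * 256 * 256 it yields 16,777,216 (16M).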
The global dimensions are broken down evenly into local work-groups. The host code can define the partitioning to work-groups, or leave it to the implementation.
Each work-group is logically executed together on one compute unit. Synchronization is only allowed between work-items in the same work-group.
A work-item is similar to a thread in terms of its control flow and its memory model. It is distinguished from the other work-items in the collection by its global ID and local ID. Work-items in the same work-group can share data via local memory, and they synchronize via barriers and memory fences.
A work-group is a collection of related work-items that must map to a single compute unit. Work-groups cannot synchronize with each other; OpenCL only supports global synchronization at the end of a kernel execution.
A work-group runs in its entirety on one compute unit; the work-items of a work-group cannot be split across compute units. If the size of a work-group does not match the hardware of the compute unit, the result is poor utilization.
For example, consider a work-group that can hold threads. A compute unit has hardware to process threads at a time. The compute unit will work at full utilization during the first three cycles. During the last cycle only threads need to be run, which leads to poor utilization.
If, instead, the work-group were defined to hold threads, the compute unit would work at full utilization during all four cycles.
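The effect of a mismatched work-group size can be estimated with simple arithmetic. The C sketch below uses a deliberately simplified model in which a compute unit processes a fixed number of threads ("lanes") per cycle. The function names are hypothetical, and the figures used afterwards (32 lanes, work-groups of 100 and 128 threads) are illustrative assumptions chosen for this sketch, not the values of the example above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: cycles needed to run one work-group when the
   compute unit processes `lanes` threads per cycle. */
size_t cycles_needed(size_t group_size, size_t lanes)
{
    assert(lanes > 0);
    return (group_size + lanes - 1) / lanes;
}

/* Threads that are actually active during the final cycle. */
size_t threads_in_last_cycle(size_t group_size, size_t lanes)
{
    assert(lanes > 0 && group_size > 0);
    size_t rem = group_size % lanes;
    return rem == 0 ? lanes : rem;
}

/* Fraction of lane-cycles doing useful work (1.0 means full utilization). */
double utilization(size_t group_size, size_t lanes)
{
    assert(group_size > 0);
    return (double)group_size /
           (double)(cycles_needed(group_size, lanes) * lanes);
}
```

Under these assumed numbers, a work-group of 100 threads on 32 lanes needs four cycles, with only 4 threads active in the last one, for a utilization of 0.78125. A work-group of 128 threads also needs four cycles but keeps all 32 lanes busy throughout, for a utilization of 1.0.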