The OpenCL memory model describes the structure, contents, and behavior of the memory exposed by an OpenCL platform as an OpenCL program runs. The model allows a programmer to reason about values in memory as the host program and multiple kernel-instances execute.
An OpenCL program defines a context that includes a host, one or more devices, command queues, and memory exposed within the context. Consider the units of execution involved with such a program. The host program runs as one or more host threads managed by the operating system running on the host (the details of which are defined outside of OpenCL). There may be multiple devices in a single context which all have access to memory objects defined by OpenCL. On a single device, multiple work-groups may execute in parallel with potentially overlapping updates to memory. Finally, within a single work-group, multiple work-items concurrently execute, once again with potentially overlapping updates to memory.
Memory in OpenCL is divided into two parts:
Host Memory
The memory directly available to the host. The detailed behavior of host memory is defined outside of OpenCL.
Device Memory
Memory directly available to kernels executing on OpenCL devices.
Device memory consists of four named address spaces or memory regions:
Global Memory
This memory region permits read/write access to all work-items in all work-groups running on any device within a context. Work-items can read from or write to any element of a memory object.
Constant Memory
A region of global memory that remains constant during the execution of a kernel-instance. The host allocates and initializes memory objects placed into constant memory.
Local Memory
A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group.
Private Memory
A region of memory private to a work-item. Variables defined in one work-item’s private memory are not visible to another work-item.
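The scope of the four regions can be illustrated with a small host-side emulation. This is a sketch in plain C, not OpenCL C: the nested loops stand in for work-groups and work-items, and the names scaled_group_sums, GROUP_SIZE, scale, and partial are hypothetical, chosen only for this example. In a real kernel the corresponding variables would carry the __global, __constant, __local, and __private address space qualifiers.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* work-items per work-group (assumed for this sketch) */

/* Stands in for constant memory: set up by the host, read-only in the kernel. */
static const float scale = 2.0f;

/* Emulates one kernel launch over n elements.
 * `in` and `group_sums` stand in for memory objects in global memory. */
void scaled_group_sums(const float *in, float *group_sums, size_t n)
{
    assert(n % GROUP_SIZE == 0);  /* the sketch handles whole work-groups only */

    for (size_t group = 0; group < n / GROUP_SIZE; ++group) {
        /* Stands in for local memory: one instance per work-group,
         * shared by all work-items of that group. */
        float partial[GROUP_SIZE];

        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            /* Stand in for private memory: each work-item has its own
             * gid and value, invisible to every other work-item. */
            size_t gid = group * GROUP_SIZE + lid;
            float value = scale * in[gid];
            partial[lid] = value;
        }

        /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here so that
         * all writes to partial are visible before any work-item reads them.
         * The sequential loops of this emulation give that ordering for free. */
        float sum = 0.0f;
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
            sum += partial[lid];

        /* One result per work-group goes back to global memory. */
        group_sums[group] = sum;
    }
}
```

With eight input values 1 through 8, the two work-groups write 20 and 52 to group_sums. The emulation runs work-items one after another, so it shows only which data each region makes visible to whom, not the concurrency of a real device.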
Memory Allocation
The host first allocates space in the device’s global memory, then explicitly copies the data from host memory into it. Once the copy completes, every compute unit on the device (typically a GPU) can access that global memory.
Local memory must also be allocated before it is used. The same amount of space is allocated for every work-group, and each work-group gets its own instance on the compute unit where it runs. The host cannot write to local memory directly, so moving data from global memory into local memory has to be done on the device: the kernel code itself copies the data from global memory into local memory.
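The two steps described in this section can be sketched as a host-side emulation in plain C. It is an illustration under stated assumptions, not OpenCL code: malloc and memcpy stand in for clCreateBuffer and clEnqueueWriteBuffer / clEnqueueReadBuffer, the loops stand in for work-groups and work-items, and the names reverse_tiles, reverse_tiles_kernel, TILE, and tile are hypothetical. Each work-group stages its slice of global memory into a local tile and then writes it back reversed, so every work-item reads a value that a different work-item staged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TILE 4  /* work-items per work-group, and elements of local memory per group */

/* Emulated kernel: global_in and global_out stand in for global memory. */
static void reverse_tiles_kernel(const int *global_in, int *global_out, size_t n)
{
    for (size_t group = 0; group < n / TILE; ++group) {
        int tile[TILE];  /* stands in for local memory: one instance per work-group */

        /* Phase 1: each work-item copies one element from global to local memory. */
        for (size_t lid = 0; lid < TILE; ++lid)
            tile[lid] = global_in[group * TILE + lid];

        /* A real kernel would call barrier(CLK_LOCAL_MEM_FENCE) here, so that
         * the whole tile is filled before any work-item reads from it. */

        /* Phase 2: each work-item reads an element staged by another work-item. */
        for (size_t lid = 0; lid < TILE; ++lid)
            global_out[group * TILE + lid] = tile[TILE - 1 - lid];
    }
}

/* Emulated host program. Returns 0 on success, -1 if allocation fails. */
int reverse_tiles(const int *host_in, int *host_out, size_t n)
{
    assert(n % TILE == 0);  /* the sketch handles whole work-groups only */
    if (n == 0)
        return 0;

    /* Step 1: allocate space in (emulated) device global memory. */
    int *dev_in = malloc(n * sizeof *dev_in);
    int *dev_out = malloc(n * sizeof *dev_out);
    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }

    /* Step 2: explicit copy from host memory to device global memory. */
    memcpy(dev_in, host_in, n * sizeof *dev_in);

    /* Step 3: run the kernel; the global-to-local copy happens inside it. */
    reverse_tiles_kernel(dev_in, dev_out, n);

    /* Step 4: explicit copy of the result back to host memory. */
    memcpy(host_out, dev_out, n * sizeof *dev_out);

    free(dev_in);
    free(dev_out);
    return 0;
}
```

In an actual OpenCL program the local tile is either declared inside the kernel with a fixed size, or passed as a __local pointer argument whose size the host sets through clSetKernelArg with a NULL argument value. The emulation fixes the size at compile time to keep the example short.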