OpenCL – Architecture and Program


A platform is a vendor's OpenCL implementation together with the collection of devices it exposes. The platform determines how data can be shared efficiently: if a platform supports both a CPU and a GPU, the vendor has typically optimized data flow between the two devices. Data sharing within a platform is therefore generally more efficient than sharing across platforms.


The individual CPUs and GPUs are called devices. The CPU can serve as both an OpenCL device and the host processor. Devices are connected via a bus, and each device has its own attached memory, with access to that memory limited by its peak bandwidth.


The context is the environment within which the kernels execute. This environment includes

  • A set of devices. All devices in a context must be in the same platform.
  • The memory accessible to those devices
  • One or more command queues used to schedule kernel executions or operations on memory objects.

Contexts contain and manage the state of the world in OpenCL. Work is submitted to the devices in a context as commands placed in a command queue. These commands fall into three categories:

  • Kernel execution commands
  • Memory commands – transfer or mapping of memory object data
  • Synchronization commands – constrain the order in which commands execute

Command Queues

To submit work to a device, a command queue has to be created. The program puts work into this queue; each command eventually reaches the front of the queue and is executed on the device. To execute on another device, a new command queue has to be created, so a command queue is needed for every device that is used. This means there is no automatic distribution of work across devices. Each device can run the same kernel, but a kernel tuned for one kind of device (say, a GPU) may not perform optimally on another (a CPU).

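Each command queue points to a single device within a context. A single device can be attached to multiple command queues simultaneously. Queues can be either in-order or out-of-order: an in-order queue executes commands in the order they were enqueued, while an out-of-order queue may reorder them, so any required ordering must be enforced with synchronization commands.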

Data Movement

There is no automatic data movement. The user gets full control over performance and must explicitly

  • Allocate global data
  • Write to it from the host
  • Allocate local data
  • Copy data from global to local (and back)

OpenCL Program

  1. Setup
    1. Get the devices (and platform)
    2. Create a context (for sharing between devices)
    3. Create command queues (for submitting work)
  2. Compilation
    1. Create a program
    2. Build the program (compile)
    3. Create kernels
  3. Create memory objects
  4. Enqueue writes to copy data to the device
  5. Set the kernel arguments
  6. Enqueue kernel executions
  7. Enqueue reads to copy data back from the device
  8. Wait for your commands to finish
  9. Clean up OpenCL resources

OpenCL is asynchronous: enqueueing a command only places it in the queue, and the call returns without telling us when the command will finish. By explicitly waiting, we make sure it has finished before continuing.
