Efficient Scientific Computing 2024

GPU Portability

Thread Organization
- CUDA: Grids, blocks, and threads.
- OpenMP: Teams and threads.
- SYCL: ND-ranges, work groups, and work items.
- HIP: Similar to CUDA, with minor syntax changes.
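
To make the hierarchy concrete, here is a minimal host-side sketch of how a one-dimensional grid of blocks maps onto a flat array index, in the way CUDA and HIP kernels usually compute blockIdx.x * blockDim.x + threadIdx.x. Nothing here runs on a GPU: the helper names global_index, blocks_for, and cover are hypothetical, and the two nested loops only emulate what the hardware would do in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat index of one thread, mirroring the usual
// blockIdx.x * blockDim.x + threadIdx.x expression in a CUDA/HIP kernel.
std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                         std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Number of blocks needed to cover n elements (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

// Host-side emulation: visit every (block, thread) pair and count how
// often each element is touched. On a GPU these iterations run in parallel.
std::vector<int> cover(std::size_t n, std::size_t block_dim) {
    std::vector<int> hits(n, 0);
    const std::size_t grid_dim = blocks_for(n, block_dim);
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            const std::size_t i = global_index(b, block_dim, t);
            if (i < n) {  // bounds guard: the last block may be only partly used
                ++hits[i];
            }
        }
    }
    return hits;
}
```

With n = 1000 and a block size of 256, blocks_for returns 4 and every element is visited exactly once. The bounds guard is what keeps the surplus threads of the last block from writing out of range.
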
Kernel Execution
- CUDA: __global__ functions launched with <<<...>>> syntax.
- OpenMP: #pragma omp target directives to offload code.
- SYCL: Using queue.submit() with lambda functions.
- HIP: __global__ functions launched with hipLaunchKernelGGL.
Memory Management
- Explicit Control: Every model lets you manage data movement between host and device explicitly.
- Unified Memory Options: Some models offer unified memory spaces to simplify development.
Memory Allocation
- CUDA: cudaMalloc(&ptr, size);
- HIP: hipMalloc(&ptr, size);
- SYCL: malloc_device(size, queue);
- OpenMP: Memory mapped with map clauses.
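
As a rough illustration of the OpenMP entry, the sketch below uses map clauses in place of an explicit allocation call. The function name saxpy_mapped is hypothetical, and the code assumes x and y have the same length. When built without OpenMP offload support, the pragma is ignored or falls back to the host, so the loop still produces the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: y = a * x + y, with data movement described by map clauses.
// Assumes x.size() == y.size().
void saxpy_mapped(float a, const std::vector<float>& x, std::vector<float>& y) {
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();  // raw pointers, so array sections can be mapped
    float* yp = y.data();

    // map(to: ...)     copies host -> device when the region starts.
    // map(tofrom: ...) copies in at the start and back to the host at the end.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling saxpy_mapped(2.0f, x, y) with x = {1, 2, 3} and y = {10, 20, 30} leaves y = {12, 24, 36}. The scalars a and n need no map clause because scalars are firstprivate in a target region by default.
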

CUDA: Launching kernels with an execution configuration: kernel<<<gridDim, blockDim>>>(args);

HIP: Similar syntax to CUDA, with slight differences: hipLaunchKernelGGL(kernel, gridDim, blockDim, sharedMem, stream, args);

SYCL: Using command groups and lambda expressions:

queue.submit([&](handler &h) {
    h.parallel_for(nd_range<1>(globalRange, localRange), [=](nd_item<1> item) {
        // Kernel code
    });
});

OpenMP: Offloading code blocks with pragmas:

#pragma omp target teams distribute parallel for
for (int i = 0; i < N; i++) {
  // Loop body
}
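
To show what such a loop might look like with a real body, here is a minimal sketch of a 3-point averaging stencil offloaded with the same pragma. It is not taken from the hands-on material: the function name stencil_1d and the choice to copy boundary points unchanged are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1D stencil: each interior point becomes the average of itself
// and its two neighbours; the two boundary points are copied unchanged.
std::vector<double> stencil_1d(const std::vector<double>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in);  // start from a copy so boundaries are preserved
    const double* src = in.data();
    double* dst = out.data();

    // Offload the loop; iterations are independent, so they can be spread
    // over teams and threads. Without offload support this runs on the host.
    #pragma omp target teams distribute parallel for map(to: src[0:n]) map(tofrom: dst[0:n])
    for (int i = 1; i < n - 1; ++i) {
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
    }
    return out;
}
```

For the input {3, 0, 3, 6, 0} this yields {3, 2, 3, 3, 0}. Reading from src and writing to a separate dst avoids a race, since no iteration reads a value that another iteration writes.
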

Have a look in the hands-on/portable_stencil directory for examples of CUDA, HIP, SYCL, and OpenMP code.
