CUDA.jl 5.0 - Integrated Profiler and Task Synchronization Changes

Date Published

Sep 19, 2023

CUDA.jl 5.0 is a major release that adds an integrated profiler and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.

Integrated Profiler

The most exciting new feature in CUDA.jl 5.0 is the integrated profiler, which is similar to the @profile macro from Julia's Profile standard library. The profiler can be used by simply prefixing any code that uses the CUDA libraries with CUDA.@profile:

julia> CUDA.@profile CUDA.rand(1).+1
Profiler ran 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs (85.97% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   76.47% │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
│    5.42% │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
│    2.93% │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
│    0.36% │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs (0.80% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│    0.44% │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│    0.36% │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
                                                                                  1 column omitted
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.7242923

The output shown above is a summary of what happened during the execution of the code. It is split into two sections: host-side activity, i.e., API calls to the CUDA libraries, and the resulting device-side activity. For each section, the output shows the time spent and the ratio to the total execution time. These ratios are important: they are a quick way to assess the performance of your code. For example, in the above output, we see that most of the time is spent on the host calling the CUDA libraries, and only very little time is actually spent computing on the GPU. This indicates that the GPU is severely underutilized, which can be solved by increasing the problem size.
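As a rough illustration (a hypothetical example; the exact numbers will depend on your GPU), profiling the same operation on a much larger array should show the device-side share of the trace rise considerably, since the fixed cost of the API calls is then amortized over more actual work:

julia> CUDA.@profile CUDA.rand(4096, 4096) .+ 1;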

Instead of a summary, it is also possible to view a chronological trace by passing the trace=true keyword argument:

julia> CUDA.@profile trace=true CUDA.rand(1).+1;
Profiler ran 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs (86.40% of the trace)
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │     Start │      Time │                     Name │ Details                │
├────┼───────────┼───────────┼──────────────────────────┼────────────────────────┤
│  5 │   6.44 µs │   9.06 µs │  cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│  7 │  19.31 µs │ 715.26 ns │         cudaGetLastError │ -                      │
│  8 │  22.41 µs │ 204.09 µs │         cudaLaunchKernel │ -                      │
│  9 │ 227.21 µs │    0.0 ns │         cudaGetLastError │ -                      │
│ 14 │  232.7 µs │   3.58 µs │  cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │   7.39 µs │           cuLaunchKernel │ -                      │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs (0.91% of the trace)
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                       ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼─────────────────────────────────────────────
│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
                                                                                  1 column omitted

Here, we can see a list of events that the profiler captured. Each event has a unique ID, which can be used to correlate host-side and device-side events. For example, we can see that event 8 on the host is a call to cudaLaunchKernel, which corresponds to the execution of a CURAND kernel on the device.

The integrated profiler is a great tool to quickly assess the performance of your GPU application, identify bottlenecks, and find opportunities for optimization. For complex applications, however, it is still recommended to use NVIDIA's Nsight Systems or Nsight Compute profilers, which provide a more detailed, graphical view of what is happening on the GPU.
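For those workflows, the previous behavior of CUDA.@profile, activating an external profiler for just the enclosed code, remains available through the external keyword argument. A sketch, assuming Julia was launched under Nsight Systems:

julia> CUDA.@profile external=true CUDA.rand(1) .+ 1;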

Synchronization on worker threads

Another noteworthy change affects how tasks are synchronized. To enable concurrent execution, i.e., to make it possible for other Julia tasks to execute while waiting for the GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a significant source of latency, at least 25 µs per invocation but sometimes much longer, and have also been slated for deprecation and eventual removal from the CUDA toolkit.

Instead, on Julia 1.9 and later, CUDA.jl now uses worker threads to wait for GPU operations to finish. This mechanism is significantly faster, taking around 5 µs per invocation, but more importantly offers a much more reliable and predictable latency. You can observe this mechanism using the integrated profiler:

julia> a = CUDA.rand(1024, 1024, 1024);
julia> CUDA.@profile trace=true CUDA.@sync a .+ a
Profiler ran 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms (95.64% of the trace)
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│  ID │     Start │      Time │ Thread │                    Name │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
│   9 │  36.72 µs │ 199.56 µs │      1 │          cuLaunchKernel │
│ 525 │ 510.69 µs │  11.75 ms │      2 │     cuStreamSynchronize │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘

For some users, this may still be too slow, so we have added two mechanisms that disable nonblocking synchronization and simply block the calling thread until the GPU operation finishes. The first is a global setting: the nonblocking_synchronization preference, which can be set to false using Preferences.jl. The second is a fine-grained flag to pass to synchronization functions: synchronize(x; blocking=true), CUDA.@sync blocking=true ..., etc. Neither of these mechanisms should be used widely; they are only intended for latency-critical code, e.g., when benchmarking or profiling.

Local toolkit discovery

One of the breaking changes involves how local toolkits are discovered when opting out of the use of artifacts. Previously, this could be enabled by calling CUDA.set_runtime_version!("local"), which generated a version = "local" preference. We are now changing this into two separate preferences, version and local, where the version preference overrides the version of the CUDA toolkit, and the local preference independently indicates whether to use a local CUDA toolkit or not.

Concretely, this means that you will now need to call CUDA.set_runtime_version!(local_toolkit=true) to enable the use of a local toolkit. The toolkit version will be auto-detected, but can be overridden by also passing a version: CUDA.set_runtime_version!(version; local_toolkit=true). This may be necessary when CUDA is not available during precompilation, e.g., on the log-in node of a cluster, or when building a container image.
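For example, assuming a system-wide CUDA 12.2 installation (substitute whatever version your system actually provides):

julia> using CUDA

julia> CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)

As with other preference changes, this takes effect after restarting Julia.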

Raised minimum requirements

Finally, CUDA.jl 5.0 raises the required Julia and CUDA versions. The minimum Julia version is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit version is now 11.4, but this cannot be enforced by the package manager. As a result, if you need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or below. The README now contains a table of supported CUDA toolkit versions and the corresponding CUDA.jl release.
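If you are in that situation, a quick sketch of installing and pinning the last compatible release from the Pkg REPL:

pkg> add CUDA@4.4

pkg> pin CUDA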

Most users will not be affected by this change: If you use the artifact-provided CUDA toolkit, you will automatically get the latest version supported by your CUDA driver.

