Advanced Topics#
Overview
This notebook covers basic multi-threading and programming NVIDIA GPUs in Julia. Check out the JuliaGPU organization page for more on how to program GPUs in Julia, including GPUs from other vendors such as AMD and Intel.
Objectives:
- Parallelize for loops with Threads.@threads.
- Demonstrate the high-level interface for programming GPUs in Julia.
using BenchmarkTools
using OhMyThreads
Multi-Threading#
Julia’s base Threads library provides the Threads.@threads macro to turn any for loop into a parallel for loop. To change the number of threads Julia has access to, set the environment variable JULIA_NUM_THREADS. This must be done before starting the notebook (e.g. export JULIA_NUM_THREADS=4).
println("Julia has access to $(Threads.nthreads()) threads")
print(ENV["JULIA_NUM_THREADS"])
Julia has access to 2 threads
2
function single_thread!(vals)
    for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
function all_threads!(vals)
    # :static scheduling assigns each thread a fixed contiguous chunk of iterations
    Threads.@threads :static for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
function broadcasted!(vals)
    vals .= sqrt.(vals).*sqrt.(vals)
    return vals
end
broadcasted! (generic function with 1 method)
vals = abs.(rand(1000000));
@benchmark single_thread!($vals)
@benchmark broadcasted!($vals)
@benchmark all_threads!($vals)
The Threads library does not let you set the number of tasks; to do that we can use the OhMyThreads.jl library. In OhMyThreads, the Threads.@threads macro is replaced with the @tasks macro, inside of which we can set properties such as ntasks as well as the type of thread scheduling.
function ohmythreads!(vals, n_threads = Threads.nthreads())
    @tasks for i in eachindex(vals)
        @set ntasks = n_threads
        @set scheduler = :static
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
@benchmark ohmythreads!($vals)
GPU Array Programming#
With the power of multiple dispatch, we can perform many operations on the GPU just by changing the type of our data from Array to CuArray (or the analogous type for AMD, Intel, or Apple). The package environment does not have access to a GPU, so to test the cells below, install one of the CUDA.jl/AMDGPU.jl/Metal.jl/oneAPI.jl packages and try running the code yourself! The example below is written for CUDA-enabled GPUs.
# This example is for CUDA but is the same principle for all the GPU libraries
# using CUDA or AMDGPU or oneAPI or Metal or KernelAbstractions
using CUDA

# Moving data from CPU to GPU.
A = rand(3,3)
cu_B = CuArray{Float32}(A)

# Can also pre-allocate storage on the GPU and copy data to it.
gpu_storage = CUDA.zeros(Float32, 3, 3)
gpu_array2 = copyto!(gpu_storage, A)
If we multiply cu_B by itself, this multiplication will take place on the GPU automatically because the type of cu_B is a CuArray!
cu_C = cu_B * cu_B
There are many more high-level functions that “just work” on the GPU. For example, most elementwise operations like +, -, *, and sin have GPU implementations. There are also abstractions such as reduce and mapreduce which allow you to apply an operation or function across an array. Note that all high-level GPU operations should act on an entire array and not a single element at a time. Check out the CUDA.jl docs on array programming for more complex examples.
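As a quick illustration of the elementwise case, here is a minimal sketch (the names cu_D and elementwise are illustrative and not used elsewhere in this notebook):
# Broadcasting over a CuArray fuses the elementwise operations into a single GPU kernel.
cu_D = CUDA.rand(5)
elementwise = sin.(cu_D) .+ 2 .* cu_D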
cu_C = CUDA.rand(5)                       # random vector generated directly on the GPU
sum_of_C = sum(cu_C)                      # reduction runs on the GPU
sum_of_C_also = reduce(+, cu_C)           # same reduction written with reduce
f = (x) -> x^2
C_squared = map(f, cu_C)                  # apply f to every element on the GPU
sum_of_C_squared = mapreduce(f, +, cu_C)  # fused map and reduction
You’re also able to write custom GPU kernels inside of Julia. The KernelAbstractions library unifies all GPU backends into a common interface, allowing you to write a single GPU kernel which can run on any GPU vendor’s hardware. Each vendor-specific library also has its own kernel programming model. GPU kernels can get extremely complex; please read the full documentation for more details. In general, you define your kernel function and invoke it with a macro that specifies the hardware topology:
function my_kernel(a)
    i = threadIdx().x   # index of this thread within the block
    a[i] = 42
    return
end

a = CuArray{Int}(undef, 5);
@cuda threads=length(a) my_kernel(a);
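The same kernel can also be written once with KernelAbstractions and launched on whichever backend owns the array. Below is a minimal sketch of that idea, assuming KernelAbstractions.jl is installed; the names fill_42! and backend are illustrative:
using KernelAbstractions

# Portable kernel: @index(Global) gives this work-item's global index.
@kernel function fill_42!(a)
    i = @index(Global)
    a[i] = 42
end

# Launch on the backend that owns `a` (e.g. CUDABackend for a CuArray, CPU() for an Array).
backend = get_backend(a)
fill_42!(backend)(a, ndrange = length(a))
KernelAbstractions.synchronize(backend)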