Advanced Topics#
Overview
This notebook covers basic multi-threading and programming NVIDIA GPUs in Julia. Check out the JuliaGPU organization page for more on how to program GPUs in Julia, including GPUs from other vendors such as AMD and Intel.
Objectives:
- Parallelize for loops with Threads.@threads.
- Demonstrate the high-level interface for programming GPUs in Julia.
using BenchmarkTools
using OhMyThreads
Multi-Threading#
Julia’s base Threads library provides the Threads.@threads macro to turn any for loop into a parallel for loop. To change the number of threads Julia has access to, set the environment variable JULIA_NUM_THREADS. This must be done before starting the notebook (e.g. export JULIA_NUM_THREADS=4).
println("Julia has access to $(Threads.nthreads()) threads")
print(ENV["JULIA_NUM_THREADS"])
Julia has access to 2 threads
2
function single_thread!(vals)
    for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
function all_threads!(vals)
    # :static scheduling assigns each thread a fixed contiguous chunk of iterations
    Threads.@threads :static for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
function broadcasted!(vals)
    vals .= sqrt.(vals).*sqrt.(vals)
    return vals
end
broadcasted! (generic function with 1 method)
vals = abs.(rand(1000000));
@benchmark single_thread!($vals)
@benchmark broadcasted!($vals)
@benchmark all_threads!($vals)
The Threads library does not let you set the number of tasks; to do that we can use the OhMyThreads.jl library. In OhMyThreads, the Threads.@threads macro is replaced with the @tasks macro, inside of which we can set properties such as ntasks as well as the type of thread scheduling.
function ohmythreads!(vals, n_threads = Threads.nthreads())
    @tasks for i in eachindex(vals)
        @set ntasks = n_threads
        @set scheduler = :static
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end
@benchmark ohmythreads!($vals)
GPU Array Programming#
With the power of multiple dispatch, we can perform many operations on the GPU just by changing the type of our data from Array to CuArray (or the analogous type for AMD, Intel, or Apple). The package environment does not have access to a GPU, so to test the cells below, install one of the CUDA.jl/AMDGPU.jl/Metal.jl/oneAPI.jl packages and try running the code yourself! The example below is written for CUDA-enabled GPUs.
# This example is for CUDA but is the same principle for all the GPU libraries
# using CUDA or AMDGPU or oneAPI or Metal or KernelAbstractions
using CUDA

# Moving data from CPU to GPU.
A = rand(3,3)
cu_B = CuArray{Float32}(A)

# Can also pre-allocate storage on the GPU and copy data to it.
gpu_storage = CUDA.zeros(Float32, 3, 3)
gpu_array2 = copyto!(gpu_storage, A)
If we multiply cu_B by itself, this multiplication will take place on the GPU automatically because the type of cu_B is a CuArray!
cu_C = cu_B * cu_B
There are many more high-level functions that “just work” on the GPU. For example, most elementwise operations like +, -, *, and sin have GPU implementations. There are also abstractions such as reduce and mapreduce which allow you to apply an operation or function across an array. Note that all high-level GPU operations should act on an entire array and not a single element at a time. Check out the CUDA.jl docs on array programming for more complex examples.
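As a quick illustration of the elementwise case, here is a minimal sketch (the names cu_D and elementwise are illustrative and not used elsewhere in this notebook):
# Broadcasting over a CuArray fuses the elementwise operations into a single GPU kernel.
cu_D = CUDA.rand(5)
elementwise = sin.(cu_D) .+ 2 .* cu_D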
cu_C = CUDA.rand(5)                       # random vector generated directly on the GPU
sum_of_C = sum(cu_C)                      # reduction runs on the GPU
sum_of_C_also = reduce(+, cu_C)           # same reduction written with reduce
f = (x) -> x^2
C_squared = map(f, cu_C)                  # apply f to every element on the GPU
sum_of_C_squared = mapreduce(f, +, cu_C)  # fused map and reduction
You’re also able to write custom GPU kernels inside of Julia. The KernelAbstractions library unifies all GPU backends into a common interface, allowing you to write a single GPU kernel which can run on any GPU vendor’s hardware. Each vendor-specific library also has its own kernel programming model. GPU kernels can get extremely complex; please read the full documentation for more details. In general, you define your kernel function and invoke it with a macro that specifies the hardware topology:
function my_kernel(a)
    i = threadIdx().x   # index of this thread within the block
    a[i] = 42
    return
end

a = CuArray{Int}(undef, 5);
@cuda threads=length(a) my_kernel(a);
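The same kernel can also be written once with KernelAbstractions and launched on whichever backend owns the array. Below is a minimal sketch of that idea, assuming KernelAbstractions.jl is installed; the names fill_42! and backend are illustrative:
using KernelAbstractions

# Portable kernel: @index(Global) gives this work-item's global index.
@kernel function fill_42!(a)
    i = @index(Global)
    a[i] = 42
end

# Launch on the backend that owns `a` (e.g. CUDABackend for a CuArray, CPU() for an Array).
backend = get_backend(a)
fill_42!(backend)(a, ndrange = length(a))
KernelAbstractions.synchronize(backend)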