Optimizing Code#
Overview
This notebook will cover techniques for going from "Julia code that runs" to "Julia code that runs *fast*(er)!" Another useful "one-stop-shop" resource in this category (there are many more!) is the Optimization post on Modern Julia Workflows.
Questions:
How do I time and benchmark code in Julia?
How do I avoid unnecessary memory allocations?
Array views
Re-using memory
How do I use statically sized arrays?
How do I check for type stability?
Objectives:
Learn common performance pitfalls and how to catch and avoid them.
using BenchmarkTools
using StaticArrays
using Random
using LinearAlgebra
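If BenchmarkTools or StaticArrays are not installed in your current environment, add them first with the built-in package manager (a one-off setup step):
import Pkg
Pkg.add(["BenchmarkTools", "StaticArrays"])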
Row/Column Major:#
Julia is column-major, which means columns are contiguous in memory. This is the same as MATLAB and Fortran, but NumPy is row-major by default. Remember this fact when looping over data: iterating in the wrong order carries a big performance penalty, as the sketch below illustrates!
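As a quick sketch of the difference (the two functions are identical except for the loop order), the version whose inner loop walks down a column touches contiguous memory and will generally be much faster:
A = rand(1000, 1000)

# Fast: the inner loop runs over rows within a column (contiguous memory)
function sum_column_order(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# Slow: the inner loop jumps between columns (strided memory access)
function sum_row_order(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end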
Benchmarking Code#
Julia is just-in-time (JIT) compiled, so the first time you run code there will be a penalty while the code is compiled before it can run. We can use the @time macro to see this effect. (If you run the cell more than once, the effect will go away since the code has already been compiled!)
x = rand(1000);
function sum_global()
    s = 0.0
    for i in x   # note: `x` is a non-const global variable
        s += i
    end
    return s
end;
@time sum_global();
@time sum_global();
0.009807 seconds (3.77 k allocations: 82.344 KiB, 98.61% compilation time)
0.000122 seconds (3.49 k allocations: 70.156 KiB)
The @time macro is convenient, but it only executes the function once. We can use the BenchmarkTools library to investigate the performance of our function over multiple trials. The @benchmark macro will run 10,000 trials or (roughly) 5 seconds, whichever comes first (both limits can be configured, as shown below). There is also the @btime macro, which runs the same trials as @benchmark but produces output similar to @time.
@benchmark sum_global()
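The trial budget can be tuned with trailing keyword arguments, for example (a sketch of the configurable limits mentioned above):
@benchmark sum_global() samples = 500 seconds = 1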
If your function takes parameters, be sure to "interpolate" them with a $ when using @benchmark. This tells BenchmarkTools to set the parameter up once ahead of time, so the time and allocations required to construct it are not counted in the measurement.
# Parameter is Interpolated
@btime inv($(rand(6,6)));
# No Interpolation
@btime inv(rand(6,6));
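Interpolation works for variables as well as expressions; without the $, the benchmark would also measure the overhead of accessing a non-const global (a small sketch):
A = rand(6, 6)
@btime inv($A);   # $A is evaluated once and treated as a local variable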
Reducing Memory Allocations#
One of the easiest ways to speed up your code is to remove all unnecessary memory allocations. In Python the end-user cannot easily manage memory, and the interpreter or library (e.g., NumPy) is trusted to handle things.
Julia can also operate like Python, where we just trust the compiler to "do the right thing". This is good for prototyping, but can result in inefficient code. Below we will look at the basics of memory allocations and how to reduce them.
Let's start by allocating a small array of 10,000 Float64s. Each Float64 is 8 bytes (64 bits), so the total size should be 80,000 bytes, i.e. 80 kB (78.125 KiB). Note there will be a slight overhead for the internal machinery associated with each allocation (e.g., a pointer to the array).
@time ones(Float64, 100, 100);
0.000013 seconds (3 allocations: 78.211 KiB)
In Julia, all regular arrays are heap-allocated since they are resizable. This means the array data lives further from the CPU and can be slow(er) to allocate and access. The first strategy to mitigate this penalty is simply to allocate memory as little as possible. To check this, we can use BenchmarkTools to understand the allocations in our function.
function allocates_alot()
    res = 0.0
    for i in 1:100
        x = rand(100,100) # ALLOCATION EVERY LOOP ITERATION
        res += sum(x)
    end
    return res
end

function allocates_once()
    res = 0.0
    storage = zeros(100,100)
    for i in 1:100
        # The ! means it mutates the input
        # In this case it overwrites `storage` with new random values
        rand!(storage)
        res += sum(storage)
    end
    return res
end
allocates_once (generic function with 1 method)
Using the @btime macro, we can measure the performance uplift and the number of allocations for each function. You'll find that the function which allocates a new array on each loop iteration is slower and allocates far more memory.
@btime allocates_alot() samples = 1000;
@btime allocates_once() samples = 1000;
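In-place broadcasting assignment is another everyday tool for reusing memory: `.=` writes the (fused) result into existing storage instead of allocating a new array. A small sketch:
y = zeros(100)
z = rand(100)
y .= 2 .* z .+ 1   # fused broadcast written into y, no new array allocated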
A more subtle, yet very common, place where allocations occur is array slices (e.g., arr[1:10]). In Julia, these slices allocate a new array by default. We can get around this by telling Julia that we only want to read the data inside the array, without copying it. To do this we use an "array view", which can be achieved with the view(...) function or the @views macro.
function slices_allocate(x)
    N = length(x) ÷ 2    # integer division, avoids an intermediate Float64
    res = sum(x[1:N])    # x[1:N] copies the slice into a new array
    return res
end

function view_slices(x)
    N = length(x) ÷ 2
    # The view function takes the array
    # and the indices you want a view of.
    res = sum(view(x, 1:N))
    return res
end

function view_macro_slices(x)
    N = length(x) ÷ 2
    # The @views macro tells Julia all array slices in
    # this line should be array views.
    @views res = sum(x[1:N])
    return res
end
view_macro_slices (generic function with 1 method)
Again, we see a large speed increase and a drop in the number of allocations!
x = rand(100000)
@btime slices_allocate(x) samples = 1000;
@btime view_slices(x) samples = 1000;
@btime view_macro_slices(x) samples = 1000;
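One caveat worth knowing (a small sketch, going beyond the read-only use above): a view shares memory with its parent, so writing through a view mutates the original array.
a = collect(1.0:5.0)
av = @view a[1:2]
av .= 0.0   # writes through the view
a           # the first two elements of `a` are now 0.0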
Exercise
Create your own array and define a view into the first 2 elements.
Solution:
arr = [1,2,3,4,5]
arr_v = @views arr[1:2]
arr_v2 = view(arr, 1:2)
StaticArrays#
Earlier, we mentioned that Julia arrays are heap-allocated because they have variable length (i.e., you can append to them). If you know the length of your array beforehand, you can use the StaticArrays library.
StaticArrays has many types you can use, but the main ones are SVector for 1D arrays, SMatrix for 2D arrays, and SArray for n-dimensional arrays (all types have equivalent macros, e.g., @SMatrix).
To initialize an NxN SMatrix, we must provide the size to the constructor: SMatrix{N,N}(...)
m = SMatrix{10,10}(rand(10,10));
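The macro forms mentioned above provide an equivalent, often terser, way to construct static arrays (a small sketch):
m2 = @SMatrix rand(10, 10)     # same as SMatrix{10,10}(rand(10,10))
v = @SVector [1.0, 2.0, 3.0]   # a length-3 static vector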
A common pattern in atomistic simulation packages is to have an array of SVectors which represent atomic attributes, like their positions. Storing the data for each atom as an SVector can be much faster than using the default Julia Vector. This is because the compiler is able to lay the SVectors out next to each other in memory (better locality), whereas with a Vector of Vectors the data will not be stored adjacently in memory.
N_atoms = 1000
static_positions = [SVector{3}(rand(3)) for i in 1:N_atoms];
dynamic_positions = [rand(3) for i in 1:N_atoms];
function distance_matrix(positions)
    N = length(positions)
    dists = zeros(N, N)
    for i in 1:N
        for j in i:N
            dists[i, j] = norm(positions[i] - positions[j])
            dists[j, i] = dists[i, j]
        end
    end
    return dists
end
println("Dynamic Array Benchmark:")
@benchmark distance_matrix(dynamic_positions)
Dynamic Array Benchmark:
println("Static Vector Benchmark:")
@benchmark distance_matrix(static_positions)
Static Vector Benchmark:
Note that SVectors etc. are statically sized and immutable. The StaticArrays package also provides statically sized but mutable variants, which are prefixed with M instead of S, e.g., MMatrix.
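For example (a small sketch, assuming StaticArrays is loaded), the mutable variants can be written to in place while keeping their fixed compile-time size:
mv = MVector{3}(1.0, 2.0, 3.0)
mv[1] = 10.0                    # allowed: MVector is mutable
sv = SVector{3}(1.0, 2.0, 3.0)
# sv[1] = 10.0 would throw an error, since SVector is immutable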
Type Stability#
Julia will do its best to infer the data types of your variables, but if the compiler cannot infer the type of all of them, this leads to type instability and will slow down your code. We can check type stability with the @code_warntype macro. For today's purposes, if this macro prints any red text, your code has unstable types.
In the example below, the functions are equivalent except that one relies on a global variable and the other takes that variable as a parameter. Because of the global variable, the compiler must assume that the variables inside the unstable function have Any type, whereas in the type-stable code the compiler can create a specialized version of the stable function for the specific type you pass in. This is not the only unstable pattern, so if you really care about speed, check your code with @code_warntype.
global_x = rand(Float64, 1000)
function unstable()
    s = 0   # `s` starts as an Int, then becomes a Float64
    for val in global_x
        s = s + val
    end
    return s
end

function stable(x)
    s = zero(eltype(x))   # `s` starts with the element type of `x`
    for val in x
        s = s + val
    end
    return s
end
stable (generic function with 1 method)
@code_warntype unstable()
MethodInstance for unstable()
from unstable() @ Main In[15]:2
Arguments
#self#::Core.Const(Main.unstable)
Locals
@_2::Any
s::Any
val::Any
Body::Any
1 ─       (s = 0)
│   %2  = Main.global_x::Any
│         (@_2 = Base.iterate(%2))
│   %4  = @_2::Any
│   %5  = (%4 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #4 if not %6
2 ┄ %8  = @_2::Any
│         (val = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Any
│   %11 = s::Any
│   %12 = val::Any
│         (s = %11 + %12)
│         (@_2 = Base.iterate(%2, %10))
│   %15 = @_2::Any
│   %16 = (%15 === nothing)::Bool
│   %17 = Base.not_int(%16)::Bool
└──       goto #4 if not %17
3 ─       goto #2
4 ┄ %20 = s::Any
└──       return %20
@code_warntype stable(global_x)
MethodInstance for stable(::Vector{Float64})
from stable(x) @ Main In[15]:10
Arguments
#self#::Core.Const(Main.stable)
x::Vector{Float64}
Locals
@_3::Union{Nothing, Tuple{Float64, Int64}}
s::Float64
val::Float64
Body::Float64
1 ─ %1 = Main.zero::Core.Const(zero)
│ %2 = Main.eltype::Core.Const(eltype)
│ %3 = (%2)(x)::Core.Const(Float64)
│ (s = (%1)(%3))
│ %5 = x::Vector{Float64}
│ (@_3 = Base.iterate(%5))
│ %7 = @_3::Union{Nothing, Tuple{Float64, Int64}}
│ %8 = (%7 === nothing)::Bool
│ %9 = Base.not_int(%8)::Bool
└── goto #4 if not %9
2 ┄ %11 = @_3::Tuple{Float64, Int64}
│ (val = Core.getfield(%11, 1))
│ %13 = Core.getfield(%11, 2)::Int64
│ %14 = s::Float64
│ %15 = val::Float64
│ (s = %14 + %15)
│ (@_3 = Base.iterate(%5, %13))
│ %18 = @_3::Union{Nothing, Tuple{Float64, Int64}}
│ %19 = (%18 === nothing)::Bool
│ %20 = Base.not_int(%19)::Bool
└── goto #4 if not %20
3 ─ goto #2
4 ┄ %23 = s::Float64
└── return %23
@btime unstable();
@btime stable($global_x);
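If you must keep a global, a common fix (a sketch, not benchmarked above) is to declare it const so the compiler knows its type can never change; passing the data as an argument, as stable does, remains the preferred pattern:
const const_x = rand(Float64, 1000)

function now_stable()
    s = 0.0
    for val in const_x   # the type of a `const` global is known to the compiler
        s += val
    end
    return s
end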