MPI Hands-On - mpi4py#

Overview

Questions:

How can I use MPI to parallelize a Python code?

Objectives:

Learn how to prepare an environment that includes mpi4py.
Learn the basics of writing an MPI-parallelized code.
Explore point-to-point and collective MPI operations.

Example 2#

Basic Infrastructure#

We will now do some work with the script in example2.py, which does some simple math with NumPy arrays. Run the code now.

SHELL

$ python example2.py

OUTPUT

Average: 5000001.5

Let’s learn something about which parts of this code account for most of the run time. mpi4py provides a timer, MPI.Wtime(), which returns the current wall-clock time in seconds. We can use this function to determine how long each section of the code takes to run.

For example, to determine how much time is spent initializing array a, do the following:

PYTHON

    # initialize a
    start_time = MPI.Wtime()
    a = np.ones( N )
    end_time = MPI.Wtime()
    if my_rank == 0:
        print("Initialize a time: " + str(end_time-start_time))

As the above code indicates, we don’t really want every rank to print the timings, since that could look messy in the output. Instead, we have only rank 0 print this information. Of course, this requires that we import MPI at the top of the script (from mpi4py import MPI) and add a few lines near the top of the code to query the rank of each process:

PYTHON

    # get basic information about the MPI communicator
    world_comm = MPI.COMM_WORLD
    world_size = world_comm.Get_size()
    my_rank = world_comm.Get_rank()

Also determine and print the timings of each of the other sections of the code: the initialization of array b, the addition of the two arrays, and the final averaging of the result. Your code should look something like this:

PYTHON

from mpi4py import MPI
import numpy as np

if __name__ == "__main__":

    # get basic information about the MPI communicator
    world_comm = MPI.COMM_WORLD
    world_size = world_comm.Get_size()
    my_rank = world_comm.Get_rank()

    N = 10000000

    # initialize a
    start_time = MPI.Wtime()
    a = np.ones( N )
    end_time = MPI.Wtime()
    if my_rank == 0:
        print("Initialize a time: " + str(end_time-start_time))

    # initialize b
    start_time = MPI.Wtime()
    b = np.zeros( N )
    for i in range( N ):
        b[i] = 1.0 + i
    end_time = MPI.Wtime()
    if my_rank == 0:
        print("Initialize b time: " + str(end_time-start_time))

    # add the two arrays
    start_time = MPI.Wtime()
    for i in range( N ):
        a[i] = a[i] + b[i]
    end_time = MPI.Wtime()
    if my_rank == 0:
        print("Add arrays time: " + str(end_time-start_time))

    # average the result
    start_time = MPI.Wtime()
    sum = 0.0
    for i in range( N ):
        sum += a[i]
    average = sum / N
    end_time = MPI.Wtime()
    if my_rank == 0:
        print("Average result time: " + str(end_time-start_time))
        print("Average: " + str(average))

Now run the code again:

SHELL

$ python example2.py

OUTPUT

Initialize a time: 0.03975701332092285
Initialize b time: 1.569957971572876
Add arrays time: 4.173098087310791
Average result time: 2.609341859817505
Average: 5000001.5
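The start/stop pattern repeated in each section above could be factored into a small helper. The following is a minimal sketch, not part of example2.py: the name `timed` and its parameters are hypothetical, and `time.perf_counter` is used as a stand-in clock so the sketch runs without MPI. In the MPI script you would pass `clock=MPI.Wtime` and `report=(my_rank == 0)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, clock=time.perf_counter, report=True, results=None):
    """Time the enclosed block and optionally print the elapsed seconds.

    clock   -- zero-argument function returning seconds as a float
               (stand-in here; MPI.Wtime could be passed instead)
    report  -- print the timing only when True, e.g. (my_rank == 0)
    results -- optional dict that receives label -> elapsed seconds
    """
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if results is not None:
            results[label] = elapsed
        if report:
            print(label + " time: " + str(elapsed))


# Usage: time the initialization of a small stand-in for array b
timings = {}
with timed("Initialize b", results=timings):
    b = [1.0 + i for i in range(100000)]
```

Wrapping each section in a `with` block keeps the timing code out of the way of the arithmetic, at the cost of one extra level of indentation.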

Point-to-Point Communication#

You can try running this on multiple ranks now:

SHELL

$ mpiexec -n 4 python example2.py

OUTPUT

Initialize a time: 0.042365074157714844
Initialize b time: 1.9863519668579102
Add arrays time: 4.9583611488342285
Average result time: 2.9468209743499756
Average: 5000001.5

Running on multiple ranks doesn’t help with the timings, because each rank is duplicating all of the same work. We want the ranks to cooperate on the problem, with each rank working on a different part of the calculation. In this example, that means that different ranks will work on different parts of the arrays a and b, and then the results on each rank will be summed across all the ranks.

We need to decide what parts of the arrays each of the ranks will work on; this is more generally known as a rank’s workload. Add the following code just before the initialization of array a:

PYTHON

    # determine the workload of each rank
    workloads = [ N // world_size for i in range(world_size) ]
    for i in range( N % world_size ):
        workloads[i] += 1
    my_start = 0
    for i in range( my_rank ):
        my_start += workloads[i]
    my_end = my_start + workloads[my_rank]

In the above code, my_start and my_end represent the range over which each rank will perform mathematical operations on the arrays.
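To see what this partitioning produces, the same logic can be run without MPI for every rank at once. This is a sketch only: the function name `split_workload` is hypothetical, and the small values of N and world_size are chosen purely for illustration.

```python
def split_workload(n, world_size):
    """Return one half-open (start, end) range per rank.

    The first (n % world_size) ranks each receive one extra element,
    mirroring the workloads list built in the tutorial code.
    """
    workloads = [n // world_size for _ in range(world_size)]
    for i in range(n % world_size):
        workloads[i] += 1

    ranges = []
    start = 0
    for count in workloads:
        ranges.append((start, start + count))
        start += count
    return ranges


# 10 elements over 4 ranks: the two leftover elements go to ranks 0 and 1
for rank, (my_start, my_end) in enumerate(split_workload(10, 4)):
    print("rank", rank, "loops over range(", my_start, ",", my_end, ")")
```

The ranges are contiguous and non-overlapping, so every index from 0 to N-1 is handled by exactly one rank.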

We’ll start by parallelizing the code that averages the result. Update the range of the for loop in this part of the code to the following:

PYTHON

    for i in range( my_start, my_end ):

This ensures that each rank only sums elements my_start up to, but not including, my_end. We then need the ranks to communicate their individually calculated sums so that we can calculate the global sum. To do this, replace the line average = sum / N with:

PYTHON

    if my_rank == 0:
        world_sum = sum
        for i in range( 1, world_size ):
            sum_np = np.empty( 1 )
            world_comm.Recv( [sum_np, MPI.DOUBLE], source=i, tag=77 )
            world_sum += sum_np[0]
        average = world_sum / N
    else:
        sum_np = np.array( [sum] )
        world_comm.Send( [sum_np, MPI.DOUBLE], dest=0, tag=77 )

The MPI.DOUBLE parameter tells MPI what type of information is being communicated by the Send and Recv calls. In this case, we are sending an array of double precision numbers. If you are communicating information of a different datatype, consult the following:

MPI4Py data type	C data type
`MPI.BYTE`	8 binary digits
`MPI.CHAR`	char
`MPI.UNSIGNED_CHAR`	unsigned char
`MPI.SHORT`	signed short int
`MPI.UNSIGNED_SHORT`	unsigned short int
`MPI.INT`	signed int
`MPI.UNSIGNED`	unsigned int
`MPI.LONG`	signed long int
`MPI.UNSIGNED_LONG`	unsigned long int
`MPI.FLOAT`	float
`MPI.DOUBLE`	double

Now run the code again:

SHELL

$ mpiexec -n 4 python example2.py

OUTPUT

Initialize a time: 0.04637002944946289
Initialize b time: 1.9484930038452148
Add arrays time: 4.914314031600952
Average result time: 0.6889588832855225
Average: 5000001.5

You can see that the amount of time spent calculating the average has indeed gone down.
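The reason the result is unchanged is that the per-rank partial sums add up to the serial sum. The sketch below simulates this on a single process with no MPI calls: a plain list stands in for the values that the other ranks would Send to rank 0, the helper `partial_sum` is hypothetical, and N is kept small and evenly divisible to keep the example short.

```python
def partial_sum(my_start, my_end):
    """Sum a[i] + b[i] over one rank's range, with a[i] = 1 and b[i] = 1 + i."""
    total = 0.0
    for i in range(my_start, my_end):
        total += 1.0 + (1.0 + i)
    return total


N = 1000
world_size = 4
chunk = N // world_size  # assumes N is divisible by world_size

# One partial sum per simulated rank
partial_sums = [partial_sum(r * chunk, (r + 1) * chunk) for r in range(world_size)]

# Rank 0 starts from its own value and adds what each other rank would send
world_sum = partial_sums[0]
for r in range(1, world_size):
    world_sum += partial_sums[r]

average = world_sum / N
print("Average: " + str(average))
```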

Parallelizing the part of the code that adds the two arrays is much easier. All you need to do is update the range over which the for loop iterates:

PYTHON

    for i in range( my_start, my_end ):

Now run the code again:

SHELL

$ mpiexec -n 4 python example2.py

OUTPUT

Initialize a time: 0.04810309410095215
Initialize b time: 2.0196259021759033
Add arrays time: 1.2053139209747314
Average result time: 0.721329927444458
Average: 5000001.5

The array addition time has gone down nicely. Surprisingly enough, the most expensive part of the calculation is now the initialization of array b. Updating the range over which that loop iterates speeds up that part of the calculation:

PYTHON

    for i in range( my_start, my_end ):

SHELL

$ mpiexec -n 4 python example2.py

OUTPUT

Initialize a time: 0.04351997375488281
Initialize b time: 0.503791093826294
Add arrays time: 1.2048840522766113
Average result time: 0.7626049518585205
Average: 5000001.5

Reducing the Memory Footprint#

The simulation is running much faster now thanks to the parallelization we have added. If that’s all we care about, we could stop working on the code now. In reality, though, time is only one resource we should be concerned about. Another resource that is often even more important is memory. The changes we have made to the code make it run faster, but don’t decrease its memory footprint in any way: each rank allocates arrays a and b with N double precision values. That means that each rank allocates 2*N double precision values; across all of our ranks, that corresponds to a total of 2*N*world_size double precision values. Running on more processors might decrease our run time, but it increases our memory footprint!

Of course, there isn’t really a good reason for each rank to allocate the entire arrays of size N, because each rank will only ever use values within the range of my_start to my_end. Let’s modify the code so that each rank allocates a and b to a size of workloads[my_rank].
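The difference between the two allocation strategies can be worked out with a few lines of arithmetic. This is a sketch under the assumption that a double precision value occupies 8 bytes and that only the arrays a and b are counted; the function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8  # assumed size of one double precision value


def replicated_doubles(n, world_size):
    """Total doubles when every rank allocates full-size a and b."""
    return 2 * n * world_size


def distributed_doubles(n, world_size):
    """Total doubles when every rank allocates only its own workload."""
    workloads = [n // world_size for _ in range(world_size)]
    for i in range(n % world_size):
        workloads[i] += 1
    return sum(2 * w for w in workloads)


N = 10000000
for world_size in (1, 4, 16):
    rep_mb = replicated_doubles(N, world_size) * BYTES_PER_DOUBLE / 1e6
    dist_mb = distributed_doubles(N, world_size) * BYTES_PER_DOUBLE / 1e6
    print(world_size, "ranks:", rep_mb, "MB replicated,", dist_mb, "MB distributed")
```

With replicated arrays the total grows linearly with the number of ranks, while with distributed arrays it stays at 2*N doubles regardless of how many ranks are used.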

Replace the initialization of a with:

PYTHON

    a = np.ones( workloads[my_rank] )

Replace the initialization of b with:

PYTHON

    b = np.zeros( workloads[my_rank] )
    for i in range( workloads[my_rank] ):
        b[i] = 1.0 + ( i + my_start )

Change the range of the loops that add and sum the arrays to range( workloads[my_rank] ), since each rank’s arrays are now indexed from 0 rather than from my_start.
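The key detail in this change is the offset: local index i on a rank corresponds to global index i + my_start. The following sketch checks, without MPI and with plain lists in place of NumPy arrays, that the per-rank pieces of b concatenate to the original full-size array. The helper `local_b` and the small N are illustrative assumptions.

```python
def local_b(my_start, count):
    """Build one rank's slice of b using local indices 0..count-1."""
    b = [0.0] * count
    for i in range(count):
        b[i] = 1.0 + (i + my_start)  # i + my_start is the global index
    return b


N = 10
world_size = 4
workloads = [N // world_size + (1 if r < N % world_size else 0)
             for r in range(world_size)]

# Concatenate the slices in rank order
pieces = []
my_start = 0
for r in range(world_size):
    pieces.extend(local_b(my_start, workloads[r]))
    my_start += workloads[r]

# The full-size array as built by the original serial code
global_b = [1.0 + i for i in range(N)]
print(pieces == global_b)
```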

Run the code again:

SHELL

$ mpiexec -n 4 python example2.py

OUTPUT

Initialize a time: 0.009948015213012695
Initialize b time: 0.5988950729370117
Add arrays time: 1.2081310749053955
Average result time: 0.7307591438293457
Average: 5000001.5

Collective Communication#

Previously, we used point-to-point communication (i.e. Send and Recv) to sum the results across all ranks:

PYTHON

    if my_rank == 0:
        world_sum = sum
        for i in range( 1, world_size ):
            sum_np = np.empty( 1 )
            world_comm.Recv( [sum_np, MPI.DOUBLE], source=i, tag=77 )
            world_sum += sum_np[0]
        average = world_sum / N
    else:
        sum_np = np.array( [sum] )
        world_comm.Send( [sum_np, MPI.DOUBLE], dest=0, tag=77 )

MPI provides many collective communication functions, which automate many processes that can be complicated to write out using only point-to-point communication. In particular, the Reduce function allows us to sum a value across all ranks, without all of the above code. Replace the above with:

PYTHON

    sum_np = np.array( [sum] )
    world_sum = np.zeros( 1 )
    world_comm.Reduce( [sum_np, MPI.DOUBLE], [world_sum, MPI.DOUBLE], op = MPI.SUM, root = 0 )
    # world_sum holds the global sum only on the root rank (rank 0)
    average = world_sum[0] / N

The op argument lets us specify what operation should be performed on all of the data that is reduced. Setting this argument to MPI.SUM, as we do above, causes all of the values to be summed onto the root process. There are many other operations provided by MPI, as you can see here:

Operation	Description	Datatype
`MPI.MAX`	maximum	integer,float
`MPI.MIN`	minimum	integer,float
`MPI.SUM`	sum	integer,float
`MPI.PROD`	product	integer,float
`MPI.LAND`	logical AND	integer
`MPI.BAND`	bit-wise AND	integer,MPI_BYTE
`MPI.LOR`	logical OR	integer
`MPI.BOR`	bit-wise OR	integer,MPI_BYTE
`MPI.LXOR`	logical XOR	integer
`MPI.BXOR`	bit-wise XOR	integer,MPI_BYTE
`MPI.MAXLOC`	max value and location	float
`MPI.MINLOC`	min value and location	float
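To get a feel for what the different operations compute, a reduction can be imitated on one process with `functools.reduce`. This is only an analogy: the dictionary `OPS` and the function `simulate_reduce` are hypothetical, the keys are plain strings rather than mpi4py objects, and each list element stands in for the value contributed by one rank.

```python
from functools import reduce
import operator

# Hypothetical mapping from operation names to plain-Python equivalents
OPS = {
    "MPI.MAX": max,
    "MPI.MIN": min,
    "MPI.SUM": operator.add,
    "MPI.PROD": operator.mul,
    "MPI.BAND": operator.and_,
    "MPI.BOR": operator.or_,
    "MPI.BXOR": operator.xor,
}


def simulate_reduce(values, op_name):
    """Combine one value per simulated rank into a single result."""
    return reduce(OPS[op_name], values)


rank_values = [3, 7, 2, 5]  # one integer per simulated rank
for name in OPS:
    print(name, simulate_reduce(rank_values, name))
```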

Note that in addition to enabling us to write simpler-looking code, collective communication operations tend to be faster than what we can achieve by trying to write our own communication operations using point-to-point calls.
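One plausible reason for the speed difference is that a collective can combine values pairwise in a tree, so that many ranks communicate at the same time. The sketch below compares the number of sequential steps for the rank-0 receive loop against a pairwise tree. It is an illustration only: MPI libraries choose their own algorithms internally, and the tree scheme and function names here are assumptions, not a description of any particular implementation.

```python
def linear_reduce_steps(world_size):
    """Rank 0 receives from every other rank, one after another."""
    return world_size - 1


def tree_reduce(values):
    """Sum values pairwise; return (total, number of rounds).

    In each round, the element at index i absorbs the element at
    i + stride. All pairs in a round could communicate concurrently.
    """
    vals = list(values)
    rounds = 0
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        rounds += 1
    return vals[0], rounds


for world_size in (4, 8, 64):
    total, rounds = tree_reduce(range(world_size))
    print(world_size, "ranks:", linear_reduce_steps(world_size),
          "sequential receives vs", rounds, "tree rounds")
```

The number of tree rounds grows with the logarithm of the number of ranks, whereas the receive loop on rank 0 grows linearly.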
