---
html_meta:
  "description lang=en": "A detailed guide on how to use MPI in C++ for parallelizing code, handling errors, using non-blocking communication methods, and debugging parallelized code. Ideal for developers and programmers involved in high-performance computing. Three hands-on examples including parallelization of a Monte Carlo simulation code."
  "keywords": "Parallel Programming, MPI, Python, mpi4py, MPI-parallelized code, Message Passing Interface, MPI Standard, MPI operations, NumPy, MPI Timer, MPI Communicator, MPI Ranks, OpenMPI, MPICH, MS MPI, mpiexec, mpirun, monte carlo simulation"
  "property=og:locale": "en_US"
---

# MPI Hands-On - C++

````{admonition} Overview
:class: overview

Questions:
- How can I use MPI to parallelize a compiled code?

Objectives:
- Compile and run C++ codes that are parallelized using MPI.
- Use proper MPI error handling.
- Learn how to use non-blocking communication methods.
- Use a debugger with a parallelized code.
````

## 1. Example 1

### Writing Hello World

We'll start with the first example in [mpi/hello](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/mpi/hello), which is a simple Hello World code:

````{tab-set-code}

```{code-block} cpp
#include <iostream>

int main(int argc, char **argv)
{
    std::cout << "Hello World!" << std::endl;

    return 0;
}
```
````

Acquire a copy of the example files for this lesson, and then compile and run the example code:

````{tab-set-code}

```{code-block} shell
$ git clone git@github.com:MolSSI-Education/parallel-programming.git
$ cd parallel-programming/examples/mpi/hello
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpifort ..
$ make
$ ./hello
```
````

````{tab-set-code}

```{code-block} output
Hello World!
```
````

### Getting Started with MPI

Let's try running this code on multiple processes. This is done using the `mpiexec` command. Many environments also provide an `mpirun` command, which usually - but not always - works the same way. Whenever possible, you should use `mpiexec` and not `mpirun`, in order to guarantee more consistent results.

> ## MPI - `mpiexec` vs `mpirun`
> MPI stands for **'message passing interface'** and is a message passing standard which is designed to work on a variety of parallel computing architectures. The [MPI standard](https://www.mpi-forum.org/docs/drafts/mpi-2018-draft-report.pdf) defines the syntax and semantics of a library of routines. There are a number of implementations of this standard, including OpenMPI, MPICH, and MS MPI.
>
> The primary difference between `mpiexec` and `mpirun` is that `mpiexec` is defined as part of the MPI standard, while `mpirun` is not. Different implementations of MPI (i.e. OpenMPI, MPICH, MS MPI, etc.) are not guaranteed to implement `mpirun`, or might implement different options for `mpirun`. Technically, the MPI standard doesn't actually require that MPI implementations provide `mpiexec` either, but the standard does at least describe guidelines for how `mpiexec` should work. Because of this, `mpiexec` is generally the preferred command.

The general format for launching a code on multiple processes is:

````{tab-set-code}

```{code-block} shell
$ mpiexec -n <number_of_processes> <command_to_run>
```
````

For example, to launch `hello` on 4 processes, do:

````{tab-set-code}

```{code-block} shell
$ mpiexec -n 4 ./hello
```
````

````{tab-set-code}

```{code-block} output
Hello World!
Hello World!
Hello World!
Hello World!
```
````
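Before looking at what `mpiexec` actually did here, it can be instructive to confirm that these messages come from separate operating-system processes. One way to do that (a small variation on `hello.cpp` that is not part of the repository, and that assumes a POSIX system where `getpid()` is available from `<unistd.h>`) is to have each copy print its process ID alongside the greeting:

````{tab-set-code}

```{code-block} cpp
#include <iostream>
#include <unistd.h>   // getpid(); assumes a POSIX system

int main(int argc, char **argv)
{
    // Each copy launched by mpiexec is an ordinary OS process,
    // so each one reports a different process ID.
    std::cout << "Hello World from process " << getpid() << "!" << std::endl;

    return 0;
}
```
````

Rebuilding and running this variant with `mpiexec -n 4` should print four different process IDs, one per copy of the program.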
When you execute `mpiexec -n 4 ./hello`, `mpiexec` launches 4 different instances of `./hello` simultaneously, each of which prints "Hello World!". Typically, as long as you have at least 4 processors on the machine you are running on, each process will be launched on a different processor; however, certain environment variables and optional arguments to `mpiexec` can change this behavior. Each process runs the code in `hello` independently of the others.

It might not be obvious yet, but the processes `mpiexec` launches aren't completely unaware of one another. `mpiexec` adds each of the processes to an MPI communicator, which enables the processes to send information to and receive information from one another via MPI. The MPI communicator that spans all of the processes launched by `mpiexec` is called `MPI_COMM_WORLD`.

We can use the MPI library to get some information about the `MPI_COMM_WORLD` communicator and the processes within it. Edit `hello.cpp` so that it reads as follows:

````{tab-set-code}

```{code-block} cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
    // Initialize MPI
    // This must always be called before any other MPI functions
    MPI_Init(&argc, &argv);

    // Get the number of processes in MPI_COMM_WORLD
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of this process in MPI_COMM_WORLD
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Print out information about MPI_COMM_WORLD
    std::cout << "World Size: " << world_size << " Rank: " << my_rank << std::endl;

    // Finalize MPI
    // This must always be called after all other MPI functions
    MPI_Finalize();

    return 0;
}
```
````

Recompile the code:

````{tab-set-code}

```{code-block} shell
$ make
```
````

In the above code we first include the MPI library header, `mpi.h`. Then, we call `MPI_Init()`. This function **must** be called before any other MPI functions, and is typically one of the first lines of an MPI-parallelized code. Then, we call `MPI_Comm_size()` to get the number of processes in `MPI_COMM_WORLD`, which corresponds to the number of processes launched whenever `mpiexec` is executed at the command line.

Each of these processes is assigned a unique rank, which is an integer that ranges from `0` to `world_size - 1`. The rank of a process allows it to be identified whenever processes communicate with one another. For example, in some cases we might want rank 2 to send some information to rank 4, or we might want rank 0 to receive information from all of the other processes. Calling `MPI_Comm_rank()` returns the rank of the process calling it within `MPI_COMM_WORLD`.

Go ahead and run the code now:

````{tab-set-code}

```{code-block} shell
$ mpiexec -n 4 ./hello
```
````

````{tab-set-code}

```{code-block} output
World Size: 4 Rank: 1
World Size: 4 Rank: 0
World Size: 4 Rank: 2
World Size: 4 Rank: 3
```
````

As you can see, `MPI_Comm_size()` reports 4 processes, which is the total number of ranks we told `mpiexec` to run with (through the `-n` argument). Each of the processes is assigned a rank in the range of 0 to 3. Notice also that the ranks don't necessarily print out their messages in order; whichever rank completes the `cout` first will print out its message first. If you run the code again, the ranks are likely to print their messages in a different order:

````{tab-set-code}

```{code-block} output
World Size: 4 Rank: 2
World Size: 4 Rank: 0
World Size: 4 Rank: 3
World Size: 4 Rank: 1
```
````
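As an aside, if you ever want the ranks to print in a predictable order - while debugging, for example - one common trick (shown here as a standalone sketch, not something the repository's example does) is to let the ranks take turns, separated by calls to `MPI_Barrier()`, which makes every rank wait until all ranks have reached it:

````{tab-set-code}

```{code-block} cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_size, my_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Let each rank take its turn printing, separated by barriers.
    // This serializes the output, so use it for debugging only.
    for (int turn = 0; turn < world_size; turn++)
    {
        if (my_rank == turn) {
            std::cout << "World Size: " << world_size << " Rank: " << my_rank << std::endl;
        }
        // No rank moves on to the next turn until every rank reaches the barrier
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```
````

Even with the barriers, the MPI standard does not guarantee how output from different ranks is forwarded to your terminal, so treat this as a debugging convenience rather than something to rely on.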
You can also try rerunning with a different value for the `-n` `mpiexec` argument. For example:

````{tab-set-code}

```{code-block} shell
$ mpiexec -n 2 ./hello
```
````

````{tab-set-code}

```{code-block} output
World Size: 2 Rank: 0
World Size: 2 Rank: 1
```
````

### Error Handling with MPI

If an error forces an MPI program to exit, it should **never** just call `return` or `exit`. This is because calling `return` or `exit` might terminate one of the MPI processes while leaving others running - but doing nothing productive - indefinitely. If you're not careful, you could waste massive amounts of computational resources running a failed calculation you thought had terminated. Instead, MPI-parallelized codes should call `MPI_Abort()` when something goes wrong, as this function ensures that all MPI processes terminate. The `MPI_Abort()` function takes two arguments: the first is the communicator corresponding to the set of MPI processes to terminate (this should generally be `MPI_COMM_WORLD`), while the second is an error code that will be returned to the environment.

It is also useful to keep in mind that most MPI functions have a return value that indicates whether the function completed successfully. If this value is equal to `MPI_SUCCESS`, the function executed successfully; otherwise, the function call failed. By default, MPI functions automatically abort if they encounter an error, so you'll only ever get a return value of `MPI_SUCCESS`. If you want to handle MPI errors yourself, you can call `MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN)`; if you do this, you must check the return value of every MPI function and call `MPI_Abort()` if it is not equal to `MPI_SUCCESS`. For example, when initializing MPI, you might do the following:

````{tab-set-code}

```{code-block} cpp
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);
```
````

## 2. Example 2

### Basic Infrastructure

We will now do some work with the example in [examples/mpi/average](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/mpi/average), which does some simple math. Run the code now:

````{tab-set-code}

```{code-block} shell
$ cd parallel-programming/examples/mpi/average
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpifort ..
$ make
$ ./average
```
````

````{tab-set-code}

```{code-block} output
Average: 100000001.5
```
````

Let's learn something about which parts of this code account for most of the run time. MPI provides a timer, `MPI_Wtime()`, which returns the current walltime. We can use this function to determine how long each section of the code takes to run. For example, to determine how much time is spent initializing array `a`, do the following:

````{tab-set-code}

```{code-block} cpp
// Initialize a
double start_time = MPI_Wtime();
double *a = new double[N];
for (int i=0; i<N; i++) {
    a[i] = 1.0;
}
double end_time = MPI_Wtime();
std::cout << "Time to initialize a: " << end_time - start_time << std::endl;
```
````

Of course, `MPI_Wtime()` is an MPI function, so we first need to add the same basic MPI infrastructure we used in the Hello World example. Modify the beginning of the code so that it reads as follows:

````{tab-set-code}

```{code-block} cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
    // Initialize MPI
    int world_size, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int N = 200000000;

    // Initialize a
    double start_time = MPI_Wtime();
    double *a = new double[N];
    for (int i=0; i<N; i++) {
        a[i] = 1.0;
    }
    double end_time = MPI_Wtime();
    std::cout << "Time to initialize a: " << end_time - start_time << std::endl;
```
````

Timing the individual sections in this way shows where the run time goes. To reduce that run time, we can parallelize the work by having each rank operate on a different part of the arrays. Once the work is divided, the ranks need a way to pass their partial results to one another. The most basic way to do this in MPI is point-to-point communication, in which one rank calls `MPI_Send()` and another rank receives the message with a matching call to `MPI_Recv()`. The arguments to `MPI_Recv()` are:

* `buf` --- the buffer in which to receive the message
* `count` --- the maximum number of elements to receive (must be `>= 0`)
* `datatype` --- the datatype of each element in the receive buffer
* `source` --- the rank of the sending process --- `MPI_ANY_SOURCE` matches any source
* `tag` --- the tag of the message to receive (must be `>= 0`) --- `MPI_ANY_TAG` matches any tag
* `comm` --- the communicator to use
* `status` --- pointer to the structure in which to store status
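To see how these arguments fit together in practice, here is a minimal standalone sketch (not part of the `average` example; the array contents and tag value are arbitrary choices) in which rank 0 sends a small array of doubles to rank 1:

````{tab-set-code}

```{code-block} cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    const int count = 3;
    double data[count];

    if (my_rank == 0) {
        // Fill the buffer and send it to rank 1 with tag 0
        for (int i=0; i<count; i++) data[i] = 10.0 * i;
        MPI_Send(data, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (my_rank == 1) {
        // Receive up to 'count' doubles from rank 0 with tag 0
        MPI_Status status;
        MPI_Recv(data, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        std::cout << "Rank 1 received: " << data[0] << " " << data[1] << " " << data[2] << std::endl;
    }

    MPI_Finalize();
    return 0;
}
```
````

This needs at least two ranks to run (for example, `mpiexec -n 2`). Note that the `count` passed to `MPI_Recv()` is the maximum number of elements the buffer can hold; the number actually received can be queried from the `status` structure with `MPI_Get_count()`.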
We need to decide what parts of the arrays each of the ranks will work on; this is more generally known as a rank's workload. Add the following code just before the initialization of array `a`:

````{tab-set-code}

```{code-block} cpp
// Determine the workload of each rank
int workloads[world_size];
for (int i=0; i<world_size; i++) {
    workloads[i] = N / world_size;
    if (i < N % world_size) workloads[i]++;
}
```
````
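Later steps will use these workloads to decide which loop iterations each rank executes. As a rough sketch of the idea (the variable names `my_start` and `my_end` are illustrative choices, not necessarily what the full example uses), each rank can convert the workload array into a half-open index range and then loop only over its own slice:

````{tab-set-code}

```{code-block} cpp
// Convert the workloads into this rank's index range [my_start, my_end)
int my_start = 0;
for (int i=0; i<my_rank; i++) {
    my_start += workloads[i];
}
int my_end = my_start + workloads[my_rank];

// This rank would then only touch its own slice of the array, e.g.:
// for (int i=my_start; i<my_end; i++) { a[i] = 1.0; }
```
````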