MPI Hands-On - C++#
Overview
Questions:
How can I use MPI to parallelize a compiled code?
Objectives:
Compile and run C++ codes that are parallelized using MPI.
Use proper MPI error handling.
Learn how to use non-blocking communication methods.
Use a debugger with a parallelized code.
Example 1#
Writing Hello World#
We’ll start with the first example in mpi/hello, which is a simple Hello World code:
#include <iostream>
int main(int argc, char **argv) {
std::cout << "Hello World!" << std::endl;
return 0;
}
Acquire a copy of the example files for this lesson, and then compile and run the example code:
$ git clone git@github.com:MolSSI-Education/parallel-programming.git
$ cd parallel-programming/examples/mpi/hello
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpifort ..
$ make
$ ./hello
Hello World!
Getting Started with MPI#
Let’s try running this code on multiple processes.
This is done using the mpiexec command.
Many environments also provide an mpirun command, which usually - but not always - works the same way.
Whenever possible, you should use mpiexec and not mpirun, in order to guarantee more consistent results.
MPI - mpiexec vs mpirun
MPI stands for 'message passing interface' and is a message passing standard designed to work on a variety of parallel computing architectures. The MPI standard defines the syntax and semantics of a library of routines. There are a number of implementations of this standard, including OpenMPI, MPICH, and MS MPI.
The primary difference between mpiexec and mpirun is that mpiexec is defined as part of the MPI standard, while mpirun is not. Different implementations of MPI (i.e. OpenMPI, MPICH, MS MPI, etc.) are not guaranteed to implement mpirun, or might implement different options for mpirun. Technically, the MPI standard doesn't actually require that MPI implementations provide mpiexec either, but the standard does at least describe guidelines for how mpiexec should work. Because of this, mpiexec is generally the preferred command.
The general format for launching a code on multiple processes is:
$ mpiexec -n <number_of_processes> <command_to_launch_code>
For example, to launch hello
on 4 processes, do:
$ mpiexec -n 4 ./hello
Hello World!
Hello World!
Hello World!
Hello World!
When you execute the above command, mpiexec
launches 4 different instances of ./hello
simultaneously, each of which prints "Hello World!".
Typically, as long as you have at least 4 processors on the machine you are running on, each process will be launched on a different processor; however, certain environment variables and optional arguments to mpiexec
can change this behavior.
Each process runs the code in hello
independently of the others.
It might not be obvious yet, but the processes mpiexec
launches aren’t completely unaware of one another.
mpiexec adds each of the processes to an MPI communicator, which enables the processes to send information to and receive information from one another via MPI.
The MPI communicator that spans all of the processes launched by mpiexec
is called MPI_COMM_WORLD
.
We can use the MPI library to get some information about the MPI_COMM_WORLD
communicator and the processes within it.
Edit hello.cpp
so that it reads as follows:
#include <iostream>
#include <mpi.h>
int main(int argc, char **argv) {
// Initialize MPI
// This must always be called before any other MPI functions
MPI_Init(&argc, &argv);
// Get the number of processes in MPI_COMM_WORLD
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of this process in MPI_COMM_WORLD
int my_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// Print out information about MPI_COMM_WORLD
std::cout << "World Size: " << world_size << " Rank: " << my_rank << std::endl;
// Finalize MPI
// This must always be called after all other MPI functions
MPI_Finalize();
return 0;
}
Recompile the code:
$ make
In the above code we first include the MPI library header, mpi.h
.
Then, we call MPI_Init()
.
This function must be called before any other MPI functions, and is typically one of the first lines of an MPI-parallelized code.
Then, we call MPI_Comm_size()
to get the number of processes in MPI_COMM_WORLD
, which corresponds to the number of processes launched whenever mpiexec
is executed at the command line.
Each of these processes is assigned a unique rank, which is an integer that ranges from 0
to world_size - 1
.
The rank of a process allows it to be identified whenever processes communicate with one another.
For example, in some cases we might want rank 2 to send some information to rank 4, or we might want rank 0 to receive information from all of the other processes.
Calling MPI_Comm_rank()
returns the rank of the process calling it within MPI_COMM_WORLD
.
Go ahead and run the code now:
$ mpiexec -n 4 ./hello
World Size: 4 Rank: 1
World Size: 4 Rank: 0
World Size: 4 Rank: 2
World Size: 4 Rank: 3
As you can see, the MPI_Comm_size()
function outputs 4, which is the total number of ranks we told mpiexec
to run with (through the -n
argument).
Each of the processes is assigned a rank in the range of 0 to 3.
As you can see, the ranks don’t necessarily print out their messages in order; whichever rank completes the cout
first will print out its message first.
If you run the code again, the ranks are likely to print their messages in a different order:
World Size: 4 Rank: 2
World Size: 4 Rank: 0
World Size: 4 Rank: 3
World Size: 4 Rank: 1
You can also try rerunning with a different value for the -n
mpiexec
argument.
For example:
$ mpiexec -n 2 ./hello
World Size: 2 Rank: 0
World Size: 2 Rank: 1
Error Handling with MPI#
If an error forces an MPI program to exit, it should never just call return
or exit
.
This is because calling return
or exit
might terminate one of the MPI processes, but leave others running (but not doing anything productive) indefinitely.
If you're not careful, you could waste massive amounts of computational resources running a failed calculation you thought had terminated.
Instead, MPI-parallelized codes should call MPI_Abort()
when something goes wrong, as this function will ensure all MPI processes terminate.
The MPI_Abort
function takes two arguments: the first is the communicator corresponding to the set of MPI processes to terminate (this should generally be MPI_COMM_WORLD
), while the second is an error code that will be returned to the environment.
It is also useful to keep in mind that most MPI functions have a return value that indicates whether the function completed successfully.
If this value is equal to MPI_SUCCESS
, the function was executed successfully; otherwise, the function call failed.
By default, MPI functions automatically abort if they encounter an error, so you’ll only ever get a return value of MPI_SUCCESS
.
If you want to handle MPI errors yourself, you can call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN) (older codes use the now-deprecated MPI_Errhandler_set); if you do this, you must check the return value of every MPI function and call MPI_Abort if it is not equal to MPI_SUCCESS.
For example, when initializing MPI, you might do the following:
if (MPI_Init(&argc,&argv) != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);
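Putting these pieces together, a minimal sketch of this style of error handling might look like the following (the error codes 1 and 2 are arbitrary choices):
#include <iostream>
#include <mpi.h>
int main(int argc, char **argv) {
// If MPI cannot even initialize, abort immediately
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);
// Ask MPI to return error codes to the caller instead of aborting automatically
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
// From this point on, check the return value of every MPI call
int world_size;
if (MPI_Comm_size(MPI_COMM_WORLD, &world_size) != MPI_SUCCESS) {
std::cerr << "MPI_Comm_size failed" << std::endl;
MPI_Abort(MPI_COMM_WORLD, 2);
}
MPI_Finalize();
return 0;
}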
Example 2#
Basic Infrastructure#
We will now do some work with the example in examples/mpi/average, which does some simple math. Run the code now.
$ cd parallel-programming/examples/mpi/average
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpifort ..
$ make
$ ./average
Average: 100000001.5
Let’s learn something about which parts of this code account for most of the run time.
MPI provides a timer, MPI_Wtime()
, which returns the current walltime.
We can use this function to determine how long each section of the code takes to run.
For example, to determine how much time is spent initializing array a
, do the following:
// Initialize a
double start_time = MPI_Wtime();
double *a = new double[N];
for (int i=0; i<N; i++) {
a[i] = 1.0;
}
double end_time = MPI_Wtime();
if (my_rank == 0 ) {
std::cout << "Initialize a time: " << end_time - start_time << std::endl;
}
As the above code indicates, we don’t really want every rank to print the timings, since that could look messy in the output. Instead, we have only rank 0 print this information. Of course, this requires that we add a few lines near the top of the code to initialize MPI and query the rank of each process:
// Initialize MPI
int world_size, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
Also determine and print the timings of each of the other sections of the code: the initialization of array b, the addition of the two arrays, and the final averaging of the result.
Your code should look something like this:
#include <iostream>
#include <mpi.h>
int main(int argc, char **argv) {
// Initialize MPI
int world_size, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int N = 200000000;
// Initialize a
double start_time = MPI_Wtime();
double *a = new double[N];
for (int i=0; i<N; i++) {
a[i] = 1.0;
}
double end_time = MPI_Wtime();
if (my_rank == 0 ) {
std::cout << "Initialize a time: " << end_time - start_time << std::endl;
}
// Initialize b
start_time = MPI_Wtime();
double *b = new double[N];
for (int i=0; i<N; i++) {
b[i] = 1.0 + double(i);
}
end_time = MPI_Wtime();
if (my_rank == 0 ) {
std::cout << "Initialize b time: " << end_time - start_time << std::endl;
}
// Add the two arrays
start_time = MPI_Wtime();
for (int i=0; i<N; i++) {
a[i] = a[i] + b[i];
}
end_time = MPI_Wtime();
if (my_rank == 0 ) {
std::cout << "Add arrays time: " << end_time - start_time << std::endl;
}
// Average the result
start_time = MPI_Wtime();
double average = 0.0;
for (int i=0; i<N; i++) {
average += a[i] / double(N);
}
end_time = MPI_Wtime();
if (my_rank == 0 ) {
std::cout << "Average result time: " << end_time - start_time << std::endl;
}
std::cout.precision(12);
if (my_rank == 0 ) {
std::cout << "Average: " << average << std::endl;
}
delete [] a;
delete [] b;
MPI_Finalize();
return 0;
}
Now compile and run the code again:
$ make
$ ./average
Initialize a time: 0.544075
Initialize b time: 0.624939
Add arrays time: 0.258915
Average result time: 0.266418
Average: 100000001.5
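This result is easy to verify by hand: element i of the summed array is a[i] + b[i] = 1.0 + (1.0 + i) = 2.0 + i, so the average over N = 200000000 elements is 2.0 + (N - 1)/2 = 100000001.5, matching the output above.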
Point-to-Point Communication#
You can try running this on multiple ranks now:
$ mpiexec -n 4 ./average
Initialize a time: 0.640894
Initialize b time: 0.893775
Add arrays time: 1.38309
Average result time: 0.330192
Average: 100000001.5
Running on multiple ranks doesn’t help with the timings, because each rank is duplicating all of the same work.
In some ways, running on multiple ranks makes the timings worse, because all of the processes are forced to compete for the same computational resources.
Memory bandwidth in particular is likely a serious problem due to the extremely large arrays that must be accessed and manipulated by each process.
We want the ranks to cooperate on the problem, with each rank working on a different part of the calculation.
In this example, that means that different ranks will work on different parts of the arrays a
and b
, and then the results on each rank will be summed across all the ranks.
In this section, we will handle the details of the communication between processes using point-to-point communication.
Point-to-point communication involves cases in which a code explicitly instructs one specific process to send/receive information to/from another specific process.
The primary functions associated with this approach are MPI_Send() and MPI_Recv(), which take the following arguments (a short usage sketch follows the two argument lists):
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
buf — pointer to the start of the buffer being sent
count — number of elements to send
datatype — MPI data type of each element
dest — rank of destination process
tag — message tag
comm — the communicator to use
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
buf — pointer to the start of the buffer to receive the message
count — maximum number of elements the buffer can hold
datatype — MPI data type of each element
source — rank of source process (MPI_ANY_SOURCE matches any process)
tag — message tag (integer >= 0; MPI_ANY_TAG matches any tag)
comm — the communicator to use
status — pointer to the structure in which to store the status of the receive
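As a small illustration (separate from the example code in this lesson), here is a minimal sketch of how one rank might pass a single value to another; it assumes my_rank has already been set with MPI_Comm_rank(), that at least two processes are running, and that the value 3.14 and the tag 0 are arbitrary choices:
double value = 3.14;
if (my_rank == 0) {
// Rank 0 sends one double to rank 1 with tag 0
MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
}
else if (my_rank == 1) {
// Rank 1 receives the matching message from rank 0
MPI_Status status;
MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
std::cout << "Rank 1 received " << value << std::endl;
}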
We need to decide what parts of the arrays each of the ranks will work on; this is more generally known as a rank’s workload.
Add the following code just before the initialization of array a
:
// Determine the workload of each rank
int workloads[world_size];
for (int i=0; i<world_size; i++) {
workloads[i] = N / world_size;
if ( i < N % world_size ) workloads[i]++;
}
int my_start = 0;
for (int i=0; i<my_rank; i++) {
my_start += workloads[i];
}
int my_end = my_start + workloads[my_rank];
In the above code, my_start
and my_end
represent the range over which each rank will perform mathematical operations on the arrays.
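For example, with N = 10 and world_size = 4, the workloads are {3, 3, 2, 2}: rank 0 handles elements 0 through 2, rank 1 handles 3 through 5, rank 2 handles 6 and 7, and rank 3 handles 8 and 9.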
We’ll start by parallelizing the code that averages the result.
Update the range of the for
loop in this part of the code to the following:
for (int i=my_start; i<my_end; i++) {
This will ensure that each rank is only calculating elements my_start
through my_end
of the sum.
We then need the ranks to communicate their individually calculated sums so that we can calculate the global sum.
To do this, add the following immediately after the end of the for
loop:
if ( my_rank == 0 ) {
for (int i=1; i<world_size; i++) {
double partial_average;
MPI_Status status;
MPI_Recv( &partial_average, 1, MPI_DOUBLE, i, 77, MPI_COMM_WORLD, &status );
average += partial_average;
}
}
else {
MPI_Send( &average, 1, MPI_DOUBLE, 0, 77, MPI_COMM_WORLD );
}
The MPI_DOUBLE
parameter tells MPI what type of information is being communicated by the Send
and Recv
calls.
In this case, we are sending a single double precision number.
If you are communicating information of a different datatype, consult the following:
| MPI data type | C data type |
|---|---|
| MPI_BYTE | 8 binary digits |
| MPI_CHAR | char |
| MPI_UNSIGNED_CHAR | unsigned char |
| MPI_SHORT | signed short int |
| MPI_UNSIGNED_SHORT | unsigned short int |
| MPI_INT | signed int |
| MPI_UNSIGNED | unsigned int |
| MPI_LONG | signed long int |
| MPI_UNSIGNED_LONG | unsigned long int |
| MPI_FLOAT | float |
| MPI_DOUBLE | double |
| etc. | |
| MPI_PACKED | define your own with MPI_Pack() |
Now compile and run the code again:
$ make
$ mpiexec -n 4 ./average
Initialize a time: 0.63251
Initialize b time: 1.31379
Add arrays time: 1.89099
Average result time: 0.100575
Average: 100000001.5
You can see that the amount of time spent calculating the average has indeed gone down.
Parallelizing the part of the code that adds the two arrays is much easier.
All you need to do is update the range over which the for
loop iterates:
for (int i=my_start; i<my_end; i++) {
Now compile and run the code again:
$ make
$ mpiexec -n 4 ./average
Initialize a time: 0.636685
Initialize b time: 1.66542
Add arrays time: 0.466888
Average result time: 0.0871116
Average: 100000001.5
The array addition time has gone down nicely.
Surprisingly enough, the most expensive part of the calculation is now the initialization of the arrays a
and b
.
Updating the range over which those loops iterate speeds up those parts of the calculation:
// Initialize a
for (int i=my_start; i<my_end; i++) {
...
// Initialize b
for (int i=my_start; i<my_end; i++) {
$ make
$ ./average
Initialize a time: 0.159471
Initialize b time: 0.183946
Add arrays time: 0.193497
Average result time: 0.0847806
Average: 100000001.5
Reducing the Memory Footprint#
The simulation is running much faster now thanks to the parallelization we have added.
If that’s all we care about, we could stop working on the code now.
In reality, though, time is only one resource we should be concerned about.
Another resource that is often even more important is memory.
The changes we have made to the code make it run faster, but don’t decrease its memory footprint in any way: each rank allocates arrays a
and b
with N
double precision values.
That means that each rank allocates 2*N
double precision values; across all of our ranks, that corresponds to a total of 2*N*world_size
double precision values.
Running on more processors might decrease our run time, but it increases our memory footprint!
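For N = 200000000, that is 2 * 200000000 * 8 bytes = 3.2 GB per rank, so running on 4 ranks requires roughly 12.8 GB in total.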
Of course, there isn’t really a good reason for each rank to allocate the entire arrays of size N
, because each rank will only ever use values within the range of my_start
to my_end
.
Let’s modify the code so that each rank allocates a
and b
to a size of workloads[my_rank]
.
Replace the initialization of a
with:
double *a = new double[ workloads[my_rank] ];
for (int i=0; i<workloads[my_rank]; i++) {
a[i] = 1.0;
}
Replace the initialization of b
with:
double *b = new double[ workloads[my_rank] ];
for (int i=0; i<workloads[my_rank]; i++) {
b[i] = 1.0 + double(i + my_start);
}
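Note the i + my_start offset in the initialization of b: each rank now stores only its own slice of the array, so the local index i must be shifted by my_start to generate the same values as the serial code.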
Replace the range of the loops that add and average the arrays to for (int i=0; i<workloads[my_rank]; i++)
.
Now compile and run the code again:
$ make
$ ./average
Initialize a time: 0.16013
Initialize b time: 0.176896
Add arrays time: 0.190774
Average result time: 0.0871552
Average: 100000001.5
Collective Communication#
Previously, we used point-to-point communication (i.e. MPI_Send
and MPI_Recv
) to sum the results across all ranks:
if ( my_rank == 0 ) {
for (int i=1; i<world_size; i++) {
double partial_average;
MPI_Status status;
MPI_Recv( &partial_average, 1, MPI_DOUBLE, i, 77, MPI_COMM_WORLD, &status );
average += partial_average;
}
}
else {
MPI_Send( &average, 1, MPI_DOUBLE, 0, 77, MPI_COMM_WORLD );
}
MPI provides many collective communication functions, which automate communication patterns that would be complicated to write out using only point-to-point communication.
One particularly useful collective communication function is MPI_Reduce()
:
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
MPI_Op op, int root, MPI_Comm comm)
sendbuf — address of send buffer
recvbuf — address of receive buffer
count — number of elements in send buffer
datatype — MPI data type of each element
op — reduce operation
root — rank of root process
comm — the communicator to use
Possible values for op
are:
| Operation | Description | Datatype |
|---|---|---|
| MPI_MAX | maximum | integer, float |
| MPI_MIN | minimum | integer, float |
| MPI_SUM | sum | integer, float |
| MPI_PROD | product | integer, float |
| MPI_LAND | logical AND | integer |
| MPI_BAND | bit-wise AND | integer, MPI_BYTE |
| MPI_LOR | logical OR | integer |
| MPI_BOR | bit-wise OR | integer, MPI_BYTE |
| MPI_LXOR | logical XOR | integer |
| MPI_BXOR | bit-wise XOR | integer, MPI_BYTE |
| MPI_MAXLOC | max value and location | float |
| MPI_MINLOC | min value and location | float |
We will use the MPI_Reduce()
function to sum a value across all ranks, without all of the point-to-point communication code we needed earlier.
Replace all of your point-to-point communication code above with:
double partial_average = average;
MPI_Reduce(&partial_average, &average, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
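The copy into partial_average is needed because MPI_Reduce does not allow its send and receive buffers to overlap; alternatively, the root rank can pass MPI_IN_PLACE as its send buffer and reduce directly into average.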
Compiling and running with this change should produce the same results as before.
Note that in addition to enabling us to write simpler-looking code, collective communication operations tend to be faster than what we can achieve by trying to write our own communication operations using point-to-point calls.
Example 3#
Next, view the example in examples/mpi/mc, which is a simple Monte Carlo simulation. Compile and run the code now.
$ cd parallel-programming/examples/mpi/mc
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpifort ..
$ make
$ ./mc
...
497000 -6.28643
498000 -6.28989
499000 -5.96743
500000 -6.06861
Total simulation time: 2.59121
Energy time: 2.47059
Decision time: 0.0425119
As you can see, the code already has some timings, and the vast majority of time is spent in the calls to ‘get_particle_energy()’. That is where we will focus our parallelization efforts.
The function in question is:
double get_particle_energy(double *coordinates, int particle_count, double box_length, int i_particle, double cutoff2) {
double e_total = 0.0;
double *i_position = &coordinates[3*i_particle];
for (int j_particle=0; j_particle < particle_count; j_particle++) {
if ( i_particle != j_particle ) {
double *j_position = &coordinates[3*j_particle];
double rij2 = minimum_image_distance( i_position, j_position, box_length );
if ( rij2 < cutoff2 ) {
e_total += lennard_jones_potential(rij2);
}
}
}
return e_total;
}
This looks like it should be fairly straightforward to parallelize: it consists of a single for
loop which just sums the interaction energies of particle pairs.
To parallelize this loop, we need each rank to compute the interaction energies of a subset of these pairs, and then sum the energy across all ranks.
The get_particle_energy()
function is going to need to know some basic information about the MPI communicator, so add the MPI communicator to its parameters:
double get_particle_energy(double *coordinates, int particle_count, double box_length, int i_particle, double cutoff2, MPI_Comm comm) {
Now update the two times get_particle_energy()
is called by main
:
double current_energy = get_particle_energy( coordinates, num_particles, box_length, i_particle, simulation_cutoff2, world_comm );
...
double proposed_energy = get_particle_energy( coordinates, num_particles, box_length, i_particle, simulation_cutoff2, world_comm );
Place the following at the beginning of get_particle_energy()
:
// Get information about the MPI communicator
int my_rank, world_size;
MPI_Comm_size(comm, &world_size);
MPI_Comm_rank(comm, &my_rank);
Change the for
loop in get_particle_energy
to the following:
for (int j_particle=my_rank; j_particle < particle_count; j_particle += world_size) {
The above code will cause each rank to iterate over particles with a stride of world_size
and an initial offset of my_rank
.
For example, if you run on 4 ranks, rank 0 will iterate over particles 0, 4, 8, 12, etc., while rank 1 will iterate over particles 1, 5, 9, 13, etc.
We then need to sum the energies across all ranks.
Replace the line return e_total;
with the following:
// Sum the energy across all ranks
double e_summed = 0.0;
MPI_Reduce(&e_total, &e_summed, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
return e_summed;
Try to run it in parallel now:
$ make
$ mpiexec -n 4 ./mc
16000 -2.19516e+19
17000 -2.19517e+19
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 81043 RUNNING AT Taylors-MacBook-Pro.local
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
That doesn’t seem right at all. What went wrong?
Our call to MPI_Reduce
causes the energies to be summed onto rank 0, but none of the other ranks have the summed energies.
To have the energies reduced to all of the ranks, replace the MPI_Reduce
call with a call to MPI_Allreduce
:
MPI_Allreduce(&e_total, &e_summed, 1, MPI_DOUBLE, MPI_SUM, comm);
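MPI_Allreduce behaves like an MPI_Reduce followed by a broadcast of the result: every rank, not just the root, ends up with the summed energy, which is also why the root argument is no longer needed.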
$ make
$ mpiexec -n 4 ./mc
497000 -6.28644
498000 -6.2899
499000 -5.96744
500000 -6.06862
Total simulation time: 8.38658
Energy time: 8.24201
Decision time: 0.0563176
This is better, but we certainly aren’t getting good timings.
Before we work on the timings problem, let’s try another experiment.
Near the top of mc.cpp
is some code that initializes the random number generator with a random seed.
Currently, the random number generator is being initialized with a fixed random seed of 1
:
// Initialize the random number generator with a pre-defined seed
std::mt19937 mt(1);
// Initialize the random number generator with a random seed
//std::random_device rd;
//std::mt19937 mt(rd());
Let’s try switching to using a random number to initialize the random number generator:
// Initialize the random number generator with a pre-defined seed
//std::mt19937 mt(1);
// Initialize the random number generator with a random seed
std::random_device rd;
std::mt19937 mt(rd());
Now recompile and run again:
$ make
$ mpiexec -n 4 ./mc
497000 -7.9279e+12
498000 -7.9279e+12
499000 -7.9279e+12
500000 -7.9279e+12
Total simulation time: 8.73895
Energy time: 8.59179
Decision time: 0.0549728
These are some crazy, unphysical numbers!
If you try running again with a single process (mpiexec -n 1 ./mc
), you can confirm that the code gives much more reasonable energies when running in serial.
The problem is that on each iteration, the coordinates are updated by randomly displacing one of the particles.
Each rank independently selects a random particle to displace and a random displacement vector.
Instead of contributing to the calculation of the same particle’s interaction energies for the same nuclear configuration, each rank ends up calculating some of the interaction energies for different atoms and different coordinates.
This leads to utter chaos throughout the simulation.
You might be tempted to fix this by generating the random number generator seed on a single process, and then sending that information to the other processes, so that every process is using the same seed. Although that might sound reasonable, it would still leave open the possibility that different processes could end up diverging over the course of a long simulation (remember, computers aren't infinitely accurate - slight discrepancies are guaranteed to happen, given enough time).
To fix this, we will have rank 0 be the only rank that randomly selects a particle or a displacement vector. Rank 0 will then broadcast all necessary information to the other ranks, so that they keep in sync.
Replace the line where the coordinates are initially generated (where generate_initial_state
is called) with this:
if ( my_rank == 0 ) {
generate_initial_state(num_particles, box_length, coordinates);
}
MPI_Bcast(coordinates, 3*num_particles, MPI_DOUBLE, 0, world_comm);
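MPI_Bcast() copies the contents of a buffer from one root rank to every other rank in the communicator; its arguments are:
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
buffer — pointer to the start of the buffer to broadcast (filled on the root, overwritten on all other ranks)
count — number of elements in the buffer
datatype — MPI data type of each element
root — rank of the process whose data is broadcast
comm — the communicator to use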
At the beginning of the for
loop in main
you will see the following code:
// Beginning of main MC iterative loop
n_trials = 0;
for (int i_step=0; i_step<n_steps; i_step++) {
n_trials += 1;
int i_particle = floor( double(num_particles) * dist(mt) );
double random_displacement[3];
for (int i=0; i<3; i++) {
random_displacement[i] = ( ( 2.0 * dist(mt) ) - 1.0 ) * max_displacement;
}
Replace the above with the following:
// Beginning of main MC iterative loop
n_trials = 0;
for (int i_step=0; i_step<n_steps; i_step++) {
int i_particle;
double random_displacement[3];
if ( my_rank == 0 ) {
n_trials += 1;
i_particle = floor( double(num_particles) * dist(mt) );
for (int i=0; i<3; i++) {
random_displacement[i] = ( ( 2.0 * dist(mt) ) - 1.0 ) * max_displacement;
}
}
MPI_Bcast(&i_particle, 1, MPI_INT, 0, world_comm);
MPI_Bcast(coordinates, 3*num_particles, MPI_DOUBLE, 0, world_comm);
MPI_Bcast(random_displacement, 3, MPI_DOUBLE, 0, world_comm);
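Broadcasting i_particle, the coordinates, and the displacement vector from rank 0 at the start of every iteration ensures that all of the ranks evaluate the energy of exactly the same trial move, which is what keeps them in sync.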
At the end of the for
loop in main
is the following code:
// test whether to accept or reject this step
double start_decision_time = MPI_Wtime();
double delta_e = proposed_energy - current_energy;
bool accept = accept_or_reject(delta_e, beta);
if (accept) {
total_pair_energy += delta_e;
n_accept += 1;
}
else {
// revert the position of the test particle
for (int i=0; i<3; i++) {
coordinates[3*i_particle + i] -= random_displacement[i];
coordinates[3*i_particle + i] -= box_length * round(coordinates[3*i_particle + i] / box_length);
}
}
double total_energy = (total_pair_energy + tail_correction) / double(num_particles);
energy_array[i_step] = total_energy;
if ( (i_step+1) % freq == 0 ) {
if ( my_rank == 0 ) {
std::cout << i_step + 1 << " " << energy_array[i_step] << std::endl;
}
if ( tune_displacement ) {
max_displacement = adjust_displacement(n_trials, n_accept, max_displacement);
n_trials = 0;
n_accept = 0;
}
}
total_decision_time += MPI_Wtime() - start_decision_time;
Replace the above with the following, so that only rank 0
is executing it:
if ( my_rank == 0 ) {
// test whether to accept or reject this step
double start_decision_time = MPI_Wtime();
double delta_e = proposed_energy - current_energy;
bool accept = accept_or_reject(delta_e, beta);
if (accept) {
total_pair_energy += delta_e;
n_accept += 1;
}
else {
// revert the position of the test particle
for (int i=0; i<3; i++) {
coordinates[3*i_particle + i] -= random_displacement[i];
coordinates[3*i_particle + i] -= box_length * round(coordinates[3*i_particle + i] / box_length);
}
}
double total_energy = (total_pair_energy + tail_correction) / double(num_particles);
energy_array[i_step] = total_energy;
if ( (i_step+1) % freq == 0 ) {
if ( my_rank == 0 ) {
std::cout << i_step + 1 << " " << energy_array[i_step] << std::endl;
}
if ( tune_displacement ) {
max_displacement = adjust_displacement(n_trials, n_accept, max_displacement);
n_trials = 0;
n_accept = 0;
}
}
total_decision_time += MPI_Wtime() - start_decision_time;
}
Recompile and rerun the code.
$ make
$ mpiexec -n 4 ./mc
497000 -6.06666
498000 -6.10058
499000 -5.98052
500000 -5.95301
Total simulation time: 16.7881
Energy time: 15.9948
Decision time: 0.0690625
This time the energies are much more consistent with what we expect; however, our timings are considerably worse than when we were only running on a single process!
This is because the system we are running these calculations on is extremely small.
If you check the system parameters (under the Parameter setup
comment in mc.cpp
), you will see that this calculation only involves 100 particles.
The amount of work required to compute the energy of 100
Lennard-Jones particles is actually smaller than the amount of overhead associated with the extra MPI processes.
Let’s make the simulation somewhat larger:
int n_steps = 100000;
int freq = 1000;
int num_particles = 10000;
Recompile and rerun the code on a single core.
$ make
$ ./mc
97000 612.067
98000 609.113
99000 603.538
100000 599.461
Total simulation time: 41.0191
Energy time: 39.9748
Decision time: 0.011933
Finally, run the calculation in parallel.
$ mpiexec -n 4 ./mc
97000 99.1126
98000 93.454
99000 91.1246
100000 87.397
Total simulation time: 22.2873
Energy time: 15.4661
Decision time: 0.0175401
Now we can clearly see a speedup when running in parallel.
Key Points
Where possible, use collective communication operations instead of point-to-point communication for improved efficiency and simplicity.
Intelligent design choices can help you reduce the memory footprint required by MPI-parallelized codes.