---
html_meta:
  "description lang=en": "This hands-on tutorial on OpenMP provides an overview of using OpenMP to parallelize code. It covers topics such as implementing shared-memory parallelization, identifying and fixing race conditions, and optimizing the performance of OpenMP code. The tutorial includes examples demonstrating the parallelization process step-by-step."
  "keywords": "OpenMP, parallelization, shared-memory, race condition, optimization, performance, code parallelization, compiler directive, pragma, threads, thread synchronization, fork-join model, parallel for loop, reduction operation, first touch"
  "property=og:locale": "en_US"
---

# OpenMP Hands-On

````{admonition} Overview
:class: overview

Questions:

- How can I use OpenMP to parallelize a code?

Objectives:

- Use OpenMP to implement shared-memory parallelization.
- Learn how to identify and fix race conditions.
- Optimize the performance of an OpenMP code.
````

## Example 1

Download the examples folder for this section [here](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/omp/).

If you open [example1.cpp](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/omp/example1/example1.cpp) with a text editor, you will see that it is a simple Hello World code.

``` cpp
#include <stdio.h>

int main() {
    printf("Hello World\n");

    return 0;
}
```

Go ahead and build the code.

````{tab-set-code}
```{code-block} shell
$ cmake .
$ make
```
````

You could run the code just by typing `./example1` into the command line, but there is also a simple script that does the same thing. Run the script now.

````{tab-set-code}
```{code-block} shell
$ ./run.sh
```
````

>Hello World!

We will now turn this into a multi-threaded code. To use OpenMP, you will need to add `#include <omp.h>` to the beginning of your code. Then, you can add OpenMP parallelization through the use of a compiler directive, which is any line that begins with `#pragma` (short for "pragmatic information"). The directive `#pragma omp parallel` tells the compiler that the next block of code should be parallelized using OpenMP. We are going to parallelize the line with the `printf`, so put curly brackets around that line in order to make it a distinct block of code, then put `#pragma omp parallel` before the block. Your code should look like:

``` cpp
#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
    {
        printf("Hello World\n");
    }

    return 0;
}
```

Go ahead and rebuild the code.

````{tab-set-code}
```{code-block} shell
$ cmake .
$ make
```
````

Before running the code, we should also specify how many OpenMP threads we want to run on. This is something that you decide at run time, by setting a special environment variable called `OMP_NUM_THREADS`. Add a line to `run.sh` so that it sets `OMP_NUM_THREADS` to 4:

~~~
export OMP_NUM_THREADS=4
./example1
~~~

Now let's run the code:

````{tab-set-code}
```{code-block} shell
$ ./run.sh
```
````

>Hello World!
>Hello World!
>Hello World!
>Hello World!

The code prints `Hello World!` four times, but obviously this isn't because of any `for` loop. Each of the four OpenMP threads that we specified with `OMP_NUM_THREADS` executes the `printf` call independently.
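If you want to check from inside the program how many threads OpenMP is actually using, you can ask the runtime directly. Here is a minimal standalone sketch (not one of the example files) that uses the standard `omp_get_max_threads()` and `omp_get_num_threads()` calls:

``` cpp
#include <stdio.h>
#include <omp.h>

int main() {
    // Outside a parallel region, only the master thread is running,
    // but we can still ask how many threads a parallel region would use.
    printf("Threads available to OpenMP: %i\n", omp_get_max_threads());

#pragma omp parallel
    {
        // Have exactly one thread report the size of the current team.
#pragma omp single
        printf("Threads in this parallel region: %i\n", omp_get_num_threads());
    }

    return 0;
}
```

With `OMP_NUM_THREADS` set to 4, both numbers should be 4.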
(%i)\n", thread_id); } return 0 } ``` Compile and run the code: ````{tab-set-code} ```{code-block} shell $ cmake . $ make $ ./run.sh ``` ```` >Hello World! (2) >Hello World! (3) >Hello World! (1) >Hello World! (0) Run the code a few more times, and you will find that the order in which the threads print their message isn't always the same. This is because all of the threads run simultaneously, and depending on largely random factors some will finish slightly before others. The order in which the threads finish determines the order in which the messages are printed. Let's play around with these threads some more. We will add another `printf` after the OpenMP-parallelized block, followed by a third `printf` inside a new OpenMP-parallelized block. ``` cpp #include #include int main() { #pragma omp parallel { int thread_id = omp_get_thread_num(); printf("Hello World! (%i)\n", thread_id); } int thread_id = omp_get_thread_num(); printf("This is another message! (%i)\n", thread_id); #pragma omp parallel num_threads(3) { int thread_id = omp_get_thread_num(); printf("Goodbye World! (%i)\n", thread_id); } return 0; } ``` If you compile and run this code, you should get something like >Hello World! (2) >Hello World! (0) >Hello World! (1) >Hello World! (3) >This is another message! (0) >Goodbye World! (0) >Goodbye World! (1) >Goodbye World! (2) The following happens when the code is run: 1. The code begins executing sequentially, with only the "master thread" running. 2. When the master thread encounters an OpenMP-parallelized block, it `forks`, causing a number of threads equal to OMP_NUM_THREADS to begin executing. 3. At the end of the OpenMP-parallelized block, the threads are all `joined` to the master thread, and sequential execution is resumed. 4. The second `printf` is executed by only the master thread. 5. The master thread again forks, but only creates a total of three threads. 6. All of the threads print a new line, and then are joined again. https://en.wikipedia.org/wiki/Fork%E2%80%93join_model#/media/File:Fork_join.svg ## Example 2 We will now work on [Example 2](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/omp/example2). ````{tab-set-code} ```{code-block} shell $ cd ../example2 ``` ```` In this directory is a code that does some simple math on some arrays, and then prints the average of the result: ````{tab-set-code} ```{code-block} shell $ cmake. $ make $ ./run.sh ``` ```` >Average: 500000001.500000 We would like to improve the speed of this calculation through parallelization. Before doing any sort of optimizations on a code, we should learn about which parts of the calculation are the most expensive - otherwise, we might waste our time parallelizing a part of the code that already doesn't take long to run, while ignoring much more costly parts of the code. The most basic information we need is the time it takes to run each different part of the code. To learn this, we will add some extra code to time each calculation. Open `example2.cpp` using a text editor, and add the following to the beginning of `main`: ``` cpp double start_time = omp_get_wtime(); ``` Now add the following to the end of main: ``` cpp printf("Total time: %f\n",omp_get_wtime()-start_time); ``` The code should now output the total about of time required to run: >Average: 500000001.500000 >Total time: 6.816225 Add similar timing statements around each of the `for` loops. 
This should produce output similar to the following:

>Initialize a time: 1.402934
>Initialize b time: 4.077723
>Add arrays time: 0.900899
>Average result time: 0.381398
>
>Average: 500000001.500000
>Total time: 6.781963

Let's start by parallelizing the loop that adds arrays `a` and `b`, which looks something like this (with `c` as the result array and `N` as the array length; the exact names in `example2.cpp` may differ):

``` cpp
for (int i=0; i<N; i++) {
    c[i] = a[i] + b[i];
}
```

One option is to parallelize it the way we parallelized Example 1: wrap the loop in a `#pragma omp parallel` block and use `omp_get_thread_num()` and `omp_get_num_threads()` to assign each thread its own subset of the iterations. That works, and improves the timing of this loop:

>Add arrays time: 0.348737

A much easier way to parallelize this loop is to say:

``` cpp
#pragma omp parallel for
for (int i=0; i<N; i++) {
    c[i] = a[i] + b[i];
}
```

The `parallel for` directive tells the compiler to split the loop iterations across the threads automatically. Rebuilding and rerunning gives essentially the same improvement:

>Add arrays time: 0.328928

Now we will parallelize the loop that averages the result.

``` cpp
#pragma omp parallel for
for (int i=0; i<N; i++) {
    average += c[i];
}
```

>Initialize a time: 1.397873
>Initialize b time: 4.076062
>Add arrays time: 0.342933
>Average result time: 0.123558
>
>Average: 156250000.375000
>Total time: 5.960761

This made the loop finish more quickly, but it also changed the final result. What happened? A strong hint comes from the fact that this result is what we would get if only about 1/4 of the elements contributed to the sum. The problem is that each thread works on its own copy of the `average` variable. We want the loop to calculate the total value of `average`, summed across all threads. This is known as a reduction operation, and we can tell the compiler that this is what we want by adding a `reduction` clause to the `pragma`:

``` cpp
#pragma omp parallel for reduction(+:average)
for (int i=0; i<N; i++) {
    average += c[i];
}
```

Rebuilding and rerunning now gives the correct answer again:

>Initialize a time: 1.383079
>Initialize b time: 4.074402
>Add arrays time: 0.346857
>Average result time: 0.123527
>
>Average: 500000001.500000
>Total time: 5.948352

We've made some nice improvements to a couple of the loops, but the real bottleneck happens when `a` and `b` are initialized. We will now work on the loop that initializes `a`. Start by adding an OpenMP pragma:

``` cpp
#pragma omp parallel for
for (int i=0; i<N; i++) {
    a[i] = ...;
}
```

>Initialize a time: 0.391194
>Initialize b time: 4.439494
>Add arrays time: 0.296057
>Average result time: 0.105363
>
>Average: 500000001.500000
>Total time: 5.252643

Now do the same thing to `b`:

``` cpp
#pragma omp parallel for
for (int i=0; i<N; i++) {
    b[i] = ...;
}
```

>Initialize a time: 0.344843
>Initialize b time: 0.350311
>Add arrays time: 0.188084
>Average result time: 0.085440
>
>Average: 500000001.500000
>Total time: 0.987741

Notice that now even the last two loops are faster. This is because of the principle of `first touch`.
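First touch means that the operating system physically places each page of memory near the thread that writes to it first, so once the initialization loops are parallelized, each thread mostly works on data that already lives close to the core it is running on. The effect only shows up on machines with more than one memory domain (NUMA), such as multi-socket nodes. If you want to see it for yourself, the following standalone sketch (not one of the lesson's example files) times the same parallel sum over two arrays, one touched first by the master thread and one touched first in parallel:

``` cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Standalone sketch: compare a parallel reduction over an array that was
// first touched serially with one that was first touched in parallel.
int main() {
    const int N = 100000000;   // roughly 800 MB per array
    double *serial_touch   = (double*) malloc(N*sizeof(double));
    double *parallel_touch = (double*) malloc(N*sizeof(double));

    // First touch by the master thread only: all pages land near its core.
    for (int i=0; i<N; i++) serial_touch[i] = 1.0;

    // First touch in parallel: each thread's pages land near that thread.
#pragma omp parallel for
    for (int i=0; i<N; i++) parallel_touch[i] = 1.0;

    double sum = 0.0;

    double t = omp_get_wtime();
#pragma omp parallel for reduction(+:sum)
    for (int i=0; i<N; i++) sum += serial_touch[i];
    printf("Serially touched array: %f s\n", omp_get_wtime()-t);

    t = omp_get_wtime();
#pragma omp parallel for reduction(+:sum)
    for (int i=0; i<N; i++) sum += parallel_touch[i];
    printf("Parallel touched array: %f s\n", omp_get_wtime()-t);

    printf("sum = %f\n", sum);   // keep the sums from being optimized away

    free(serial_touch);
    free(parallel_touch);
    return 0;
}
```

On a NUMA machine the second timed loop is typically noticeably faster, even though both loops do identical work.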
## Example 3

Now we will look at [Example 3](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/omp/example3).

````{tab-set-code}
```{code-block} shell
$ cd ../example3
```
````

Compiling and running this code should produce something like the following:

````{tab-set-code}
```{code-block} shell
$ cmake .
$ make
$ ./run.sh
```
````

>Iteration: 999 Energy: 92079.129718 PE: 16253.127101
>
>Timings:
>    Force Zero: 0.000707
>    Force Calc: 15.992926
>    Velocity Update: 0.014128
>    Coords Update: 0.001717
>    Total: 16.014781

Clearly, the most expensive part of the code is the region where the forces are calculated, which consists of a double loop over all particles.

``` cpp
start_loop = omp_get_wtime();

double v = 0.0;
for (int i=0; i < natoms; i++) {
    for (int j=i+1; j < natoms; j++) {
        double dx = coords[j][0] - coords[i][0];
        double dy = coords[j][1] - coords[i][1];
        double dr2 = dx*dx + dy*dy;
        double dr = sqrt(dr2);

        double fx = (dx/dr)*(1.0/dr2);
        double fy = (dy/dr)*(1.0/dr2);

        forces[i][0] -= fx;
        forces[i][1] -= fy;
        forces[j][0] += fx;
        forces[j][1] += fy;

        v += 1.0/dr;
    }
}
*potential = v;
```

One approach to speeding this code up would be to add OpenMP parallelization to the inner loop, like this:

``` cpp
for (int i=0; i < natoms; i++) {
#pragma omp parallel for reduction(+:v)
    for (int j=i+1; j < natoms; j++) {
        ...
    }
}
```

>Iteration: 999 Energy: 286004.735108 PE: 18173.082058
>
>Timings:
>    Force Zero: 0.001293
>    Force Calc: 14.219656
>    Velocity Update: 0.014499
>    Coords Update: 0.005649
>    Total: 14.246469

Unfortunately, this change causes the simulation to no longer produce the same results! To understand why the results are different, look closely at how this loop updates the forces. In the innermost loop is the line `forces[i][0] -= fx;`, which tells the computer "Copy the value of `forces[i][0]` from memory and store it in cache, then subtract `fx` from it, then replace the value of `forces[i][0]` in memory with the result." When we run with multiple threads, different threads may be attempting to update `forces[i][0]` at the same time. If one thread copies the value of `forces[i][0]` into cache, and another thread then updates it in memory while the first thread is doing the subtraction operation, the outcome will be that the first thread overwrites the other thread's update. This scenario is known as a `race condition`, and happens any time two threads try to edit the same data at the same time. If you run this calculation several times, you will get slightly different results each time, depending on the order in which the threads "race" to update the forces.

One way to eliminate the race condition is to switch from looping over "unique pairs" to looping over "all pairs." This way, we don't need to update `forces[i][0]` at all, only `forces[j][0]`. The modified code looks like:

``` cpp
double v = 0.0;
for (int i=0; i < natoms; i++) {
#pragma omp parallel for reduction(+:v)
    for (int j=0; j < natoms; j++) {
        if (i != j) {
            double dx = coords[j][0] - coords[i][0];
            double dy = coords[j][1] - coords[i][1];
            double dr2 = dx*dx + dy*dy;
            double dr = sqrt(dr2);

            double fx = (dx/dr)*(1.0/dr2);
            double fy = (dy/dr)*(1.0/dr2);

            //forces[i][0] -= fx;
            //forces[i][1] -= fy;
            forces[j][0] += fx;
            forces[j][1] += fy;

            v += 0.5/dr;
        }
    }
}
*potential = v;
```

The main differences here are: (1) we changed the starting value of `j` from `i+1` to `0`, (2) we commented out the updates to `forces[i]`, (3) we multiply all contributions to `v` by `0.5` to avoid a double-counting error, and (4) we add an `if` check to avoid the `i == j` case. This code produces the correct result while also being substantially faster.

>Iteration: 999 Energy: 92079.129718 PE: 16253.127101
>
>Timings:
>    Force Zero: 0.000792
>    Force Calc: 7.954821
>    Velocity Update: 0.003412
>    Coords Update: 0.005082
>    Total: 7.969381

That having been said, we can probably do somewhat better. Every time the code enters an OpenMP-threaded region, the master thread must perform a `fork` (and later a `join`). Because of this, there is an overhead cost every time the code encounters an OpenMP-parallelized region; the exact amount of overhead can vary, but it is usually on the order of 1 microsecond.
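If you are curious, you can estimate this overhead on your own machine with a small standalone program (not part of the lesson's example files) that times a large number of parallel regions that do essentially no work:

``` cpp
#include <stdio.h>
#include <omp.h>

// Standalone sketch: estimate fork/join overhead by timing many
// parallel regions that do essentially nothing.
int main() {
    const int nrepeat = 100000;

    double start = omp_get_wtime();
    for (int irep=0; irep < nrepeat; irep++) {
#pragma omp parallel
        {
            // Trivial work; calling an OpenMP routine keeps the region
            // from being optimized away without introducing any races.
            if (omp_get_thread_num() < 0) printf("unreachable\n");
        }
    }
    double elapsed = omp_get_wtime() - start;

    printf("Average time per parallel region: %e seconds\n", elapsed/nrepeat);
    return 0;
}
```

Typical values range from a fraction of a microsecond to a few microseconds, depending on the compiler, the OpenMP runtime, and the number of threads.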
Think for a moment about how much time the code spends each time it enters our OpenMP region. The total amount of time our code spends calculating the forces is 8.0 seconds, and the code enters the OpenMP region a number of times equal to the number of atoms times the number of iterations. This works out to approximately 5 microseconds spent inside the OpenMP region each time it is entered. That isn't ideal - it is likely that a decent amount of that time is just `fork`/`join` overhead. We can improve things by moving the parallelization to the outer loop over `i`.

``` cpp
#pragma omp parallel for reduction(+:v)
for (int i=0; i < natoms; i++) {
    for (int j=0; j < natoms; j++) {
        double dx = coords[j][0] - coords[i][0];
        double dy = coords[j][1] - coords[i][1];
        double dr2 = dx*dx + dy*dy;
        double dr = sqrt(dr2);

        double fx = (dx/dr)*(1.0/dr2);
        double fy = (dy/dr)*(1.0/dr2);

        forces[i][0] -= fx;
        forces[i][1] -= fy;
        //forces[j][0] += fx;
        //forces[j][1] += fy;

        v += 0.5/dr;
    }
}
```

>Iteration: 999 Energy: 92079.129718 PE: 16253.127101
>
>Timings:
>    Force Zero: 0.000737
>    Force Calc: 4.770037
>    Velocity Update: 0.002868
>    Coords Update: 0.005819
>    Total: 4.784812

That definitely helped - apparently the code was spending nearly half of its time doing OpenMP overhead work.

## Example 4

Now we will work on a somewhat more realistic [example](https://github.com/MolSSI-Education/parallel-programming/tree/main/examples/omp/example4) of an MD code.

````{tab-set-code}
```{code-block} shell
$ cd ../example4
$ cmake .
$ make
$ ./run.sh
```
````

>Time step 0.03
>optim: 1.84389e+17 0.1
>optim: 11479.9 0.1
>optim: -10384.3 0.1
>optim: -14405.6 0.1
>optim: -16274.2 0.1
>optim: -17473.3 0.1
>optim: -18386.3 0.1
>optim: -19046.1 0.1
>optim: -19516.6 0.1
>optim: -19881.7 0.1
>optim: -20184.1 0.1
>optim: -20445.1 0.1
>
>    time        ke           pe            e          T       P
>  ------- ----------- ------------ ------------ ------ ------
>     3.00  2937.72894 -18197.02586 -15259.29692  0.371 0.04086690
>     6.00  3426.85950 -18452.47277 -15025.61327  0.408 0.04039200
>     9.00  3603.35315 -18693.26130 -15089.90815  0.441 0.03943282
>    12.00  3615.37003 -19040.76407 -15425.39404  0.436 0.03848593
>    15.00  3686.47202 -19111.90579 -15425.43377  0.453 0.03809054
>    18.00  3702.65533 -19128.06154 -15425.40621  0.465 0.03778621
>    21.00  3812.43640 -19237.88713 -15425.45072  0.472 0.03758076
>    24.00  3849.80094 -19275.24791 -15425.44697  0.480 0.03741216
>    27.00  3962.65623 -19388.13553 -15425.47930  0.492 0.03713681
>    30.00  3973.40391 -19398.90927 -15425.50536  0.495 0.03703538
>times: force=14.85s neigh=17.29s total=32.24s

The code primarily consists of three sections: (1) the `neighbor_list` function, which generates a list of atom pairs that are close enough to be considered interacting, (2) a `forces` function, which calculates all contributions to the forces from the pairs in the neighbor list, and (3) the `md` function, which runs the calculation by calling `neighbor_list` and `forces` and then updating the atomic coordinates each timestep.

Note that the force evaluation is handled via an iterator over the neighbor list:

``` cpp
for (neighT::const_iterator ij=neigh.begin(); ij!=neigh.end(); ++ij) {
    ...
}
```

Loops with iterators are a little awkward to parallelize with OpenMP. The approach we will take is to create a separate neighborlist for each thread.
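Schematically, the pattern we are aiming for looks like the sketch below (this is not the `md.cc` code, just an illustration of the idea): each thread owns one container, fills it with its share of the work, and later iterates over it without ever touching the other threads' containers.

``` cpp
#include <vector>
#include <omp.h>

// Illustration of the per-thread container pattern (not the md.cc code):
// each thread fills and then iterates over only its own std::vector.
void per_thread_work(int nitems) {
    std::vector< std::vector<int> > per_thread(omp_get_max_threads());

#pragma omp parallel
    {
        int ithr = omp_get_thread_num();
        int nthread = omp_get_num_threads();
        std::vector<int>& mine = per_thread[ithr];

        // Fill: each thread takes a different subset of the items.
        for (int i=ithr; i < nitems; i += nthread) {
            mine.push_back(i);
        }

        // Process: iterate over this thread's own container only,
        // so no synchronization is needed.
        for (std::vector<int>::const_iterator it=mine.begin(); it != mine.end(); ++it) {
            // ... do work on *it ...
        }
    }
}
```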
First, define a `thrneighT` type, which will be a vector of neighborlists (add this near the beginning of `md.cc`, with the other typedefs).

``` cpp
typedef std::vector<neighT> thrneighT;
```

Then create a `thrneighT` which will contain all of the neighborlists for all of the threads. Where appropriate, change `neighT` and `neigh` to `thrneighT` and `thrneigh`, respectively. The most important location is the `neighbor_list` function, which now loops over the per-thread neighborlists and fills each one in turn, something like this:

``` cpp
void neighbor_list(const coordT& coords, thrneighT& thrneigh) {
    double start = omp_get_wtime();
    for (int ithr=0; ithr < (int) thrneigh.size(); ithr++) {
        neighT& neigh = thrneigh[ithr];
        neigh.clear();
        ...
    }
    ...
}
```

The same substitution is needed in the signature of `forces`, which now takes the `thrneighT`, and in `md`, where the `thrneighT` is declared and passed to both functions. After making these changes, rebuild and confirm that the results are unchanged:

>    30.00  3973.40320 -19398.90856 -15425.50536  0.495 0.03703538
>times: force=14.50s neigh=18.06s total=32.67s

We are finally ready to work on parallelization of the `forces` subroutine. Replacing the outer `for` loop in `forces` is easy enough:

``` cpp
#pragma omp parallel default(none) shared(f, thrneigh, coords, virial, pe)
{
    int ithr = omp_get_thread_num();
    const neighT& neigh = thrneigh[ithr];
    ...
}
```

Unfortunately, there is one significant problem: the forces are incremented in a way that introduces a race condition.

``` cpp
f[i].first  += dfx;
f[j].first  -= dfx;
f[i].second += dfy;
f[j].second -= dfy;
```

Unlike the case with Example 3, it isn't straightforward to restructure the force evaluation in such a way that the forces on each atom are calculated by only one thread. Instead, we will have each thread calculate some portion of the forces on all of the atoms, then perform a `reduction` operation on the forces. The OpenMP `reduction` clause only works for basic data types; it won't work for our defined type `coordT`. As a result, we must perform the reduction manually.

To reduce the forces, first declare an array that will store the forces for a single thread.

``` cpp
#pragma omp parallel default(none) shared(f, thrneigh, coords, virial, pe)
{
    coordT f_thread(natom,xyT(0.0,0.0));
    double virial_thread = 0.0;
    double pe_thread = 0.0;
    ...
            f_thread[i].first  += dfx;
            f_thread[j].first  -= dfx;
            f_thread[i].second += dfy;
            f_thread[j].second -= dfy;

            pe_thread += vij;
            virial_thread += dfx*dx + dfy*dy;
    ...
        }
    }
}
```

Then, just before the end of the OpenMP block, accumulate each thread's contributions into the shared variables inside a `critical` section, something like the following:

``` cpp
#pragma omp critical
    {
        for (int i=0; i < natom; i++) {
            f[i].first  += f_thread[i].first;
            f[i].second += f_thread[i].second;
        }
        pe += pe_thread;
        virial += virial_thread;
    }
```

The `critical` directive ensures that only one thread at a time adds its partial results into the shared data, so the reduction itself is free of race conditions. Rebuild and run the code; the results are unchanged, and the force calculation is now much faster:

>    30.00  3973.40314 -19398.90851 -15425.50536  0.495 0.03703538
>times: force=7.03s neigh=18.99s total=26.23s

Now let's work on the `neighbor_list` function. Once again, we will convert the `for` loop over `thrneigh` into an OpenMP parallelized region.

``` cpp
#pragma omp parallel default(none) shared(thrneigh, coords)
{
    int ithr = omp_get_thread_num();
    int nthread = omp_get_num_threads();
    ...
}
```

This improves the neighborlist creation time substantially:

>times: force=3.67s neigh=6.65s total=10.49s

Running on 16 threads, we get this:

>times: force=1.58s neigh=2.26s total=4.03s

One issue that limits our performance at high thread counts is the fact that we are not doing anything to `load balance` the neighborlist across threads. We can improve the load balancing by adding some code to ensure that each thread has a neighborlist of similar length. Add the following just before the end of the OpenMP parallelized block in `neighbor_list`:

``` cpp
#pragma omp barrier
#pragma omp single
    {
        //get the target number of pairs per thread
        int npair=0;
        for (int i=0; i < nthread; i++) npair += thrneigh[i].size();
        npair /= nthread;

        //move pairs between threads until each has roughly npair of them
        for (int i=0; i < nthread-1; i++) {
            while ((int) thrneigh[i].size() > npair) {
                //donate extra pairs to the next thread
                if (thrneigh[i].size() == 0) break;
                thrneigh[i+1].push_back(thrneigh[i].back());
                thrneigh[i].pop_back();
            }
        }
    }
```
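This snippet uses two OpenMP constructs that have not appeared in this lesson yet. `#pragma omp barrier` makes every thread wait until all threads in the team have reached that point - here, until every thread has finished building its own neighborlist. `#pragma omp single` causes the block that follows to be executed by exactly one thread, while the other threads wait at the implicit barrier at the end of the block. A minimal standalone sketch of their behavior (not part of `md.cc`):

``` cpp
#include <stdio.h>
#include <omp.h>

// Standalone sketch of the barrier and single constructs used above.
int main() {
#pragma omp parallel
    {
        printf("thread %i: before the barrier\n", omp_get_thread_num());

        // No thread continues past this point until all threads reach it.
#pragma omp barrier

#pragma omp single
        {
            // Exactly one thread executes this block; the others wait at
            // the implicit barrier at the end of the single construct.
            printf("thread %i: doing the single-threaded work\n", omp_get_thread_num());
        }

        printf("thread %i: after the single block\n", omp_get_thread_num());
    }

    return 0;
}
```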
Unfortunately, this really doesn't help our timings:

>times: force=1.59s neigh=2.37s total=4.15s

The problem is that the added code does lots of operations that grow and shrink the neighborlists one element at a time, which is slow for the C++ `list` type they are currently built on, because every insertion allocates a new node. To fix this, change the `neighT` type to be a vector (below, `pairT` stands for the pair type that `neighT` already stores):

````{tab-set-code}
```{code-block} c++
typedef std::vector<pairT> neighT;
```
````

Then, add the following to the `neighbor_list` function, just after the `neigh.clear();` line, so that each thread's vector is allocated once up front rather than being repeatedly reallocated as pairs are appended:

````{tab-set-code}
```{code-block} c++
neigh.reserve(100*natom/nthread);
```
````

These changes allow us to get our best timings yet:

>times: force=1.55s neigh=1.54s total=3.27s

````{admonition} Key Points
:class: key

- It is extremely important to carefully avoid race conditions.
- Achieving good performance with OpenMP often requires larger-scale code refactoring.
````