Introduction to Programming with OpenMP
---------------------------------------

Practical Exercise 2 (Basics and Simple SIMD)
---------------------------------------------

You should normally start with the following commands:

    export OMP_NUM_THREADS=4
    export OMP_SCHEDULE=static

GNU Fortran:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -fopenmp \
        -fcheck=all -ftrapv -ffpe-trap=invalid,zero,overflow \
        -o <program> <program>.f90 -llapack -lblas
    ./<program> Programs/matrices_f_10

NAG Fortran:

    nagfor -openmp -O3 -C=all -o <program> \
        <program>.f90 -llapack -lblas
    ./<program> Programs/matrices_f_10

Note that using '-fcheck=all' or '-C=all' does a lot of runtime
checking that is not available for C or C++, but slows down the code
considerably.  If you want to compare times, or the tests are taking
too long, you should omit those, and possibly '-ftrapv' as well.

GNU C:

    gcc -Wall -Wextra -ansi -pedantic -O3 -fopenmp -ftrapv \
        -o <program> <program>.c -llapack -lblas
    ./<program> Programs/matrices_c_10

GNU C++:

    g++ -Wall -Wextra -ansi -pedantic -O3 -fopenmp -ftrapv \
        -o <program> <program>.cpp -llapack -lblas
    ./<program> Programs/matrices_c_10

WARNING: remember that the libraries may need to go before the
program on some systems.  See under gfortran about '-ftrapv' and
timing.

When you have got the changes working, increase the size of the
matrices to see how your tuning is going, by changing the 10 at the
end of the argument to 1000.  You can also use 100 as an intermediate
step, if that helps.

If you have more cores, you can run with more than 4 threads (i.e.
increase the 4 in the setting of OMP_NUM_THREADS).  That will help to
expose any race conditions you have coded and will make the timing
effects clearer; all of the examples should work with any number of
threads.  Regrettably, you will not see much parallelisation on the
PWF/MCS/DS, as they are all low core-count systems.

Question 1
----------

In this question, declare every variable and array that is used in
any of the loops inside the DO/for directive blocks as either shared
or private: declare all loop indices and temporary variables as
private, and the main arrays as shared.  More details are given on
this in the next lecture.

1.1 Starting with Programs/Multiply.f90 or Programs/Multiply.c, add
directives to parallelise the outer loop in procedure Multiply, using
a combined parallel and DO/for directive.  Make sure that you declare
ALL the variables used as shared or private, even though the defaults
do what you want (a C sketch of 1.1 to 1.3 is given after 1.3).

1.2 Change the program to use the transpose of the first matrix.  In
Fortran, that needs only a TRANSPOSE call adding in the MATMUL, the
first argument of DGEMM changing to 't' and the sections reversing in
the DOT_PRODUCT call.  In C, it needs only the index expressions
reversing in the innermost loop and the first argument of DGEMM
changing to 't'.  Notice that the output shows different values, and
that the coded multiply now runs a lot faster; that is because it is
now more cache-friendly.

1.3 Change the program to split the combined directive into separate
parallel and DO/for ones.  Put the former around the call to Multiply
and leave the latter where it is.  Remember to declare all variables
as shared or private, even if the default is what you want, except
that you will have to leave the shared declarations off the DO/for
directive (see the next lecture for why).  You should notice no
difference in the values or the times.
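As a guide, here is a minimal sketch in C of what 1.1 and 1.2 ask
for.  It is not the course's code: the real Programs/Multiply.c may
use different names, and the matrices are assumed here to be stored
column-major (as DGEMM requires), with element (i,j) of an n-by-n
matrix at index i+j*n.

    void Multiply (double c[], const double a[], const double b[],
                   int n)
    {
        int i, j, k;
        double sum;

    /* 1.1: the combined directive; default(none) makes the compiler
       reject any variable whose sharing is not declared explicitly. */
    #pragma omp parallel for default(none) shared(a, b, c, n) \
        private(i, j, k, sum)
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j) {
                sum = 0.0;
                for (k = 0; k < n; ++k)
                    sum += a[i + k * n] * b[k + j * n];
                /* 1.2: changing the 'a' term to a[k + i * n] uses
                   the transpose of the first matrix, so both
                   operands are then read with stride one. */
                c[i + j * n] = sum;
            }
    }

For 1.3, the loop would keep only a plain DO/for directive, such as
'#pragma omp for private(i, j, k, sum)', and the parallel directive,
with its shared declarations, would go around the call to Multiply.
The shared clause is not permitted on a DO/for directive, which is
why it has to be left off there.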
1.4 This example shows some of the effects of including serial code
in parallel blocks, and why it is easier to use the combined
directives.  Change the program to move the parallel directive to
outside the whole timing code (i.e. from just before the first call
to Times to just after the last call to Check).  Don't bother with
declaring anything as shared, as that is the default here.

Doing that may well go horribly wrong, and is always incorrect.  Note
that, even when this example appears to work, an excessive number of
lines of output appear; it is equally likely to crash in an
unpredictable way.
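To see why, here is a small self-contained demonstration of the
failure mode.  It is not the course's program: the timing and
checking calls are replaced by printf calls for illustration.  The
point is that everything inside a parallel construct but outside a
worksharing directive is executed by every thread.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        int count = 0;

    #pragma omp parallel default(none) shared(count)
        {
            int i;    /* declared inside the region, so private */

            /* These statements are inside the parallel construct but
               outside any DO/for directive, so EVERY thread executes
               them: the message is printed once per thread, and the
               unsynchronised update of 'count' is a data race. */
            printf ("Timing started\n");
            ++count;

    #pragma omp for
            for (i = 0; i < 8; ++i)
                printf ("Iteration %d on thread %d\n",
                        i, omp_get_thread_num ());
        }

        printf ("count = %d, though 1 was intended\n", count);
        return 0;
    }

With 4 threads, "Timing started" appears four times, which is the
excessive output mentioned above, and the final value of count is
not guaranteed, because of the race.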