Introduction to Programming with OpenMP
---------------------------------------

Practical Exercise 1 (Background and Principles)
------------------------------------------------

This practical just tries out existing code, so there are no specimen
answers.  The multiplication code prints out some checking values, but
you will have to look at the code to see what they mean.  The Cholesky
solver prints out the accuracy of the result, which should be small -
but note that the theoretical accuracy of this problem is the dimension
of the matrix times the machine precision!

You should use the basic commands:

Fortran:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -o Multiply \
        Programs/Multiply.f90 -lblas
    ./Multiply Programs/matrices_f_10

and:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -o Cholesky \
        Programs/Cholesky.f90 -llapack
    ./Cholesky Programs/matrices_f_10

C:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Multiply \
        Programs/Multiply.c -lblas
    ./Multiply Programs/matrices_c_10

and:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Cholesky \
        Programs/Cholesky.c -llapack
    ./Cholesky Programs/matrices_c_10

WARNING: requiring the library flags to be placed after the program
name is a recent lunacy by POSIX/GNU; they were originally required to
be before, and for two decades GNU allowed them in either place.  Some
other systems may still require them before, as in:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Cholesky \
        -llapack Programs/Cholesky.c

You are not supplied with larger versions of the matrix data, because
they take up too much space, so you must generate them.  This is how
you generate the 1000x1000 size (other sizes are similar):

    gcc -o Generate Programs/Generate.c
    gfortran -o Transform Programs/Transform.f90
    ./Generate 1000 matrices

Question 1
----------

1.1  Just try the above commands to check that they work, then change
Programs/matrices_f_10 to Programs/matrices_f_1000 and
Programs/matrices_c_10 to Programs/matrices_c_1000, and rerun.
Though it is not shown here, the original, unexpanded form of the
Fortran code (as shown in the Fortran course) ran a LOT faster.  It was
expanded into DO-loop form for the purposes of this course, but old
(Fortran 77 style) code is NOT the best way to program nowadays!

1.2  Now try adding optimisation (-O3) and OpenMP (-fopenmp).  If you
are using a compiler with autoparallelisation, add that too, but
neither gfortran nor gcc has it (Intel's compilers do).  That means
that -fopenmp has no effect unless the code includes OpenMP directives.
Notice that it has no effect on the library calls, and (with gfortran
and gcc) produces a negligible improvement except for the Cholesky
solver.  That is mainly because the code is fairly efficient, but a
couple of simple optimisations help a lot.

1.3  Now try using a tuned library (say, -lacml or -lmkl_rt) instead of
-lblas and -llapack, but run the following command first:

    export OMP_NUM_THREADS=1

Notice the vast improvement in the library code's times.  This is
because tuned libraries use blocked algorithms, which optimise cache
usage.

1.4  Now try increasing the number of threads by changing -lacml to
-lacml_mp (if you are using that) and running the following command:

    export OMP_NUM_THREADS=4

Ignore minor variations and discrepancies, as they are normal;
computer timing is always a little unpredictable.  Notice how the
wall-clock time for the tuned library code drops, though the CPU time
does not.  That is your objective when writing and tuning OpenMP code.
Regrettably, this effect will NOT occur on the PWF/MCS/DS, as those
systems have only one or two cores each.

Results
-------

On one computer, I got the following figures.
Multiply                 MATMUL    DGEMM   Fortran        C

Wall-clock time:
    Basic                  1.81     2.22     18.30    17.81
    -O3 -fopenmp           1.81     2.22     16.44    16.42
    Adding -lacml          2.82     0.28     16.50    16.46
    With 4 threads         2.81     0.08     16.43    16.37

LAPACK Solver            LAPACK  Fortran        C

Wall-clock time:
    Basic                  2.28    15.09    12.22
    -O3 -fopenmp           2.24     2.22     2.35
    Adding -lacml          0.31     2.28     2.34
    With 4 threads         0.09     2.21     2.35