Introduction to Programming with OpenMP
---------------------------------------

Practical Exercise 1 (Background and Principles)
------------------------------------------------

This practical just tries out existing code, so there are no specimen
answers.  The multiplication code prints out some checking values, but
you will have to look at the code to see what they mean.  The Cholesky
solver prints out the accuracy of the result, which should be small -
but note that the theoretical accuracy of this problem is the dimension
of the matrix times the machine precision!

You should use the basic commands:

Fortran:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -o Multiply \
        Programs/Multiply.f90 -lblas
    ./Multiply Programs/matrices_f_10

and:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -o Cholesky \
        Programs/Cholesky.f90 -llapack
    ./Cholesky Programs/matrices_f_10

C:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Multiply \
        Programs/Multiply.c -lblas
    ./Multiply Programs/matrices_c_10

and:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Cholesky \
        Programs/Cholesky.c -llapack
    ./Cholesky Programs/matrices_c_10

WARNING: requiring the library flags to be placed after the program
name is a recent lunacy by POSIX/GNU; they were originally required to
be before, and for two decades GNU allowed them in either place.  Some
other systems may still require them before, as in:

    gcc -Wall -Wextra -ansi -pedantic -O3 -o Cholesky \
        -llapack Programs/Cholesky.c

You are not supplied with larger versions of the matrix data, because
they take up too much space, so you must generate them.  This is how
you generate the 1000x1000 size (other sizes are similar):

    gcc -o Generate Programs/Generate.c
    gfortran -o Transform Programs/Transform.f90
    ./Generate 1000 matrices

Question 1
----------

1.1  Just try the above commands to check that they work, then change
Programs/matrices_f_10 to Programs/matrices_f_1000 and
Programs/matrices_c_10 to Programs/matrices_c_1000, and rerun.
Though it is not shown here, the original, unexpanded form of the
Fortran code (as shown in the Fortran course) ran a LOT faster.  It was
expanded into DO-loop form for the purposes of this course, but old
(Fortran 77 style) code is NOT the best way to program nowadays!

1.2  Now try adding optimisation (-O3) and OpenMP (-fopenmp).  If you
are using a compiler with autoparallelisation, add that too, but
neither gfortran nor gcc has it (Intel's compilers do).  That means
that -fopenmp has no effect unless the code includes OpenMP directives.
Notice that it has no effect on the library calls, and (with gfortran
and gcc) produces a negligible improvement except for the Cholesky
solver.  That is mainly because the code is fairly efficient, but a
couple of simple optimisations help a lot.

1.3  Now try using a tuned library (say, -lacml or -lmkl_rt) instead of
-lblas and -llapack, but run the following command first:

    export OMP_NUM_THREADS=1

Notice the vast improvement in the library code's times.  This is
because tuned libraries use blocked algorithms, which optimise cache
usage.

1.4  Now try increasing the number of threads by changing -lacml to
-lacml_mp (if you are using that) and running the following command:

    export OMP_NUM_THREADS=4

Ignore minor variations and discrepancies, as they are normal;
computer timing is always a little unpredictable.  Notice how the
wall-clock time for the tuned library code drops, though the CPU time
does not.  That is your objective when writing and tuning OpenMP code.
Regrettably, this effect will NOT occur on the PWF/MCS/DS, as those
systems have only one or two cores each.

Results
-------

On one computer, I got the following figures.
Multiply                 MATMUL    DGEMM   Fortran        C

Wall-clock time:
    Basic                  1.81     2.22     18.30    17.81
    -O3 -fopenmp           1.81     2.22     16.44    16.42
    Adding -lacml          2.82     0.28     16.50    16.46
    With 4 threads         2.81     0.08     16.43    16.37

LAPACK Solver            LAPACK  Fortran        C

Wall-clock time:
    Basic                  2.28    15.09    12.22
    -O3 -fopenmp           2.24     2.22     2.35
    Adding -lacml          0.31     2.28     2.34
    With 4 threads         0.09     2.21     2.35