Introduction to Programming with OpenMP
---------------------------------------

Practical Exercise 2 (Basics and Simple SIMD)
---------------------------------------------

You should normally start with the following commands:

    export OMP_NUM_THREADS=4
    export OMP_SCHEDULE=static

GNU Fortran:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -fopenmp \
        -fcheck=all -ftrapv -ffpe-trap=invalid,zero,overflow \
        -o <program> <program>.f90 -llapack -lblas
    ./<program> Programs/matrices_f_10

NAG Fortran:

    nagfor -openmp -O3 -C=all -o <program> \
        <program>.f90 -llapack -lblas
    ./<program> Programs/matrices_f_10

Note that using '-fcheck=all' or '-C=all' does a lot of runtime
checking that is not available for C or C++, but slows down the code
considerably.  If you want to compare times, or the tests are taking
too long, you should omit those, and possibly '-ftrapv' as well.

GNU C:

    gcc -Wall -Wextra -ansi -pedantic -O3 -fopenmp -ftrapv \
        -o <program> <program>.c -llapack -lblas
    ./<program> Programs/matrices_c_10

GNU C++:

    g++ -Wall -Wextra -ansi -pedantic -O3 -fopenmp -ftrapv \
        -o <program> <program>.cpp -llapack -lblas
    ./<program> Programs/matrices_c_10

WARNING: remember that the libraries may need to go before the
program on some systems.  See under gfortran about '-ftrapv' and
timing.

When you have got the changes working, increase the size of the
matrices to see how your tuning is going, by changing the 10 at the
end of the argument to 1000.  You can also use 100 as an intermediate
step, if that helps.

If you have more cores, you can run with more than 4 threads (i.e.
increase the 4 in the setting of OMP_NUM_THREADS).  That will help to
expose any race conditions you have coded and will make the timing
effects clearer; all of the examples should work with any number of
threads.  Regrettably, you will not see much parallelisation on the
PWF/MCS/DS, as they are all low core-count systems.

Question 1
----------

In this question, declare every variable and array that is used in
any of the loops inside the DO/for directive blocks as either shared
or private: declare all loop indices and temporary variables as
private, and the main arrays as shared.  More details are given on
this in the next lecture.

1.1 Starting with Programs/Multiply.f90 or Programs/Multiply.c, add
directives to parallelise the outer loop in procedure Multiply, using
a combined parallel and DO/for directive.  Make sure that you declare
ALL the variables used as shared or private, even though the defaults
do what you want (a C sketch of 1.1 to 1.3 is given after 1.3).

1.2 Change the program to use the transpose of the first matrix.  In
Fortran, that needs only a TRANSPOSE call adding in the MATMUL, the
first argument of DGEMM changing to 't' and the sections reversing in
the DOT_PRODUCT call.  In C, it needs only the index expressions
reversing in the innermost loop and the first argument of DGEMM
changing to 't'.  Notice that the output shows different values, and
that the coded multiply now runs a lot faster; that is because it is
now more cache-friendly.

1.3 Change the program to split the combined directive into separate
parallel and DO/for ones.  Put the former around the call to Multiply
and leave the latter where it is.  Remember to declare all variables
as shared or private, even if the default is what you want, except
that you will have to leave the shared declarations off the DO/for
directive (see the next lecture for why).  You should notice no
difference in the values or the times.
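As a guide, here is a minimal sketch in C of what 1.1 and 1.2 ask
for.  It is not the course's code: the real Programs/Multiply.c may
use different names, and the matrices are assumed here to be stored
column-major (as DGEMM requires), with element (i,j) of an n-by-n
matrix at index i+j*n.

    void Multiply (double c[], const double a[], const double b[],
                   int n)
    {
        int i, j, k;
        double sum;

    /* 1.1: the combined directive; default(none) makes the compiler
       reject any variable whose sharing is not declared explicitly. */
    #pragma omp parallel for default(none) shared(a, b, c, n) \
        private(i, j, k, sum)
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j) {
                sum = 0.0;
                for (k = 0; k < n; ++k)
                    sum += a[i + k * n] * b[k + j * n];
                /* 1.2: changing the 'a' term to a[k + i * n] uses
                   the transpose of the first matrix, so both
                   operands are then read with stride one. */
                c[i + j * n] = sum;
            }
    }

For 1.3, the loop would keep only a plain DO/for directive, such as
'#pragma omp for private(i, j, k, sum)', and the parallel directive,
with its shared declarations, would go around the call to Multiply.
The shared clause is not permitted on a DO/for directive, which is
why it has to be left off there.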
1.4 This example shows some of the effects of including serial code
in parallel blocks, and why it is easier to use the combined
directives.  Change the program to move the parallel directive to
outside the whole timing code (i.e. from just before the first call
to Times to just after the last call to Check).  Don't bother with
declaring anything as shared, as that is the default here.

Doing that may well go horribly wrong, and is always incorrect.  Note
that, even when this example appears to work, an excessive number of
lines of output appear; it is equally likely to crash in an
unpredictable way.
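To see why, here is a small self-contained demonstration of the
failure mode.  It is not the course's program: the timing and
checking calls are replaced by printf calls for illustration.  The
point is that everything inside a parallel construct but outside a
worksharing directive is executed by every thread.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        int count = 0;

    #pragma omp parallel default(none) shared(count)
        {
            int i;    /* declared inside the region, so private */

            /* These statements are inside the parallel construct but
               outside any DO/for directive, so EVERY thread executes
               them: the message is printed once per thread, and the
               unsynchronised update of 'count' is a data race. */
            printf ("Timing started\n");
            ++count;

    #pragma omp for
            for (i = 0; i < 8; ++i)
                printf ("Iteration %d on thread %d\n",
                        i, omp_get_thread_num ());
        }

        printf ("count = %d, though 1 was intended\n", count);
        return 0;
    }

With 4 threads, "Timing started" appears four times, which is the
excessive output mentioned above, and the final value of count is
not guaranteed, because of the race.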