Introduction to Programming with MPI
------------------------------------

Practical Exercise 10 (Problem Decomposition)
---------------------------------------------

For general instructions, see the introduction to the collective practicals.

These exercises are at about the same level of difficulty as converting some
actual scientific codes to using MPI.  You may prefer to leave these until
after the next lecture, as the examples for that are much simpler, and one of
them uses a technique which may help with the first example here.

The first is a realistic example of how to distribute data in an
embarrassingly parallel problem, and is easier than it looks.  The second is
a realistic example of how to distribute data for things like finite element
analysis and PDEs, and most of the coding and tuning techniques are similar,
though of course the actual calculation is trivial.  It is a lot harder than
it looks, and the difficulty is not in using the MPI facilities; handling
data boundaries correctly and efficiently is probably the hardest part of
distributed memory parallelism.  Do not underestimate the difficulty of the
second practical, as it is very much harder than all of the others.

For now, regard the root process as process zero.

Question 1
----------

1.1 In the first lecture, you saw the results of a program that calculated
the area of the Mandelbrot set.  That program is in
"Programs/mandelbrot.f90", and C and C++ versions are in
"Programs/mandelbrot.c" and "Programs/mandelbrot.cpp".  Change that program
to divide the problem up into small squares of NxN points.  You will need to
read in an extra parameter, indicating the number of points on the side of a
square (i.e. N).  Use the root process for all I/O.

The root process should distribute one square to each process other than
itself, and each other process should return the area of the Mandelbrot set
within that square.  As each process returns, the root process should give
it another square to do, and there should be a special 'square' that
indicates there is no more work to do.  Use blocking transfers and MPI_Probe
(in the root process only), and do no calculation on the root process.

Experiment with the size of the squares to see how the performance varies.
On one system, with 7 CPUs, using an N value of about 50 was optimal, and
gave a speedup of 5.4 times.  There are several ways to improve that to
approaching 7, but most of them are outside the scope of this course.
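What follows is a minimal sketch, in C, of the kind of farm loop that 1.1
asks for; it is not the specimen answer.  The routine and variable names
(root_loop, worker_loop, square_area, STOP) are purely illustrative,
square_area is assumed to exist elsewhere, and error checking is omitted.
The root hands one square to each worker, then uses MPI_Probe to find out
which worker has replied, receives that result, and sends the worker either
another square or the stop value.

    /* Sketch only: farm out NxN squares from the root (rank 0) and collect
       the partial areas.  square_area() is a hypothetical routine that
       computes the area of the Mandelbrot set within one square.          */
    #include <mpi.h>

    #define STOP (-1)               /* special 'square': no more work */

    extern double square_area(int square, int n);   /* not shown here */

    double root_loop(int nsquares, int nprocs)
    {
        double total = 0.0;
        int next = 0, busy = 0, square;
        MPI_Status status;

        /* Hand one square to every other process to start with */
        for (int rank = 1; rank < nprocs; ++rank) {
            square = (next < nsquares) ? next++ : STOP;
            MPI_Send(&square, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
            if (square != STOP) ++busy;
        }

        /* Collect results; give each returning worker more work or STOP */
        while (busy > 0) {
            double area;
            MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Recv(&area, 1, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += area;
            --busy;
            square = (next < nsquares) ? next++ : STOP;
            MPI_Send(&square, 1, MPI_INT, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD);
            if (square != STOP) ++busy;
        }
        return total;
    }

    void worker_loop(int n)
    {
        int square;
        for (;;) {
            MPI_Recv(&square, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (square == STOP) break;              /* no more work to do */
            double area = square_area(square, n);
            MPI_Send(&area, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

Because the root counts only the workers that still hold work, the loop
terminates cleanly even when there are fewer squares than workers.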
1.2 Now change to using non-blocking transfers with MPI_Waitany and
MPI_Testany, instead of blocking ones with MPI_Probe and MPI_Iprobe.  The
differences should be trivial, and the performance almost identical.  In
practice, you can use whichever technique you prefer.

CHALLENGE: people who want a challenge should modify the program to make the
root process do calculation when there is nothing else to do.  Doing that is
easy; doing it reliably and making it worthwhile is not.  One key to
effectiveness is for the root process to use smaller units of work than the
other processes.  There are no worked answers to this suggestion.

Question 2
----------

2.1 The mathematician J. H. Conway invented the "Game of Life", which is
played on an infinite two-dimensional grid of cells, with an arbitrary
starting point.  The neighbours of a cell are the 8 surrounding ones.
Consider the following figure:

    A | B | C | D | E
    -----------------
    F | G | H | I | J
    -----------------
    K | L | M | N | O
    -----------------
    P | Q | R | S | T
    -----------------
    U | V | W | X | Y

The neighbours of M are G, H, I, L, N, Q, R and S, and the neighbours of S
are M, N, O, R, T, W, X and Y.

Each cell is alive or dead, and the game proceeds in a series of steps.  At
each time-step, simultaneously and in parallel, all of the cells transform
according to the following rules:

    Any live cell with 2 or 3 live neighbours remains live.
    Any live cell with 0, 1, 4 or more live neighbours dies.
    Any dead cell with exactly 3 live neighbours becomes live.
    Any dead cell with any other number of live neighbours remains dead.

This is commonly approximated by a bounded NxN grid, with all cells outside
the grid being regarded as permanently dead.  This question assumes you will
do that, and that you will not worry too much about efficiency - e.g. you
should probably use a single small integer for a cell, and not a bit-map.

Write a program that implements "Life", using block partitioning on an NxN
grid, where you may assume that the number of processes is M*M (i.e. M is 2
if you have 4 processes) and N is a multiple of M.  This simplifies the
coding considerably, but you should check all of your assumptions.  It
should read a file that contains the name of the data file and the number of
steps, each on a single line.  It should print out the initial and final
grid.  Use the root process for all I/O.

The data file should contain the grid size N, on one line, and N lines of
dots (.) and asterisks (*) giving the initial grid; the dots are dead cells
and the asterisks are live ones.

You should use MPI_Alltoallv for handling the boundaries between processes,
remembering that counts of zero are allowed; a partial sketch of such an
exchange is given at the end of this sheet.

Read the file "Programs/random.life", which is for an 80x80 grid, run it for
100 steps, and check that you get the same output as in the answers.  There
are also beacon.life, glider.life, pulsar.life and gosper.life in the same
directory; they have increasing, but smaller, sizes that are multiples of 8,
suitable for 1, 4, 16 or 64 processes.  bigrandom.life.gz is for a 256x256
grid, and is not suitable for printing.  There are serial codes in that
directory in life.f90, life.c and life.cpp, but they will not help a lot as
starting points.

2.2 Change the program to use non-blocking point-to-point transfers to
transfer the data.  Note that doing this is straightforward, once you have
the MPI_Alltoallv design, because the analysis and most of the code needed
is exactly the same as for MPI_Alltoallv.

NOTE: with both methods, parameterising the width of the boundary to K means
that you need to synchronise only every K steps, though slightly more memory
and calculation are needed.  The specimen answers do precisely that.  On one
system with 4 processes and bigrandom.life.gz, the best boundary width was
about 3-4, but every application will be different.

CHALLENGES: people who want a challenge should consider either or both of
the following extensions.  There are no worked answers to these suggestions.

The specimen answers are wasteful of space, and have two copies of the whole
data on the root process.  Modify the program to never have more than 1/Pth
of the data in any process, where P is the number of processes.

The restrictions on the grid sizes and number of processes are not necessary
(e.g. rectangular grids can be handled, and any number of processes, even
primes).  Modify the program to handle such sizes efficiently.
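To show the shape of the MPI_Alltoallv boundary exchange asked for in 2.1,
here is a partial sketch in C that exchanges only the upper and lower
boundary rows between vertically adjacent processes.  It is not the specimen
answer: all of the names (exchange_rows, cells, halo_up, halo_down, local,
m) are illustrative, the left and right columns and the corner cells are
omitted (they follow exactly the same pattern, with different counts and
displacements), a row-major numbering of the process grid is assumed, and in
a real code the count and displacement arrays would be set up once rather
than on every step.  The point to notice is that almost every entry in the
count arrays is zero, which MPI_Alltoallv allows.

    /* Sketch only: exchange the top and bottom boundary rows of each
       process's local x local block using MPI_Alltoallv, leaving most of
       the counts at zero.  The process grid is m x m, numbered row-major. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void exchange_rows(const unsigned char *cells, unsigned char *halo_up,
                       unsigned char *halo_down, int local, int m, int rank)
    {
        int nprocs = m * m, row = rank / m;
        int *sendcounts = calloc(nprocs, sizeof(int));
        int *recvcounts = calloc(nprocs, sizeof(int));
        int *senddispls = calloc(nprocs, sizeof(int));
        int *recvdispls = calloc(nprocs, sizeof(int));
        unsigned char *sendbuf = malloc(2 * local);
        unsigned char *recvbuf = malloc(2 * local);

        if (row > 0) {                       /* process directly above me */
            int up = rank - m;
            memcpy(sendbuf, cells, local);                /* my top row */
            sendcounts[up] = recvcounts[up] = local;
            senddispls[up] = recvdispls[up] = 0;
        }
        if (row < m - 1) {                   /* process directly below me */
            int down = rank + m;
            memcpy(sendbuf + local, cells + (local - 1) * local, local);
            sendcounts[down] = recvcounts[down] = local;  /* my bottom row */
            senddispls[down] = recvdispls[down] = local;
        }

        MPI_Alltoallv(sendbuf, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
                      recvbuf, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
                      MPI_COMM_WORLD);

        /* Unpack the neighbours' boundary rows into my halo rows */
        if (row > 0)     memcpy(halo_up,   recvbuf,         local);
        if (row < m - 1) memcpy(halo_down, recvbuf + local, local);

        free(sendbuf);    free(recvbuf);
        free(sendcounts); free(recvcounts);
        free(senddispls); free(recvdispls);
    }

Processes in the top or bottom row of the process grid simply leave the
corresponding halo alone; since cells outside the global grid are
permanently dead, those halo rows should just be kept at zero.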