Introduction to Programming with MPI
------------------------------------

Practical Exercise 10 (Problem Decomposition)
---------------------------------------------

For general instructions, see the introduction to the collective practicals.

These exercises are at about the same level of difficulty as converting some
actual scientific codes to using MPI.  You may prefer to leave these until
after the next lecture, as the examples for that are much simpler, and one of
them uses a technique which may help with the first example here.

The first is a realistic example of how to distribute data in an
embarrassingly parallel problem, and is easier than it looks.  The second is
a realistic example of how to distribute data for things like finite element
analysis and PDEs, and most of the coding and tuning techniques are similar,
though of course the actual calculation is trivial.  It is a lot harder than
it looks, and the difficulty is not in using the MPI facilities; handling
data boundaries correctly and efficiently is probably the hardest part of
distributed memory parallelism.  Do not underestimate the difficulty of the
second practical, as it is very much harder than all of the others.

For now, regard the root process as process zero.

Question 1
----------

1.1 In the first lecture, you saw the results of a program that calculated
the area of the Mandelbrot set.  That program is in
"Programs/mandelbrot.f90", and C and C++ versions are in
"Programs/mandelbrot.c" and "Programs/mandelbrot.cpp".  Change that program
to divide the problem up into small squares of NxN points.  You will need to
read in an extra parameter, indicating the number of points on the side of a
square (i.e. N).  Use the root process for all I/O.

The root process should distribute one square to each process other than
itself, and each other process should return the area of the Mandelbrot set
within that square.  As each process returns, the root process should give
it another square to do, and there should be a special 'square' that
indicates there is no more work to do.  Use blocking transfers and MPI_Probe
(in the root process only), and do no calculation on the root process.

Experiment with the size of the squares to see how the performance varies.
On one system, with 7 CPUs, using an N value of about 50 was optimal, and
gave a speedup of 5.4 times.  There are several ways to improve that to
approaching 7, but most of them are outside the scope of this course.
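What follows is a minimal sketch, in C, of the kind of farm loop that 1.1
asks for; it is not the specimen answer.  The routine and variable names
(root_loop, worker_loop, square_area, STOP) are purely illustrative,
square_area is assumed to exist elsewhere, and error checking is omitted.
The root hands one square to each worker, then uses MPI_Probe to find out
which worker has replied, receives that result, and sends the worker either
another square or the stop value.

    /* Sketch only: farm out NxN squares from the root (rank 0) and collect
       the partial areas.  square_area() is a hypothetical routine that
       computes the area of the Mandelbrot set within one square.          */
    #include <mpi.h>

    #define STOP (-1)               /* special 'square': no more work */

    extern double square_area(int square, int n);   /* not shown here */

    double root_loop(int nsquares, int nprocs)
    {
        double total = 0.0;
        int next = 0, busy = 0, square;
        MPI_Status status;

        /* Hand one square to every other process to start with */
        for (int rank = 1; rank < nprocs; ++rank) {
            square = (next < nsquares) ? next++ : STOP;
            MPI_Send(&square, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
            if (square != STOP) ++busy;
        }

        /* Collect results; give each returning worker more work or STOP */
        while (busy > 0) {
            double area;
            MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Recv(&area, 1, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += area;
            --busy;
            square = (next < nsquares) ? next++ : STOP;
            MPI_Send(&square, 1, MPI_INT, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD);
            if (square != STOP) ++busy;
        }
        return total;
    }

    void worker_loop(int n)
    {
        int square;
        for (;;) {
            MPI_Recv(&square, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (square == STOP) break;              /* no more work to do */
            double area = square_area(square, n);
            MPI_Send(&area, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

Because the root counts only the workers that still hold work, the loop
terminates cleanly even when there are fewer squares than workers.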
1.2 Now change to using non-blocking transfers with MPI_Waitany and
MPI_Testany, instead of blocking ones with MPI_Probe and MPI_Iprobe.  The
differences should be trivial, and the performance almost identical.  In
practice, you can use whichever technique you prefer.

CHALLENGE: people who want a challenge should modify the program to make the
root process do calculation when there is nothing else to do.  Doing that is
easy; doing it reliably and making it worthwhile is not.  One key to
effectiveness is for the root process to use smaller units of work than the
other processes.  There are no worked answers to this suggestion.

Question 2
----------

2.1 The mathematician J. H. Conway invented the "Game of Life", which is
played on an infinite two-dimensional grid of cells, with an arbitrary
starting point.  The neighbours of a cell are the 8 surrounding ones.
Consider the following figure:

    A | B | C | D | E
    -----------------
    F | G | H | I | J
    -----------------
    K | L | M | N | O
    -----------------
    P | Q | R | S | T
    -----------------
    U | V | W | X | Y

The neighbours of M are G, H, I, L, N, Q, R and S, and the neighbours of S
are M, N, O, R, T, W, X and Y.

Each cell is alive or dead, and the game proceeds in a series of steps.  At
each time-step, simultaneously and in parallel, all of the cells transform
according to the following rules:

    Any live cell with 2 or 3 live neighbours remains live.
    Any live cell with 0, 1, 4 or more live neighbours dies.
    Any dead cell with exactly 3 live neighbours becomes live.
    Any dead cell with any other number of live neighbours remains dead.

This is commonly approximated by a bounded NxN grid, with all cells outside
the grid being regarded as permanently dead.  This question assumes you will
do that, and that you will not worry too much about efficiency - e.g. you
should probably use a single small integer for a cell, and not a bit-map.

Write a program that implements "Life", using block partitioning on an NxN
grid, where you may assume that the number of processes is M*M (i.e. M is 2
if you have 4 processes) and N is a multiple of M.  This simplifies the
coding considerably, but you should check all of your assumptions.  It
should read a file that contains the name of the data file and the number of
steps, each on a single line.  It should print out the initial and final
grid.  Use the root process for all I/O.

The data file should contain the grid size N, on one line, and N lines of
dots (.) and asterisks (*) giving the initial grid; the dots are dead cells
and the asterisks are live ones.

You should use MPI_Alltoallv for handling the boundaries between processes,
remembering that counts of zero are allowed; a partial sketch of such an
exchange is given at the end of this sheet.

Read the file "Programs/random.life", which is for an 80x80 grid, run it for
100 steps, and check that you get the same output as in the answers.  There
are also beacon.life, glider.life, pulsar.life and gosper.life in the same
directory; they have increasing, but smaller, sizes that are multiples of 8,
suitable for 1, 4, 16 or 64 processes.  bigrandom.life.gz is for a 256x256
grid, and is not suitable for printing.  There are serial codes in that
directory in life.f90, life.c and life.cpp, but they will not help a lot as
starting points.

2.2 Change the program to use non-blocking point-to-point transfers to
transfer the data.  Note that doing this is straightforward, once you have
the MPI_Alltoallv design, because the analysis and most of the code needed
is exactly the same as for MPI_Alltoallv.

NOTE: with both methods, parameterising the width of the boundary to K means
that you need to synchronise only every K steps, though slightly more memory
and calculation are needed.  The specimen answers do precisely that.  On one
system with 4 processes and bigrandom.life.gz, the best boundary width was
about 3-4, but every application will be different.

CHALLENGES: people who want a challenge should consider either or both of
the following extensions.  There are no worked answers to these suggestions.

The specimen answers are wasteful of space, and have two copies of the whole
data on the root process.  Modify the program to never have more than 1/Pth
of the data in any process, where P is the number of processes.

The restrictions on the grid sizes and number of processes are not necessary
(e.g. rectangular grids can be handled, and any number of processes, even
primes).  Modify the program to handle such sizes efficiently.
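To show the shape of the MPI_Alltoallv boundary exchange asked for in 2.1,
here is a partial sketch in C that exchanges only the upper and lower
boundary rows between vertically adjacent processes.  It is not the specimen
answer: all of the names (exchange_rows, cells, halo_up, halo_down, local,
m) are illustrative, the left and right columns and the corner cells are
omitted (they follow exactly the same pattern, with different counts and
displacements), a row-major numbering of the process grid is assumed, and in
a real code the count and displacement arrays would be set up once rather
than on every step.  The point to notice is that almost every entry in the
count arrays is zero, which MPI_Alltoallv allows.

    /* Sketch only: exchange the top and bottom boundary rows of each
       process's local x local block using MPI_Alltoallv, leaving most of
       the counts at zero.  The process grid is m x m, numbered row-major. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void exchange_rows(const unsigned char *cells, unsigned char *halo_up,
                       unsigned char *halo_down, int local, int m, int rank)
    {
        int nprocs = m * m, row = rank / m;
        int *sendcounts = calloc(nprocs, sizeof(int));
        int *recvcounts = calloc(nprocs, sizeof(int));
        int *senddispls = calloc(nprocs, sizeof(int));
        int *recvdispls = calloc(nprocs, sizeof(int));
        unsigned char *sendbuf = malloc(2 * local);
        unsigned char *recvbuf = malloc(2 * local);

        if (row > 0) {                       /* process directly above me */
            int up = rank - m;
            memcpy(sendbuf, cells, local);                /* my top row */
            sendcounts[up] = recvcounts[up] = local;
            senddispls[up] = recvdispls[up] = 0;
        }
        if (row < m - 1) {                   /* process directly below me */
            int down = rank + m;
            memcpy(sendbuf + local, cells + (local - 1) * local, local);
            sendcounts[down] = recvcounts[down] = local;  /* my bottom row */
            senddispls[down] = recvdispls[down] = local;
        }

        MPI_Alltoallv(sendbuf, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
                      recvbuf, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
                      MPI_COMM_WORLD);

        /* Unpack the neighbours' boundary rows into my halo rows */
        if (row > 0)     memcpy(halo_up,   recvbuf,         local);
        if (row < m - 1) memcpy(halo_down, recvbuf + local, local);

        free(sendbuf);    free(recvbuf);
        free(sendcounts); free(recvcounts);
        free(senddispls); free(recvdispls);
    }

Processes in the top or bottom row of the process grid simply leave the
corresponding halo alone; since cells outside the global grid are
permanently dead, those halo rows should just be kept at zero.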