Laboratory for Scientific Computing

Running parallel jobs

Using MPI

For an introduction to MPI, see Nick Maclaren's MPI handouts.

The MPI implementation installed on the LSC network is OpenMPI. If you need a different implementation, please consult a sysadmin. To run an MPI program on, say, four processes, you will typically do:

      mpirun -np 4 myMPICode
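For example, a C source file can be compiled with the mpicc wrapper shipped with OpenMPI and then launched as above (a sketch; myMPICode.c and the optimisation level are just placeholders):

      mpicc -O2 -o myMPICode myMPICode.c
      mpirun -np 4 ./myMPICode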

To check how many cores a particular machine has, consult the system list. Some machines have hyper-threading switched on, which may help workloads such as compilation or multi-tasking, but is very unlikely to help floating-point-intensive codes. The list factors hyper-threading out. On machines with hyper-threading, top will report around 50% CPU usage when you are using all the physical cores; this is normal. Using more processes than there are cores available will probably harm the performance of your code.
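If you are already logged in to a machine and want to check locally, lscpu reports the logical CPU count together with the socket and per-core thread counts, so the physical core count is Core(s) per socket multiplied by Socket(s). For example:

      lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'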

Debugging parallel jobs

There are various ways to debug parallel codes. Probably the easiest is to use print statements throughout the code. If merely putting print statements in the code causes the bugs to go away, then you probably have a race condition or something similar.
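With several ranks printing at once, it also helps to know which process produced each line of output. OpenMPI's mpirun can tag output with the rank that produced it (a sketch; check mpirun --help, since the exact option name varies between OpenMPI releases):

      mpirun -np 4 --tag-output myMPICode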

It is also possible to use a debugger such as gdb in parallel. If you really feel the need to go down this route, then try the script in /home/raid/pmb39/bin/mpi-screen. Run this using

	  ~pmb39/bin/mpi-screen -np 4 myProcess myArgs
	

This runs each process under gdb (with the given program arguments) and allows you to switch between them using GNU screen. See the on-screen instructions and use the keys indicated to switch between processes. If you have any problems with this code, please let pmb39 know.
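If a job hangs part-way through a run rather than at start-up, another option is to attach gdb to one of the already-running ranks on the machine where it is executing (a sketch; replace myMPICode with the name of your executable):

      # attach to the first matching process; repeat with other PIDs as needed
      gdb -p $(pgrep -f myMPICode | head -n 1)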

An alternative if you're working on a single machine is to use mpigdb, which will launch a separate xterm for each process. This is more robust than the script above, but requires X-forwarding.
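If mpigdb is not available, a broadly equivalent invocation (a sketch, assuming a working X session and that myProcess is your executable) launches one xterm per rank, each running gdb:

      mpirun -np 4 xterm -e gdb --args ./myProcess myArgs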

Larger scale parallel jobs

The machines hex, tycho, hydra03 and hydra04 each have 16 cores and 64GB of RAM.

If you want to run MPI jobs across a pair of these machines, use:

      mpirun -H hydra03,hydra04 -np 16 /path/to/my/code

You will probably need to give a complete path to your executable so that it can be found from both machines (remember that the /home and /data spaces are mounted on both systems). If you use the data spaces, use the path beginning /data rather than /local/data.
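If you run across these machines regularly, it may be more convenient to keep the host list in an OpenMPI hostfile (a sketch; the filename, slot counts and executable path are placeholders):

      # contents of myhosts
      hydra03 slots=16
      hydra04 slots=16

      mpirun --hostfile myhosts -np 32 /data/mygroup/myMPICode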

If the MPI job hangs before even starting, first try running something simple under mpirun across the same hosts, for example mpirun -H hydra03,hydra04 -np 2 ls. If this also hangs, try the following:

	  which mpirun
	  ssh asteria export | grep PATH
	

If the directory containing mpirun (as reported by which mpirun above) does not appear in the PATH shown over ssh, then you will need to ensure that your ~/.bashrc is set up correctly for non-interactive shells. For example, you may need to comment out the line

	  [ -z "$PS1" ] && return
	

at the top of the file, or move your PATH settings above it. If you are using a different shell, check its manual to see which start-up files are read by non-interactive shells.
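A minimal sketch of a ~/.bashrc arranged this way (the OpenMPI directory shown is an assumption; use whatever directory which mpirun reports in an interactive shell):

      # Settings needed by non-interactive shells (e.g. MPI jobs launched over
      # ssh) must come before any early-return guard.
      export PATH=/usr/local/openmpi/bin:$PATH   # assumed install location

      # Interactive-only settings can safely follow this guard.
      [ -z "$PS1" ] && return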

Even larger scale?

If you find that you need to run even larger scale simulations, either because you have exceeded the memory available on a pair of hydras or because your simulations are taking too long, then please consult a sysadmin. One solution may be to use the Cambridge CSD3 facility.