Running parallel jobs
Using MPI
For an introduction to MPI, see Nick Maclaren's MPI handouts.
The MPI implementation installed on the LSC network is OpenMPI. If you need a different implementation, please consult a sysadmin. In order to run an MPI program, you will typically do:
mpirun -np 4 myMPICode
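As a quick sketch (assuming your source file is called myMPICode.c, which is just a placeholder), building and launching with OpenMPI looks like:
mpicc -O2 -o myMPICode myMPICode.c
mpirun -np 4 ./myMPICode
Here mpicc is OpenMPI's compiler wrapper, which adds the MPI include and library flags for you.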
In order to check how many cores a particular machine has, check the system list. Some machines have hyper-threading switched on, which may help with some workloads, such as compilation or general multi-tasking, but is very unlikely to help with floating-point intensive codes. The list above counts physical cores only, with hyper-threading factored out. On machines with hyper-threading, top will claim 50% CPU usage when you are using all of the physical cores. This is normal. Any attempt to use more processes than the number of cores available will probably harm the performance of your code.
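If you want to sanity-check a machine directly, lscpu reports the physical layout (nproc, by contrast, counts logical CPUs and so includes hyper-threads):
lscpu | grep -E 'Socket|Core|Thread'
nproc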
Debugging parallel jobs
There are a variety of ways to debug parallel codes. Probably the easiest is to use print statements throughout the code. If you find that merely putting print statements in the code causes the bugs to go away, then you probably have a race condition or something similar.
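When debugging with print statements it helps to know which rank produced each line of output. OpenMPI's mpirun can tag each output line with the rank that wrote it (the exact option name may differ between OpenMPI versions):
mpirun -np 4 --tag-output myMPICode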
It is also possible to use a debugger such as gdb in parallel. If you really feel the need to go down this route, then try the script in /home/raid/pmb39/bin/mpi-screen. Run this using
~pmb39/bin/mpi-screen -np 4 myProcess myArgs
This runs each process under gdb (with given program arguments), and allows you to switch between them using GNU screen. See the on-screen instructions and use the keys indicated to switch between processes. If you have any problems with this code, please let pmb39 know.
An alternative if you're working on a single machine is to use mpigdb, which will launch a separate xterm for each process. This is more robust than the script above, but requires X-forwarding.
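If mpigdb is not available, much the same effect can usually be had with plain mpirun by starting one xterm per process, each running gdb (again this needs a working X display; myProcess is a placeholder):
mpirun -np 4 xterm -e gdb ./myProcess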
Larger scale parallel jobs
The machines hex, tycho, hydra03 and hydra04 each have 16 cores and 64GB of RAM.
If you want to run MPI jobs across a pair of these machines, use
mpirun -H hydra03,hydra04 -np 16 /path/to/my/code
You will probably need to give a complete path to your executable so that it can be found from both machines (remember that both /home and /data spaces are mounted across both systems). If you use the data spaces, then you will need to use the path beginning /data rather than /local/data.
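As a sketch, assuming the executable lives in a shared data space (the path /data/myproject/myMPICode is hypothetical), you can first check that it is visible from the other machine and then launch it with its full path:
ssh hydra04 ls /data/myproject/myMPICode
mpirun -H hydra03,hydra04 -np 16 /data/myproject/myMPICode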
If you have problems with the MPI job hanging before even starting, try running something simple like ls in place of your executable. If that also hangs, try the following:
which mpirun
ssh asteria export | grep PATH
If the path to the first mpirun is not visible when connecting via ssh, then you will need to ensure that your ~/.bashrc is set up correctly for non-interactive shells: in particular, any PATH settings must appear before a line such as
[ -z "$PS1" ] && return
at the top of the file, since everything after that line is skipped when the shell is non-interactive. If you are using a different shell, please check the manual to see which files are executed on login with a non-interactive shell.
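As a sketch, the ordering in ~/.bashrc should look something like the following (the OpenMPI location shown is hypothetical; use whatever directory which mpirun reports on these machines):
# PATH settings first, so they also apply to non-interactive shells
export PATH=/usr/lib64/openmpi/bin:$PATH
# non-interactive shells stop here
[ -z "$PS1" ] && return
# interactive-only settings (prompt, aliases, etc.) go below this line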
Even larger scale?
If you find that you need to run even larger scale simulations, either because you have exceeded the memory available on a pair of hydras or because your simulations are taking too long, then please consult a sysadmin. One solution may be to use the Cambridge CSD3 facility.