Laboratory for Scientific Computing

Hydra Compute Nodes


The machines hydra01, hydra02, hydra03, hydra04, tycho, and hex have more cores available than most other machines. Their properties are as follows:

  • hydra01/hydra02: Two Intel Xeon E5607 processors at 2.27GHz (8 cores in total) - Dell 06FW8P motherboard
  • hydra03/hydra04: Two Intel Xeon E5-2560 processors at 2.00GHz (16 cores in total) - Intel Workstation Board W2600CR
  • tycho: Two Intel Xeon E5-2650 processors at 2.60GHz (16 cores in total) - Intel Server Board S2600IP
  • hex: Two Intel Xeon E5-2640 v3 processors at 2.60GHz (16 cores in total) - Winbond Electronics 1CM6952 motherboard

All of these machines have a NUMA setup that associates with each processor the region of memory physically closest to it. The bandwidth between processor and memory therefore depends on which region of memory holds the data being accessed. The NUMA setup ensures that memory allocated by a process running on a particular processor should come from the region closest to that processor.

In order to test the memory bandwidth, you can use the STREAM benchmark. You can modify its behaviour using the taskset and numactl commands, which force the process to run on particular cores and to use memory from a particular NUMA node. For example:

	  numactl --preferred=0 taskset -c 0-3 ./stream

uses cores 0-3 and memory node 0. Using --preferred=1 instead would give a lower bandwidth as the memory allocated would be on the opposite node to the processor.
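Measurements like the tables below can be collected by sweeping over core sets and memory nodes. The following sketch assumes a two-socket machine (memory nodes 0 and 1) and a ./stream binary in the current directory; it only echoes the commands it would run, so you can check them against your topology before removing the echo:

```shell
#!/bin/sh
# Sweep memory nodes and core sets for the STREAM binary.
# Assumes two memory nodes (0 and 1) and ./stream in the current
# directory -- adjust the core lists for your machine's topology.
for node in 0 1; do
    for cores in 0 0-3 0-7; do
        # Commands are echoed so the sweep can be inspected first;
        # remove the "echo" to actually run the benchmark.
        echo numactl --preferred=$node taskset -c $cores ./stream
    done
done
```

Collecting the reported bandwidth figure from each run then gives a table of the kind shown below for each machine.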

hydra01

For hydra01 (and hydra02), the results are as follows:

  Processor ID(s)   Memory socket(s)   Bandwidth
  0                 0                  7.61GB/s
  0                 1                  5.66GB/s
  4                 0                  5.65GB/s
  4                 1                  7.66GB/s
  0,1               0                  10.3GB/s
  0,1               1                  7.42GB/s
  0,4               0,1                14.7GB/s
  0-3               0                  10.6GB/s
  0-3               1                  7.7GB/s
  0,2,4,6           0,1                17.9GB/s
  0-7               0,1                17.8GB/s

From this we see that processors 0-3 are closest to memory node 0, and processors 4-7 are closest to memory node 1. Further, the bandwidth between sockets is lower than that between a socket and its own memory (as one would expect). However, when all processors are used, and associated with their own memory, the bandwidth is (more-or-less) the sum of that from individual sockets, so there is no conflict between the socket communications.

On the other hand, using two cores on a single processor maxes out the memory bandwidth for that processor, since increasing to four cores barely increases performance. So, if your program is memory-access-bound, it may not speed up appreciably going from 2 processes to 4 processes on a single socket.

hydra03

For hydra03 (and hydra04), the results are as follows:

  Processor ID(s)   Memory socket(s)   Bandwidth
  0                 0                  12.6GB/s
  0                 1                  8.74GB/s
  8                 0                  8.7GB/s
  8                 1                  12.9GB/s
  0,1               0                  23.6GB/s
  0,1               1                  13.6GB/s
  0,8               0,1                10.7GB/s
  0-3               0                  18.6GB/s
  0-3               1                  12.2GB/s
  0-7               0                  24.6GB/s
  0-7               1                  13.0GB/s
  0-3,8-11          0,1                30-32GB/s
  0-15              0,1                30-36GB/s

From this we see that processors 0-7 are closest to memory node 0, and processors 8-15 are closest to memory node 1. Again, the bandwidth between sockets is lower than that between a socket and its own memory, as one would expect. There is some increase in bandwidth going from 2 cores to 4 cores on one socket, and more going from 4 cores to 8 cores, which looks hopeful. However, the bandwidth does not scale with the number of cores, so again, if your program is memory-access-bound, it will not show as much performance improvement as you might hope.

tycho

For tycho, the results are as follows:

  Processor ID(s)   Memory socket(s)   Bandwidth
  0                 0                  11.0GB/s
  0                 1                  7.26GB/s
  8                 0                  8.2GB/s
  8                 1                  11.0GB/s
  0,1               0                  20.0GB/s
  0,1               1                  14.0GB/s
  0,8               0,1                22.0GB/s
  0-3               0                  26.5GB/s
  0-3               1                  22.9GB/s
  0,1,8,9           0,1                39GB/s
  0-7               0                  26.6GB/s
  0-7               1                  22.2GB/s
  0-3,8-11          0,1                48-53GB/s
  0-15              0,1                53GB/s

Thus, we see that processors 0-7 are closest to memory node 0, and processors 8-15 are closest to memory node 1. Again, the bandwidth between sockets is lower than that between a socket and its own memory, as one would expect. The bandwidth from two cores is double that from a single core, and the bandwidth from four cores of a single socket is slightly higher again. Also, the bandwidth to four cores from the opposite memory node is only slightly lower than from the local node. However, the bandwidth to all 8 cores on a socket is only about the same as for 4 cores. The total bandwidth for all 16 cores is double that for 8 cores on one socket. So, if your code is memory-bound, you will probably get good speed-up going from 1 to 2 cores and from 2 to 4 cores, but very little beyond that, even assuming you use taskset and numactl appropriately.

hex

For hex, the results are as follows:

  Processor ID(s)   Memory socket(s)   Bandwidth
  0                 0                  14.9GB/s
  0                 1                  12.3GB/s
  8                 0                  14.9GB/s
  8                 1                  12.4GB/s
  0,1               0                  29.7GB/s
  0,1               1                  18.7GB/s
  0,8               0,1                20.0GB/s
  0-3               0                  33.9GB/s
  0-3               1                  20.9GB/s
  0,1,8,9           0,1                34.6GB/s
  0-7               0                  41.6GB/s
  0-7               1                  20.9GB/s
  0-3,8-11          0,1                41.4GB/s
  0-15              0,1                41.2GB/s

Thus, we see that processors 0-7 are closest to memory node 0, and processors 8-15 are closest to memory node 1. Further, the bandwidth between sockets is lower than that between a socket and its own memory (as one would expect). The bandwidth from two cores (29.7GB/s) is double that on a single core (14.9GB/s), and the bandwidth on four cores of a single socket (33.9GB/s) is slightly higher than that of two cores. However, the bandwidth to four cores on the opposite memory node (20.9GB/s) is substantially lower than that on the same memory node. Bandwidth to all 8 cores on a socket (41.6GB/s) is a bit higher than for 4 cores. The total bandwidth for all 16 cores is the same as that for 8 cores.

This suggests that each individual socket scales well for memory access, but that the interconnect between the sockets is relatively poor.

cerberus1

The cerberusN machines each have two 16-core Intel(R) Xeon(R) Silver 4314 CPUs @ 2.40GHz. These are hyper-threaded, so the machines appear to have 64 cores overall, but only 32 of these are useful for floating-point work. They have 64GB RAM, i.e. 2GB per physical core. For cerberus1 the bandwidth results are as follows:

  Processor ID(s)        Memory socket(s)   Bandwidth
  0                      0                  16.7GB/s
  0                      1                  12.3GB/s
  1                      0                  12.3GB/s
  1                      1                  16.7GB/s
  0,2                    0                  23.9GB/s
  0,2,4,6                0                  27.5GB/s
  0,2,4,6,8,10,12,14     0                  29.8GB/s
  0,1                    0,1                31.4GB/s
  0-3                    0,1                41.1GB/s
  0-7                    0,1                34.8GB/s
  0-15                   0,1                37.8GB/s
  0-31                   0,1                65GB/s

We see that even-numbered processors are closest to memory node 0, and odd-numbered ones to memory node 1. The bandwidth from two cores (23.9GB/s) is only 1.5 times that for a single core (16.7GB/s), and the bandwidth for 4 cores (27.5GB/s) is not much better.

Bandwidth to two cores on separate sockets (e.g. 0,1: 31.4GB/s) is double that for a single core, as we would expect. However, the maximum bandwidth obtainable from all cores (65GB/s) is only double this again.

We conclude that there is a bottleneck somewhere.

Guidelines

In general, Linux will distribute the processes between sockets, i.e. two processes will probably end up on different sockets for an MPI/OpenMP run. If you add the --bind-to-core option to mpirun (for OpenMPI) then this will prevent processes moving between cores. You may wish to combine this with taskset to force particular sockets/cores.
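One way to combine the two is a small per-rank wrapper script. The sketch below assumes a hydra03-style layout (cores 0-7 on socket 0, cores 8-15 on socket 1) and relies on the OMPI_COMM_WORLD_LOCAL_RANK variable that OpenMPI sets for each local rank; it echoes the pinning command rather than running it, so change the echo to exec to use it for real (note also that newer OpenMPI releases spell the binding option --bind-to core rather than --bind-to-core, so check mpirun --help on your system):

```shell
#!/bin/sh
# pin.sh -- sketch of a per-rank pinning wrapper for a two-socket node
# with cores 0-7 on socket 0 and cores 8-15 on socket 1 (hydra03-style).
# Hypothetical usage under OpenMPI:  mpirun -np 2 ./pin.sh ./my_program
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}   # set by OpenMPI for each local rank
case $rank in
    0) node=0; cores=0-7  ;;
    *) node=1; cores=8-15 ;;
esac
# Echoes the command for inspection; change "echo" to "exec" to run it.
echo numactl --preferred=$node taskset -c $cores "$@"
```

Each rank then runs on its own socket with its memory allocated locally, matching the best-case rows in the tables above.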

So, most of this page is of limited interest and/or use, since Linux will do the Right Thing (TM) in most cases, although it may explain why you do not magically get the performance improvements you expect.