Hydra Compute Nodes
Multi-socket compute nodes
Some machines within CSC (e.g. cerberus1) have multiple physical processors: not just multiple cores, but two physically separate processor packages (sockets). The motherboards of these machines have a separate set of memory attached to each processor. However, if you log into one of these machines, Linux does not make this obvious: each machine appears to be a homogeneous set of cores with a single block of memory.
This is a NUMA (Non-Uniform Memory Access) setup: each processor is associated with the region of memory physically closest to it. The bandwidth between a processor and memory therefore depends on which region holds the data being accessed. By default, memory allocated by a process running on a particular processor should end up in the region closest to that processor.
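You can inspect the NUMA layout of a machine (which cores and how much memory belong to each node) with, for example:
numactl --hardware
lscpu | grep -i numa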
To test the memory bandwidth, you can use the STREAM benchmark. You can modify its behaviour using the taskset and numactl commands, which force the process to run on particular cores and to allocate memory from a particular NUMA node. For example:
numactl --preferred=0 taskset -c 0-3 ./stream
uses cores 0-3 and prefers memory node 0. Using --preferred=1 instead would give a lower bandwidth, as the memory allocated would be on the opposite node to the processor.
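For comparison, the corresponding remote-memory run would be:
numactl --preferred=1 taskset -c 0-3 ./stream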
cerberus1
The cerberusN machines each have two 16-core Intel(R) Xeon(R) Silver 4314 CPUs @ 2.40GHz. These are hyper-threaded, so the machines appear to have 64 cores overall, but only 32 of them are useful for floating-point work. They have 64GB RAM, i.e. 2GB per core. For cerberus1 the bandwidth results are as follows:
Processor ID(s) | Memory node(s) | Bandwidth |
---|---|---|
0 | 0 | 16.7GB/s |
0 | 1 | 12.3GB/s |
1 | 0 | 12.3GB/s |
1 | 1 | 16.7GB/s |
0,2 | 0 | 23.9GB/s |
0,2,4,6 | 0 | 27.5GB/s |
0,2,4,6,8,10,12,14 | 0 | 29.8GB/s |
0,1 | 0,1 | 31.4GB/s |
0-3 | 0,1 | 41.1GB/s |
0-7 | 0,1 | 34.8GB/s |
0-15 | 0,1 | 37.8GB/s |
0-31 | 0,1 | 65GB/s |
We see that even-numbered processors are closest to memory node 0, and odd-numbered ones to memory node 1. The bandwidth from two cores on the same node (23.9GB/s) is only about 1.4 times that for a single core (16.7GB/s), and the bandwidth for four cores (27.5GB/s) is not much better.
Bandwidth to two cores on separate processors (e.g. 0,1) is roughly double that for a single core (31.4GB/s), as we would expect. However, the maximum bandwidth achievable using all cores (65GB/s) is only about double that again.
We conclude that the bandwidth available from each memory node saturates after only a few cores: the bottleneck is in the memory system, not the number of cores.
Guidelines
In general, Linux will distribute processes between sockets, i.e. two processes will probably end up on different sockets for an MPI/OpenMP run. If you add the --bind-to-core option to mpirun (for OpenMPI), this will prevent processes from moving between cores. You may wish to combine this with taskset to force particular sockets/cores.
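As a rough sketch (./my_mpi_program is just a placeholder, and newer OpenMPI releases spell the option as --bind-to core rather than --bind-to-core), a 4-process bound run would look like:
mpirun --bind-to-core -np 4 ./my_mpi_program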
For OpenMP, you may wish to use the environment variables OMP_PLACES and OMP_PROC_BIND. See GCC OpenMP documentation for more details.
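For example, to pin 8 OpenMP threads to physical cores, keeping them close together (here ./stream is simply the benchmark binary from above; any OpenMP program would do):
export OMP_PLACES=cores
export OMP_PROC_BIND=close
OMP_NUM_THREADS=8 ./stream
Using OMP_PROC_BIND=spread instead spreads the threads out across the available places, i.e. across both sockets here.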
So, most of this page is of limited interest and/or use, since Linux will do the Right Thing (TM) in most cases, although it may explain why you do not magically get the performance improvements you expect.