
Hybrid MPI/OpenMP parallelization


Compilation

To compile VASP with OpenMP support, add the following to the list of precompiler flags in your makefile.include file:

CPP_OPTIONS += -D_OPENMP

In addition, you will have to add some compiler-specific options to the command that invokes your Fortran compiler (and sometimes to the linker as well).

When using an Intel toolchain (ifort + Intel MPI), for instance:

FC = mpiifort -qopenmp
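
For illustration, the relevant fragment of such an Intel-based makefile.include might look roughly as follows (a minimal sketch only; here it is assumed that the link command, FCL, needs the OpenMP flag as well):

# enable the OpenMP code paths in VASP
CPP_OPTIONS += -D_OPENMP
# MPI wrapper around the Intel Fortran compiler, with OpenMP enabled
FC  = mpiifort -qopenmp
# the link step is assumed to need the OpenMP flag as well
FCL = mpiifort -qopenmp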

It is probably best to base your makefile.include file on one of the archetypical /arch/makefile.include.*_omp files that are provided with your VASP.6.X.X release:

makefile.include.linux_intel_omp
makefile.include.linux_gnu_omp
makefile.include.linux_nv_omp
makefile.include.linux_pgi_omp

(for use with the Intel, GNU, NVIDIA HPC-SDK, and PGI Fortran compilers, respectively).

To adapt these to the particulars of your system (if necessary), please read the instructions on the installation of VASP.6.X.X.

N.B.: When compiling for Intel CPUs we strongly recommend using an all-Intel toolchain (Intel compiler + Intel MPI + MKL libraries), since this yields the best performance by far (especially for the hybrid MPI/OpenMP version of VASP). The aforementioned compilers and libraries are freely available in the form of the Intel oneAPI Base and HPC toolkits.

An interesting use case for OpenMP support in VASP is to enable it in addition to OpenACC in the OpenACC GPU-port of VASP.

Running multiple OpenMP-threads per MPI-rank

Basically, running VASP on n MPI-ranks with m OpenMP-threads per rank should be as simple as:

export OMP_NUM_THREADS=<m> ; mpirun -np <n> <your-vasp-executable>

Often, however, it is not that simple: in practice, one has to make sure that the MPI-ranks and the OpenMP-threads they spawn are placed optimally onto the cores of the node(s) in order to get good performance.

As an example (a typical Intel Xeon-like architecture): Let us assume we plan to run on 2 nodes, each with 16 physical cores. These 16 cores per node are further divided into two packages (aka sockets) of 8 cores each. The cores on a package share a block of memory and in addition they may access the memory associated with the other package on their node via a so-called crossbar-switch. The latter, however, comes at a (slight) performance penalty.

In the aforementioned situation, a sensible placement of MPI-ranks and OpenMP-threads would for instance be the following: place 2 MPI-ranks on each package (i.e., 8 MPI-ranks in total) and have each MPI-rank spawn 4 OpenMP-threads on the same package. These OpenMP-threads will all have fast access to the memory associated with their package, and will not have to access memory through the crossbar-switch.
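
If you are unsure how the cores of your nodes are grouped into packages, the topology can be inspected with standard tools, for instance (assuming numactl and/or hwloc are installed on your system; neither is required by VASP itself):

numactl --hardware   # lists the NUMA nodes and the cores belonging to each of them
lstopo --no-io       # hwloc's overview of packages, caches, and cores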

Unfortunately, the way to achieve this depends on the flavour of MPI one uses:

Using OpenMPI

export OMP_NUM_THREADS=4
mpirun -np 8 --map-by socket:PE=4 --bind-to core <your-vasp-executable>

The above assures that the additional threads each MPI-rank spawns reside on the same package/socket as that rank, and directs OpenMPI to bind these threads to specific cores, which is crucial for performance.
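
To double-check the placement, OpenMPI can report the bindings it actually applied, for instance (an optional sanity check, not needed for production runs):

mpirun -np 8 --map-by socket:PE=4 --bind-to core --report-bindings <your-vasp-executable>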

When your CPU supports hyperthreading (and it is enabled in the BIOS), there are more logical cores than physical cores (typically by a factor of 2). In this case one should make sure the threads are placed on consecutive physical cores instead of consecutive logical cores. When using Intel's OpenMP runtime library (libiomp5.so), this can be achieved by:

export KMP_AFFINITY=verbose,granularity=fine,compact,1,0

(Intel's documentation provides a detailed explanation of the KMP_AFFINITY environment variable as well as additional examples.)

N.B.: As far as we are aware, this level of fine-grained control over thread placement is only available from Intel's OpenMP runtime. In light of this, and since oversubscribing, i.e., starting more OpenMP-threads than there are physical cores on a node, seldom brings a (large) performance gain, we recommend disabling hyperthreading altogether.
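
Whether hyperthreading is active on a node can be checked quickly, for example with the standard lscpu utility (a value larger than 1 indicates that hyperthreading is enabled):

lscpu | grep -i 'thread(s) per core'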

The verbose keyword in the KMP_AFFINITY setting above is optional, but it is a good idea to use it at least once to check whether the resulting thread placement is as intended.

In addition to taking care of thread placement, it is often necessary to increase the size of the private stack of the OpenMP-threads (to 256 or even 512 MB), since the default is in many cases too small for VASP to run and will cause segmentation faults. Note that OMP_STACKSIZE interprets a plain number as kilobytes, so the unit suffix should be included:

export OMP_STACKSIZE=512m

All of the above may be combined into a single command, as follows:

mpirun -np 8 --map-by socket:PE=4 --bind-to core \
             -x OMP_NUM_THREADS=4 -x OMP_STACKSIZE=512m \
             -x KMP_AFFINITY=verbose,granularity=fine,compact,1,0 \
             <your-vasp-executable>

Using Intel MPI

In case one uses Intel MPI, things are fortunately a bit less involved. Distributing 8 MPI-ranks over 2 nodes with 16 physical cores each (2 sockets per node), with 4 OpenMP-threads per MPI-rank, is as simple as:

export OMP_NUM_THREADS=4
export OMP_STACKSIZE=512m
mpirun -np 8 -ppn 4 <your-vasp-executable>


The -ppn option sets the number of MPI-ranks per node, four in this example. Intel MPI's default pinning will distribute the ranks evenly over the sockets and will keep the OpenMP-threads of each rank within that rank's domain.

As before, the environment variables may be passed as options to the mpirun command:

mpirun -np 8 -ppn 4 \
       -genv OMP_NUM_THREADS=4 -genv OMP_STACKSIZE=512m \
       <your-vasp-executable>
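
Analogous to OpenMPI's --report-bindings, Intel MPI can be asked to print the pinning it applies by raising the library's debug level, e.g. (an optional sanity check):

mpirun -np 8 -ppn 4 \
       -genv I_MPI_DEBUG=4 \
       -genv OMP_NUM_THREADS=4 -genv OMP_STACKSIZE=512m \
       <your-vasp-executable>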

MPI versus MPI/OpenMP: the main difference

As you may know, by default VASP distributes work and data over the MPI-ranks on a per-orbital basis (in a round-robin fashion): Bloch orbital 1 resides on rank 1, orbital 2 on rank 2, and so on. In addition, however, the work and data may be distributed further, in the sense that not a single MPI-rank but a group of MPI-ranks is responsible for the optimisation (and the related FFTs) of a particular orbital. In the pure MPI version of VASP this is specified by means of the NCORE tag.

For instance, to distribute each individual Bloch orbital over 4 MPI-ranks one specifies:

NCORE = 4

The main difference between the pure MPI and the hybrid MPI/OpenMP version of VASP is that the latter will not distribute a single Bloch orbital over multiple MPI-ranks but will distribute the work on a single Bloch orbital over multiple OpenMP-threads.

As such, one does not set NCORE=4 in the INCAR file but starts VASP with 4 OpenMP-threads per MPI-rank.

N.B.: The hybrid MPI/OpenMP version of VASP will internally set NCORE=1 (regardless of what was specified in the INCAR file) when it detects it has been started on more than one OpenMP-thread.
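
For the 2-node, 32-core example discussed above, the two equivalent setups would thus look roughly as follows (a sketch; the placement options discussed earlier are omitted for brevity):

# pure MPI: 32 ranks, each Bloch orbital distributed over 4 ranks via NCORE = 4 in the INCAR file
mpirun -np 32 <your-vasp-executable>

# hybrid MPI/OpenMP: 8 ranks with 4 OpenMP-threads each; the work on each orbital
# is spread over the 4 threads of a rank, and NCORE is reset to 1 internally
export OMP_NUM_THREADS=4
mpirun -np 8 <your-vasp-executable>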

Credits

Many thanks to Jeongnim Kim and Fedor Vasilev at Intel, and Florian Wende and Thomas Steinke of the Zuse Institute Berlin (ZIB)!

Related Tags and Sections

Installing VASP.6.X.X, makefile.include.linux_intel_omp, makefile.include.linux_gnu_omp, makefile.include.linux_nv_omp, makefile.include.linux_pgi_omp