Hybrid MPI/OpenMP parallelization

== Compilation ==

To compile VASP with OpenMP support, add the following to the list of [[Installing_VASP.6.X.X#Precompiler_variables|precompiler flags]] in your <code>makefile.include</code> file:

 CPP_OPTIONS += -D_OPENMP

In addition, you will have to add some compiler-specific options to [[Installing_VASP.6.X.X#Compiler_variables|the command that invokes your Fortran compiler (and sometimes to the linker as well)]].

When using an Intel toolchain (ifort + Intel MPI), for instance:

 FC = mpiifort -qopenmp
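
When using the GNU toolchain instead (gfortran with, ''e.g.'', OpenMPI), the corresponding OpenMP flag is <code>-fopenmp</code>. A minimal sketch; the exact wrapper name and linker options are best taken from the <tt>makefile.include.linux_gnu_omp</tt> template listed below:

 FC = mpif90 -fopenmp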

It is probably best to base your <code>makefile.include</code> file on one of the archetypical <tt>/arch/makefile.include.*_omp</tt> files that are provided with your VASP.6.X.X release:

* [[Makefile.include.linux_intel_omp | makefile.include.linux_intel_omp]]
* [[Makefile.include.linux_gnu_omp | makefile.include.linux_gnu_omp]]
* [[Makefile.include.linux_nv_omp | makefile.include.linux_nv_omp]]
* [[Makefile.include.linux_pgi_omp | makefile.include.linux_pgi_omp]]

(for use with the Intel, GNU, NVIDIA HPC-SDK, and PGI Fortran compilers, respectively).

To adapt these to the particulars of your system (if necessary) please read the [[Installing_VASP.6.X.X|instructions on the installation of VASP.6.X.X]].
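
For example, to build the OpenMP-enabled standard version starting from the Intel template (a minimal sketch; file names and build targets may differ for your release and system):

 cp arch/makefile.include.linux_intel_omp ./makefile.include
 make std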

'''N.B.''': When compiling for Intel CPUs, we strongly recommend using an all-Intel toolchain (Intel compiler + Intel MPI + MKL libraries), since this will yield the best performance by far (especially for the hybrid MPI/OpenMP version of VASP). The aforementioned compilers and libraries are freely available in the form of [https://software.intel.com/content/www/us/en/develop/tools/oneapi/all-toolkits.html the Intel oneAPI base+HPC toolkits].
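
If the oneAPI toolkits are installed in their default location, the corresponding compiler and library environment can typically be loaded with the following command (an assumption about the installation path; adjust to your site):

 source /opt/intel/oneapi/setvars.sh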

== Running multiple OpenMP-threads per MPI-rank ==

Basically, running VASP on ''n'' MPI-ranks with ''m'' OpenMP-threads per rank should be as simple as:

 export OMP_NUM_THREADS=<m> ; mpirun -np <n> <your-vasp-executable>

Often, however, it is not that simple.
In practice, one has to make sure that the MPI-ranks and the OpenMP-threads they spawn are placed optimally onto the cores of the node(s) in order to get good performance.

As an example (a typical Intel Xeon-like architecture): let us assume we plan to run on 2 nodes, each with 16 physical cores. These 16 cores per node are further divided into two ''packages'' (aka ''sockets'') of 8 cores each. The cores on a package share a block of memory, and in addition they may access the memory associated with the other package on their node via a so-called ''crossbar-switch''. The latter, however, comes at a (slight) performance penalty.

In the aforementioned situation, a sensible placement of MPI-ranks and OpenMP-threads would for instance be the following: place 2 MPI-ranks on each package (''i.e.'', 8 MPI-ranks in total) and have each MPI-rank spawn 4 OpenMP-threads on the same package. These OpenMP-threads will all have fast access to the memory associated with their package and will not have to access memory through the crossbar-switch, as sketched below.
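
Schematically, the intended layout for this example is (ranks and threads numbered for illustration only):

 node 0, socket 0: rank 0 (4 threads) | rank 1 (4 threads)
 node 0, socket 1: rank 2 (4 threads) | rank 3 (4 threads)
 node 1, socket 0: rank 4 (4 threads) | rank 5 (4 threads)
 node 1, socket 1: rank 6 (4 threads) | rank 7 (4 threads)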

Unfortunately, the way to achieve this depends on the flavour of MPI one uses:

=== Using OpenMPI ===

 export OMP_NUM_THREADS=4
 mpirun -np 8 --map-by socket:PE=4 --bind-to core <your-vasp-executable>

The above ensures that the additional threads each MPI-rank spawns reside on the same package/socket; directing OpenMPI to bind the threads to specific cores in this way is crucial for performance.
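
To check the placement OpenMPI actually applies, you can additionally pass <code>--report-bindings</code> to <code>mpirun</code> (an optional check; the output format depends on the OpenMPI version):

 mpirun -np 8 --map-by socket:PE=4 --bind-to core --report-bindings <your-vasp-executable>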
  
When your CPU supports ''hyperthreading'' (and it is enabled in the BIOS), there are more ''logical'' cores than ''physical'' cores (typically by a factor of 2). In this case, one should make sure the threads are placed on consecutive physical cores instead of consecutive logical cores. When using Intel's OpenMP runtime library (<tt>libiomp5.so</tt>), this can be achieved by:
  
 export KMP_AFFINITY=verbose,granularity=fine,compact,1,0

(See [https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html Intel's documentation] for a detailed explanation of the <tt>KMP_AFFINITY</tt> environment variable and additional examples.)
  
'''N.B.''': As far as we are aware, this level of fine-grained control over the thread placement is only available with Intel's OpenMP runtime. In light of this, and since ''oversubscribing'' (''i.e.'', starting more OpenMP-threads than there are physical cores on a node) seldom brings a (large) performance gain, we recommend disabling hyperthreading altogether.
  
The <code>verbose</code> keyword in the above is optional, but it is a good idea to use it at least once to check whether the resultant thread placement is as intended.
  
In addition to taking care of thread placement, it is often necessary to increase the size of the private stack of the OpenMP-threads (to 256 or even 512 Mbytes), since the default is in many cases too small for VASP to run and will cause segmentation faults:
  
 export OMP_STACKSIZE=512m
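
Independently of <code>OMP_STACKSIZE</code>, it is often advisable to lift the shell's limit on the process stack as well (a common, system-dependent precaution; whether it is needed depends on your setup):

 ulimit -s unlimited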

All of the above may be combined into a single command, as follows:
  
 mpirun -np 8 --map-by socket:PE=4 --bind-to core \
        -x OMP_NUM_THREADS=4 -x OMP_STACKSIZE=512m \
        -x KMP_AFFINITY=verbose,granularity=fine,compact,1,0 \
        <your-vasp-executable>
  
=== Using Intel MPI ===

In case one uses Intel MPI, things are fortunately a bit less involved. Distributing 8 MPI-ranks over 2 nodes with 16 physical cores each (2 sockets per node), allowing for 4 OpenMP-threads per MPI-rank, is as simple as:

 export OMP_NUM_THREADS=4
 export OMP_STACKSIZE=512m
 mpirun -np 8 -ppn 4 <your-vasp-executable>
  
The <code>-ppn</code> option sets the number of MPI-ranks per node, four in this example. Intel MPI's default pinning will distribute the ranks evenly over the different sockets and will keep the OpenMP-threads within each rank/socket's domain.
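
Should the default pinning not be what you want, it can be controlled explicitly through Intel MPI's pinning variables, for instance by defining one pinning domain per group of OpenMP-threads (a suggestion only; consult the Intel MPI documentation for the full set of options):

 export I_MPI_PIN_DOMAIN=omp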

As before, the environment variables may be passed as options to the <code>mpirun</code> command:
  
 mpirun -np 8 -ppn 4 \
        -genv OMP_NUM_THREADS=4 -genv OMP_STACKSIZE=512m \
        <your-vasp-executable>
  
== MPI versus MPI/OpenMP: the main difference ==
  
As you may know, by default VASP distributes work and data over the MPI-ranks on a per-orbital basis (in a round-robin fashion): Bloch orbital 1 resides on rank 1, orbital 2 on rank 2, and so on.
In addition, however, the work and data may be distributed further, in the sense that not a single MPI-rank but a group of MPI-ranks is responsible for the optimisation (and related FFTs) of a particular orbital.
In the pure MPI version of VASP, this is specified by means of the {{TAG|NCORE}} tag.

For instance, to distribute each individual Bloch orbital over 4 MPI-ranks one specifies:

 {{TAG|NCORE}} = 4
  
The main difference between the pure MPI and the hybrid MPI/OpenMP version of VASP is that the latter will not distribute a single Bloch orbital over ''multiple MPI-ranks'' but will instead distribute the work on a single Bloch orbital over ''multiple OpenMP-threads''.
As such, one does not set {{TAG|NCORE}}=4 in the {{FILE|INCAR}} file but starts VASP with 4 OpenMP-threads per MPI-rank.
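
As an illustration (a sketch only, for the 2-node, 32-core example above, with the placement options omitted for brevity), the following two setups distribute the work on each orbital in a comparable way:

 # pure MPI: 32 ranks, each orbital shared by a group of 4 ranks (NCORE = 4 in the INCAR)
 mpirun -np 32 <your-vasp-executable>
 # hybrid MPI/OpenMP: 8 ranks, the work on each orbital shared by 4 OpenMP-threads
 export OMP_NUM_THREADS=4
 mpirun -np 8 <your-vasp-executable>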
  
'''N.B.''': The hybrid MPI/OpenMP version of VASP will internally set {{TAG|NCORE}}=1 (regardless of what was specified in the {{FILE|INCAR}} file) when it detects that it has been started on more than one OpenMP-thread.
  
== Further reading ==

* ''OpenMP in VASP: Threading and SIMD'', F. Wende, M. Marsman, J. Kim, F. Vasilev, Z. Zhao, and T. Steinke, [http://dx.doi.org/10.1002/qua.25851 Int. J. Quantum Chem. 2018;e25851].

== Credits ==

Many thanks to Jeongnim Kim and Fedor Vasilev at Intel, and Florian Wende and Thomas Steinke of the Zuse Institute Berlin (ZIB)!

== Related Tags and Sections ==

[[Installing_VASP.6.X.X|Installing VASP.6.X.X]],
[[Makefile.include.linux_intel_omp | makefile.include.linux_intel_omp]],
[[Makefile.include.linux_gnu_omp | makefile.include.linux_gnu_omp]],
[[Makefile.include.linux_nv_omp | makefile.include.linux_nv_omp]],
[[Makefile.include.linux_pgi_omp | makefile.include.linux_pgi_omp]]

----

[[Category:VASP]][[Category:Installation]]
