Hybrid openMPI/openMP parallelization

Why hybrid parallelization

Until now, VASP performs all its parallel tasks with Message Passing Interface (MPI) routines. This originates from the time when each CPU had only one single core and all compute nodes (with one CPU each) were interconnected by a local network. When a job is started in parallel on e.g. 32 cores, 32 VASP processes are created on 32 machines. Each process has to store a certain amount of data, identical on all nodes, to be able to do its part of the calculation. Today, in contrast, a modern CPU has at least 4 cores, so the 32 VASP processes are started on just 8 CPUs. Each process then keeps the same identical data in memory, i.e. the data is stored 4 times on one physical machine. Furthermore, the communication between the 4 cores on one CPU is still done with openMPI, and the communication to the 28 cores on the other physical machines goes through only one infiniband interface.

A hybrid parallelization resolves, in principle, all of these drawbacks. Coming back to the example of a 32-core job on eight 4-core CPUs: the idea is to distribute the calculation with openMPI across the different physical machines, so that only 8 VASP openMPI processes are started, and to parallelize within each physical CPU with openMP, so that the memory of each VASP openMPI process is shared among the 4 openMP threads it creates.

In this way the openMPI communication involves only 8 processes instead of 32, and no duplicated data is kept in memory on the individual CPUs.

Compilation

Intel MKL openMP support

openMP support is natively integrated in the Intel MKL numerical library. Even if we don't add openMP support explicitly to VASP, we can activate the openMP support in the Intel MKL by setting the OMP_NUM_THREADS environment variable (for further information see Execution). All calls to LAPACK, BLAS or FFT routines are then parallelized using openMP. This kind of hybrid parallelization of VASP is in general not as fast as the plain openMPI version.
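
For instance, a VASP binary compiled without explicit openMP support can still use the threaded MKL routines. A minimal sketch (the process and thread counts are placeholders and have to be adapted to your machine):

 # sketch: 8 openMPI processes, each allowed to use up to 4 MKL threads
 export OMP_NUM_THREADS=4
 mpirun -np 8 -x OMP_NUM_THREADS vasp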

Explicit openMP compilation

The hybrid parallelization is still under heavy development. Therefore, only some of the most time-consuming routines are parallelized with openMP.

To enable support for openMP statements in the VASP source files, add the following flag to the compiler command in the makefile:

FC=mpif90 -openmp 

and recompile all source files using

make clean
make vasp

To get additional compiler output concerning openMP, one can increase the output level by setting

FC=mpif90 -openmp -openmp-report2
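
A sketch of the relevant makefile lines is given below; it assumes that, as in the makefiles shipped with VASP, the link step is defined via FCL and therefore inherits the openMP flag. The exact variable names and additional flags depend on your makefile.

 # sketch: openMP enabled for compiling and (via FCL) linking
 FC=mpif90 -openmp -openmp-report2
 FCL=$(FC)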

Execution

The execution statement depends heavily on your system! Our reference system consists of compute nodes with 2 CPUs per node and 4 cores per CPU. The example job scripts given here are for a job occupying a total of 64 cores, i.e. 16 CPUs on 8 physical nodes.

On our clusters

1 openMPI process per node

 #$ -N test
 #$ -q narwal.q
 #$ -pe orte* 64

 mpirun -bynode -np 8 -x OMP_NUM_THREADS=8 vasp

The MPI option -bynode ensures that the VASP processes are started in a round-robin fashion over the nodes, so each of the 8 physical nodes gets exactly 1 running VASP process. If this option is omitted, openMPI fills up the nodes slot by slot, so the first nodes would each be packed with several VASP processes while the remaining nodes stay unoccupied.
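
To check the placement before a production run, the same mpirun options can be tested with a trivial command; this is only a sketch, with hostname standing in for vasp:

 # sketch: print the host each of the 8 openMPI processes is started on
 mpirun -bynode -np 8 hostname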

1 openMPI process per socket (workaround)

On server machines like narwal we have 2 CPU sockets per node. When using openMP on such machines with 8 threads, memory access has to be done over the slower crossbar switch. Therefore it is advisable to start one openMPI process per socket and use only 4 openMP threads.

 #$ -N test
 #$ -q narwal.q
 #$ -pe orte* 64

 mpirun -bynode -cpus-per-proc 4 -np 16 -x OMP_NUM_THREADS=4 vasp

Although a -bysocket flag exists, it was not possible to distribute the openMPI processes correctly with it. We therefore use this workaround, where each openMPI process is assigned 4 cores via -cpus-per-proc 4.
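
To verify that the workaround pins the processes as intended, the --report-bindings option of mpirun can be added (assuming a sufficiently recent openMPI version); this is only a diagnostic sketch, the production command stays as above:

 # sketch: let openMPI report how the 16 processes are bound to 4 cores each
 mpirun -bynode -cpus-per-proc 4 -np 16 --report-bindings -x OMP_NUM_THREADS=4 vasp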


On the VSC2

On the VSC2 everything is different:

export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
export OMP_NUM_THREADS=4
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=0
export I_MPI_CPUINFO=proc
#export I_MPI_PIN_PROCESSOR_LIST=0-15
export I_MPI_JOB_FAST_STARTUP=0
#export I_MPI_HYDRA_BRANCH_COUNT=130

#$ -v PATH
#$ -v LD_LIBRARY_PATH

#$ -N vasp-mp
#$ -pe mpich4 16
#$ -m be

cp  $TMPDIR/machines machines
mpirun -machinefile $TMPDIR/machines -np 16 ~/vasp.5.1/vasp

The important line is #$ -pe mpich4 16. Each physical node on the VSC2 has 16 cores, and mpich4 means that only 4 of the 16 available cores per physical node are used as openMPI slots. The 16 requested slots are therefore spread over 4 physical nodes, and with 4 openMP threads per process the job occupies 64 cores in total. In addition, mpich1, mpich2 and mpich8 are also supported.
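
As an untested sketch, the analogous setup with 8 openMPI processes per node and 2 openMP threads each would combine the mpich8 environment with a matching thread count; 32 slots then again correspond to 4 physical nodes, i.e. 64 cores:

 #$ -pe mpich8 32
 export OMP_NUM_THREADS=2
 mpirun -machinefile $TMPDIR/machines -np 32 ~/vasp.5.1/vasp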