Parallelization

From VASP Wiki
For many complex problems, a single core is not enough to finish the calculation in a reasonable time. VASP makes use of parallel machines by splitting the calculation into many tasks that communicate with each other using MPI. By default, VASP distributes the number of bands (NBANDS) over the available MPI ranks. However, it is often beneficial to add parallelization of the FFTs (NCORE), parallelization over k points (KPAR), and parallelization over separate calculations (IMAGES). All these tags default to 1 and divide the number of MPI ranks among the parallelization options. There are also additional parallelization options for some algorithms in VASP.

::<math>
\text{total ranks} = \text{ranks parallelizing bands} \times \text{NCORE} \times \text{KPAR} \times \text{IMAGES} \times \text{other algorithm-dependent tags}
</math>
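As an illustration of how the MPI ranks are divided, consider the following INCAR sketch for a hypothetical run on 128 MPI ranks; the tag values are examples, not recommendations.

<pre>
! INCAR sketch: parallelization tags for a hypothetical run on 128 MPI ranks
KPAR  = 4    ! 4 k-point groups, 128 / 4 = 32 ranks per group
NCORE = 8    ! 8 ranks share the FFTs of each band
! IMAGES is left at its default of 1, so
! 128 / (KPAR x NCORE) = 4 bands are optimized simultaneously in each k-point group
</pre>

Such a run would be launched with, e.g., mpirun -np 128 vasp_std; the launcher and executable name depend on your installation.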

In addition to the parallelization using MPI, VASP can make use of OpenMP threading and/or OpenACC (for the GPU port). Note that running on multiple OpenMP threads and/or GPUs switches off the NCORE parallelization.

==Optimizing the parallelization==

Tip: We offer only general advice here; the performance for specific systems may differ significantly. However, in many cases one is interested in a series of similar calculations. Then, run a few of these cases with varying parallel setups and use the optimal choice of parameters for the rest.

When optimizing the parallel setup, try to stay as close as possible to the actual production calculation. This includes both the physical system (atoms, cell size, cutoff, ...) and the computational hardware (CPUs, interconnect, number of nodes, ...). If too many parameters differ, the parallel configuration may not be transferable to the production calculation. Nevertheless, a few steps of a repetitive task give a good idea of the optimal choice for the full calculation, for example, running only a few electronic or ionic self-consistency steps instead of converging the calculation.
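A benchmark run can be truncated after a few steps instead of being fully converged; a minimal INCAR sketch, assuming all system-specific tags are already set elsewhere:

<pre>
! benchmark sketch: stop after a few steps instead of converging
NELM = 5    ! at most 5 electronic self-consistency steps per ionic step
NSW  = 3    ! at most 3 ionic steps
</pre>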

Often, combining multiple parallelization options yields the fastest results because the parallel efficiency of each level drops near its limit. For the default option (band parallelization), the limit is NBANDS divided by a small integer. Note that VASP will increase NBANDS to match the number of ranks. Choose NCORE as a factor of the cores per node to avoid communication between nodes for the FFTs. Recall that OpenMP and OpenACC ignore any NCORE setting. The k-point parallelization is efficient but requires additional memory. Given sufficient memory, increase KPAR up to the number of irreducible k points, keeping in mind that KPAR should divide the number of k points evenly. Finally, IMAGES splits the MPI ranks over several separate VASP calculations; its limit is dictated by the number of desired calculations.
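As a sketch of how these rules combine, assume (hypothetically) two nodes with 64 cores each and a calculation with 12 irreducible k points:

<pre>
! INCAR sketch for 2 x 64 = 128 MPI ranks (hypothetical hardware and k-point count)
KPAR  = 4     ! 4 k-point groups of 32 ranks each; 12 k points divide evenly over 4 groups
NCORE = 16    ! 16 is a factor of the 64 cores per node, so each FFT stays on one node
! 128 / (KPAR x NCORE) = 2 bands are optimized simultaneously in each k-point group
</pre>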

==Caveat about the MPI setup==

The MPI setup determines the placement of the ranks onto the nodes. VASP assumes that the ranks first fill up a node before the next node is occupied. As an example, when running with 8 ranks on two nodes, VASP expects ranks 1–4 on node 1 and ranks 5–8 on node 2. If the ranks are placed differently, communication between the nodes occurs for every parallel FFT. Because FFTs are essential to VASP's speed, this inhibits the performance of the calculation. A typical manifestation is an increase in computing time when the number of nodes is increased from 1 to 2. If NCORE is not used, this issue is less severe but will still reduce the performance.

To address this issue, check the setup of the MPI library and the submitted job script. It is usually possible to override the placement by setting environment variables or command-line arguments. When in doubt, contact the HPC administrators of your machine to investigate the behavior.
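As a sketch, the following commands request a block-wise placement (fill one node, then the next); the exact flags depend on the MPI library, its version, and the scheduler on your machine:

<pre>
# Open MPI: place 64 ranks per node before moving to the next one
mpirun -np 128 --map-by ppr:64:node --bind-to core vasp_std

# Slurm: request a block distribution of the ranks across nodes
srun -n 128 --distribution=block:block vasp_std
</pre>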

==Additional parallelization options==

; KPAR: For Laplace-transformed MP2, this tag has a different meaning.
; NCORE_IN_IMAGE1: Defines how many ranks work on the first image in the thermodynamic coupling-constant integration (VCAIMAGES).
; NOMEGAPAR: Parallelizes over imaginary frequency points in GW and RPA calculations.
; NTAUPAR: Parallelizes over imaginary time points in GW and RPA calculations (see the sketch after this list).
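For a GW or RPA calculation, these tags can be combined with KPAR; a minimal INCAR sketch with hypothetical values, assuming the GW/RPA-specific tags (algorithm choice, cutoffs, ...) are set elsewhere:

<pre>
! sketch: additional parallelization tags for a GW/RPA run (hypothetical values)
KPAR      = 2   ! parallelization over k points
NOMEGAPAR = 4   ! 4 groups of imaginary frequency points
NTAUPAR   = 4   ! 4 groups of imaginary time points
! how these factors map onto the total number of ranks is algorithm-dependent;
! check the wiki pages of the individual tags for the constraints
</pre>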

==OpenMP/OpenACC==

Both OpenMP and OpenACC parallelize the FFTs and therefore disregard any conflicting specification of NCORE. When combining these methods, OpenACC takes precedence, but any code not ported to OpenACC benefits from the additional OpenMP threads. This approach is relevant because the recommended NVIDIA Collective Communications Library requires a single MPI rank per GPU.
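As a sketch of such a hybrid launch on a hypothetical node with 4 GPUs and 32 CPU cores (one MPI rank per GPU, several OpenMP threads per rank; the binding options and launcher depend on your MPI library and scheduler):

<pre>
# hypothetical node with 4 GPUs and 32 CPU cores: one MPI rank per GPU
export OMP_NUM_THREADS=8    # 8 OpenMP threads per MPI rank
export OMP_PLACES=cores     # standard OpenMP thread-binding controls
export OMP_PROC_BIND=close
mpirun -np 4 vasp_std
</pre>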