NCORE

From VASP Wiki
{{TAGDEF|NCORE|[integer]|1}}
 


Description: {{TAG|NCORE}} controls how many MPI ranks collaborate on a single band, parallelizing the [[Energy_cutoff_and_FFT_meshes#FFT_mesh|FFTs]] for that band.
----
{{available|5.2.13}}


VASP distributes the available MPI ranks into band groups that each work on one band, parallelizing the [[Energy_cutoff_and_FFT_meshes#FFT_mesh|FFTs]] for that band. For the common case that {{TAG|IMAGES|1}} and no other algorithm-dependent parallelization (e.g., {{TAG|NOMEGAPAR}}) is active:


:<math>\text{available ranks} = \frac{\text{total MPI ranks}}{\text{KPAR}}</math>


{{TAG|NCORE}} sets the size of each band group. The number of bands treated in parallel is then:


{{TAG|NPAR}} = <math>\text{available ranks}</math> / {{TAG|NCORE}}


This makes {{TAG|NCORE}} and {{TAG|NPAR}} strict inverses of one another for a given number of available ranks. '''Do not set {{TAG|NCORE}} and {{TAG|NPAR}} at the same time.''' If both are present in the {{FILE|INCAR}}, {{TAG|NPAR}} takes precedence for legacy reasons. Nevertheless, we strongly encourage using {{TAG|NCORE}} instead of {{TAG|NPAR}} for all modern VASP versions.
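To make the bookkeeping concrete, here is a minimal sketch (the rank counts and the {{TAG|KPAR}} value are hypothetical, chosen only for illustration):

```python
# Hypothetical job: 128 MPI ranks in total, split into KPAR = 2 k-point groups.
total_ranks = 128
kpar = 2

# Ranks available for band parallelization within each k-point group.
available_ranks = total_ranks // kpar      # 64

# NCORE ranks share the FFTs of one band; NPAR band groups run in parallel.
ncore = 8
npar = available_ranks // ncore            # 8

# NCORE and NPAR are strict inverses for a fixed number of available ranks.
assert ncore * npar == available_ranks
print(f"available ranks = {available_ranks}, NPAR = {npar}")
```

The same 64 available ranks could equally be split as {{TAG|NCORE|4}} with {{TAG|NPAR}}=16; which split is faster depends on the system size and the network.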


== Common settings ==


* {{TAG|NCORE}} = <math>\sim \sqrt{\text{available ranks}}</math>: each band group of NCORE ranks parallelizes its FFTs internally. This reduces both memory requirements (projector functions are shared within the group) and the cost of orthogonalization. '''This is the recommended regime for modern multi-core machines.'''


* {{TAG|NCORE|1}} (default): each band is handled by a single rank. The maximum number of bands is treated in parallel, but the non-local projector functions must be stored in full on every rank, leading to high memory usage. In addition, orthogonalizing the bands requires heavy all-to-all communication between all ranks. This setting is appropriate for small unit cells or machines with limited communication bandwidth.


* {{TAG|NCORE}} = <math>\text{available ranks}</math>: all ranks collaborate on a single band, distributing only the plane-wave coefficients. No band parallelization occurs. This is almost always very slow and should be avoided.
{{NB|warning|When running with [[Combining MPI and OpenMP|OpenMP threading]] or any of the [[GPU ports of VASP|GPU-offloaded code paths]], {{TAG|NCORE}} is automatically reset to {{TAG|NCORE|1}}, regardless of the value set in the {{FILE|INCAR}} file. FFT parallelization is then handled by OpenMP threads or the GPU. Use the number of OpenMP threads per rank for fine-grained control over FFT parallelization in those cases.}}
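For example, a hybrid MPI/OpenMP run might be launched as follows (a sketch assuming Open MPI and a bash shell; the rank and thread counts are hypothetical):
<pre>
# 4 MPI ranks per node, 8 OpenMP threads per rank; NCORE in the INCAR
# is ignored (reset to 1) and the threads parallelize the FFTs instead.
export OMP_NUM_THREADS=8
mpirun -np 16 --map-by ppr:4:node:pe=8 vasp_std
</pre>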


== Recommendations ==
{{NB|tip|Consult the [[optimizing the parallelization]] page for a step-by-step guide on how to optimize your parallelization. The examples below are only rough guidelines. For any non-trivial production run, perform a short benchmarking scan over a few values of {{TAG|NCORE}} before committing to one setting.}}
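Such a scan can be as simple as the following shell sketch (hypothetical file names and rank count; it assumes {{TAG|NCORE}} already appears at the start of a line in the {{FILE|INCAR}} and that each run is kept short, e.g. by reducing {{TAG|NELM}}):
<pre>
# Try a few NCORE values and compare the LOOP+ timing printed in OUTCAR.
for nc in 2 4 8 16 ; do
    sed -i "s/^NCORE.*/NCORE = ${nc}/" INCAR
    mpirun -np 64 vasp_std > stdout_ncore_${nc}
    grep "LOOP+" OUTCAR
done
</pre>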


General guidelines:


* '''Small systems or slow networks''' (up to ~8 cores, Gbit Ethernet interconnect): use {{TAG|NCORE|1}}. The limited number of ranks and slow interconnect make FFT parallelization within band groups inefficient.


* '''Modern multi-core clusters''' (Infiniband or equivalent fast interconnect): set {{TAG|NCORE}} to a value between 2 and the number of cores per node. {{TAG|NCORE}} should be a factor of the number of cores per node to ensure that all intra-group FFT communication stays within a single node. As a rule of thumb, {{TAG|NCORE|4}} works well for ~100 atoms; {{TAG|NCORE|12}}–{{TAG|NCORE|16}} is often better for unit cells with more than 400 atoms.


* '''NUMA-aware setting''': setting {{TAG|NCORE}} equal to the number of cores per [[Optimizing_the_parallelization#Understanding_the_hardware|NUMA domain]] is often a particularly good choice, because all intra-group FFT communication then stays within the fastest memory domain of the hardware. Use <code>lstopo</code> or <code>numactl --hardware</code> to determine the NUMA layout of your machine.
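The guidelines above can be condensed into a rough heuristic. The following sketch is not part of VASP and the function name is invented; it simply picks, among the divisors of the core count per node, the one closest to <math>\sqrt{\text{available ranks}}</math>:

```python
import math

def suggest_ncore(available_ranks, cores_per_node):
    """Rough heuristic, not a VASP utility: choose the divisor of
    cores_per_node that lies closest to sqrt(available_ranks), so that
    each band group's FFT communication stays within one node."""
    target = math.sqrt(available_ranks)
    divisors = [d for d in range(1, cores_per_node + 1)
                if cores_per_node % d == 0]
    return min(divisors, key=lambda d: abs(d - target))

print(suggest_ncore(64, 16))    # sqrt(64) = 8, a divisor of 16 -> 8
print(suggest_ncore(256, 24))   # sqrt(256) = 16 -> closest divisor of 24 is 12
```

Treat the result only as a starting point for a benchmarking scan, not as a final setting.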


== Related tags and articles ==
 
{{TAG|NPAR}},
{{TAG|KPAR}},
{{TAG|LPLANE}},
{{TAG|LSCALU}},
{{TAG|NSIM}},
{{TAG|LSCALAPACK}},
{{TAG|LSCAAWARE}},


[[GPU ports of VASP]],
[[Combining MPI and OpenMP]],
 
[[Optimizing the parallelization]],
[[Parallelization]],
[[Energy cutoff and FFT meshes]]
 
{{sc|NCORE|HowTo|Workflows that use this tag}}


[[Category:INCAR tag]][[Category:Performance]][[Category:Parallelization]]

Latest revision as of 09:37, 18 March 2026
