OpenACC GPU port of VASP: Difference between revisions

From VASP Wiki
(25 intermediate revisions by one other user not shown)
Line 12: Line 12:
* To compile the OpenACC version of VASP you need either the [https://developer.nvidia.com/hpc-sdk NVIDIA HPC-SDK] or a recent version (>=19.10) of PGI's Compilers & Tools.
* To compile the OpenACC version of VASP you need either the [https://developer.nvidia.com/hpc-sdk NVIDIA HPC-SDK] or a recent version (>=19.10) of PGI's Compilers & Tools.
:In principle any compiler that supports at least OpenACC standard 2.6 should do the trick, but we have tried and tested the aforementioned ones.
:In principle any compiler that supports at least OpenACC standard 2.6 should do the trick, but we have tried and tested the aforementioned ones.
:'''N.B.: When you choose to use the NVIDIA HPC-SDK (which we recommend), then be sure to use version 20.9!'''
:We were informed that version 20.11 has performance issues with some of the programming constructs used by VASP, and in version 21.1 a bug was introduced. All of these issues should be solved with the release of the NVIDIA HPC-SDK 21.2.


<u>''Libraries''</u>
<u>''Libraries''</u>
Line 31: Line 34:


'''N.B.''': Running VASP on other NVIDIA GPUs (e.g. "gaming" hardware) is technically possible but not advisable: these GPUs are not well suited since they do not offer fast double precision floating point arithmetic (FP64) performance and in general have smaller memories without error correction code (ECC) capabilities.
'''N.B.''': Running VASP on other NVIDIA GPUs (e.g. "gaming" hardware) is technically possible but not advisable: these GPUs are not well suited since they do not offer fast double precision floating point arithmetic (FP64) performance and in general have smaller memories without error correction code (ECC) capabilities.
== Features and limitations ==


* Most features of VASP have been ported to GPU using OpenACC, with the notable exception of everything involving the RPA: GW and ACFDT. This is work in progress.
== Building ==


* The use of parallel FFTs of the wave functions ({{TAG|NCORE}}>1) should be avoided for performance reasons. Currently the OpenACC version will automatically switch to {{TAG|NCORE}}=1 even if otherwise specified in the {{FILE|INCAR}} file.
To build the OpenACC port of VASP it is probably best to base your <code>makefile.include</code> file on one of the archetypical templates.


* '''Due to the use of NCCL, the OpenACC version of VASP may only be executed using a single MPI-rank per available GPU:'''
When using the NVIDIA HPC-SDK:
: Using NCCL has large performance benefits in the majority of cases. However, we are aware of the fact that for calculations on small systems it would be useful to retain the ability of having multiple MPI-ranks share a GPU, and plan the make the use of NCCL optional to remove this limitation.
* [[Makefile.include.linux_nv_acc | makefile.include.linux_nv_acc]]
* [[Makefile.include_nv_acc+omp+mkl | makefile.include.linux_nv_acc+omp+mkl]] (with additional OpenMP support)


== Running the OpenACC version ==
or for PGI's Compilers & Tools:
* [[Makefile.include.linux_pgi_acc | makefile.include.linux_pgi_acc]]


* Use a single MPI-rank per GPU (currently the use of NCCL precludes the use of multiple ranks per GPU).
To adapt these to the particulars of your system (if necessary) please read the [[Installing_VASP.6.X.X|instructions on the installation of VASP.6.X.X]].


== Features and limitations ==


* Use OpenMP-threads in addition to MPI-ranks to leverage more of the available CPU power. The OpenACC version is currently limited to the use of 1 MPI-rank/GPU, which means that potentially quite a bit of CPU power remains unused. Since there are still parts of the code that run CPU-side it can be beneficial to allow for the use of multiple OpenMP-threads per MPI-rank:
* Most features of VASP have been ported to GPU using OpenACC, with the notable exception of everything involving the RPA: GW and ACFDT. This is work in progress.


** To build VASP with OpenACC ''and'' OpenMP support look at the following [[makefile.include_nv_acc+omp+mkl|makefile.include]] file. Note: here we use Intel's MKL library for CPU-sided FFTW, BLAS, LAPACK, and ScaLAPACK calls (recommended especially when compiling for Intel CPUs).
* The use of parallel FFTs of the wave functions ({{TAG|NCORE}}>1) should be avoided for performance reasons. Currently the OpenACC version will automatically switch to {{TAG|NCORE}}=1 even if otherwise specified in the {{FILE|INCAR}} file.
** Correct placement and pinning of MPI-ranks and OpenMP-threads onto the CPU cores can be a bit tricky, and depends on the particular flavour of MPI one uses.


* '''Due to the use of NCCL, the OpenACC version of VASP may only be executed using a single MPI-rank per available GPU:'''
: Using NCCL has large performance benefits in the majority of cases. However, we are aware of the fact that for calculations on small systems it would be useful to retain the ability of having multiple MPI-ranks share a GPU, and plan to make the use of NCCL optional to remove this limitation.


* To achieve the best performance it is important to chose {{TAG|KPAR}} and {{TAG|NSIM}} wisely. Unfortunately the ideal values will depend on the particulars of your system, both in the sense of workload as well as hardware, so you will have to experiment with different settings. However, ss a rule of thumb one can say:
== Running the OpenACC version ==


** Set {{TAG|KPAR}} to the number of GPUs (= MPI-ranks) you are going to use. This only makes sense, though, when the number of irreducible '''k'''-points in your calculation is more or less evenly dividable by {{TAG|KPAR}}, otherwise the distribution of the work over the MPI-ranks will be strongly imbalanced. This means your options in choosing this parameter are somewhat limited.  
<ol>
** {{TAG|NSIM}} determines the number of bands that are optimised simultaneously in many of the electronic solvers (e.g RMM-DIIS and blocked-Davidson). As a rule one should choose this parameter larger to get good performance on GPUs than one would for CPU-sided execution.
<li>
: '''N.B.''': For optimal CPU-sided execution of VASP one would normally experiment with different settings for {{TAG|NCORE}} as well. When running on GPUs anything different from {{TAG|NCORE}}=1 will adversely affect performance, and VASP will automatically switch to {{TAG|NCORE}}=1, even if otherwise specified in the {{FILE|INCAR}} file.
Use a single MPI-rank per GPU (currently the use of NCCL precludes the use of multiple ranks per GPU).
</li>
<li>
Use OpenMP-threads in addition to MPI-ranks to leverage more of the available CPU power. The OpenACC version is currently limited to the use of 1 MPI-rank/GPU, which means that potentially quite a bit of CPU power remains unused. Since there are still parts of the code that run CPU-side it can be beneficial to allow for the use of multiple OpenMP-threads per MPI-rank:
* To see how to build VASP with OpenACC- ''and'' OpenMP-support have a look at the [[makefile.include_nv_acc+omp+mkl]] file.
:'''N.B.''': here we link against Intel's MKL library for CPU-sided FFTW, BLAS, LAPACK, and scaLAPACK calls and the Intel OpenMP runtime library (<tt>libiomp5.so</tt>). This is strongly recommended when compiling for Intel CPUs, especially when using multiple threads. To ensure that MKL uses the Intel OpenMP runtime library you need to set an environment variable, either by:
:<pre>export MKL_THREADING_LAYER=INTEL</pre>
:or by adding:
:<pre>-x MKL_THREADING_LAYER=INTEL</pre>
:as an option to your <code>mpirun</code> command.
* Correct [[Hybrid_MPI/OpenMP_parallelization#Using_OpenMPI|placement and pinning of OpenMPI-ranks and OpenMP-threads onto the CPU cores]] can be a bit tricky, and depends on the particular flavour of MPI one uses.
</li>
<li>
To achieve the best performance it is important to chose {{TAG|KPAR}} and {{TAG|NSIM}} wisely. Unfortunately, the ideal values will depend on the particulars of your system, both in the sense of workload as well as hardware, so you will have to experiment with different settings. However, as a rule of thumb one can say:
* Set {{TAG|KPAR}} to the number of GPUs (= MPI-ranks) you are going to use. This only makes sense, though, when the number of irreducible '''k'''-points in your calculation is more or less evenly dividable by {{TAG|KPAR}}, otherwise the distribution of the work over the MPI-ranks will be strongly imbalanced. This means your options in choosing this parameter are somewhat limited.  
* {{TAG|NSIM}} determines the number of bands that are optimised simultaneously in many of the electronic solvers (e.g RMM-DIIS and blocked-Davidson). As a rule one should choose this parameter larger to get good performance on GPUs than one would for CPU-sided execution.
'''N.B.''': For optimal CPU-sided execution of VASP one would normally experiment with different settings for {{TAG|NCORE}} as well. When running on GPUs anything different from {{TAG|NCORE}}=1 will adversely affect performance, and VASP will automatically switch to {{TAG|NCORE}}=1, even if otherwise specified in the {{FILE|INCAR}} file.
</li>
</ol>


== Credits ==
== Credits ==
A special thanks goes out to: Stefan Maintz, Markus Wetzstein, Alexey Romanenko, and Andreas Hehn from NVIDIA for all their help porting VASP to GPU using OpenACC!


== Related Tags and Sections ==
== Related Tags and Sections ==


[[Installing_VASP.6.X.X|Installing VASP.6.X.X]]
[[Installing_VASP.6.X.X|Installing VASP.6.X.X]],
[[Makefile.include.linux_nv_acc | makefile.include.linux_nv_acc]],
[[Makefile.include_nv_acc+omp+mkl | makefile.include.linux_nv_acc+omp+mkl]],
[[Makefile.include.linux_pgi_acc | makefile.include.linux_pgi_acc]],
[[Hybrid_MPI/OpenMP_parallelization|Hybrid MPI/OpenMP parallelization]]


----
----

Revision as of 07:03, 12 February 2021

With VASP.6.2.0 we officially released the OpenACC GPU-port of VASP: official in the sense that we now strongly recommend using this OpenACC version to run VASP on GPU-accelerated systems.

The previous CUDA-C GPU-port of VASP is considered to be deprecated and is no longer actively developed, maintained, or supported. In the near future, the CUDA-C GPU-port of VASP will be dropped completely.

Requirements

Software stack

Compiler

  • To compile the OpenACC version of VASP you need either the NVIDIA HPC-SDK or a recent version (>=19.10) of PGI's Compilers & Tools.
In principle any compiler that supports at least OpenACC standard 2.6 should do the trick, but we have tried and tested the aforementioned ones.
N.B.: If you use the NVIDIA HPC-SDK (which we recommend), be sure to use version 20.9!
We were informed that version 20.11 has performance issues with some of the programming constructs used by VASP, and in version 21.1 a bug was introduced. All of these issues should be solved with the release of the NVIDIA HPC-SDK 21.2.
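
The version advice above can be turned into a small guard for a build script. This is only a sketch: the function name `hpcsdk_advice` is hypothetical, and the version string is passed in by hand (for example, as read from the output of `nvfortran --version`) rather than parsed automatically.

```shell
#!/bin/sh
# Hypothetical helper: classify an NVIDIA HPC-SDK version string according
# to the note above. Only the versions named in the note are recognised.
hpcsdk_advice() {
    case "$1" in
        20.9)  echo "ok: recommended version" ;;
        20.11) echo "avoid: reported performance issues" ;;
        21.1)  echo "avoid: known bug" ;;
        *)     echo "untested: not covered by the note above" ;;
    esac
}

# Example call; replace 20.9 with the version installed on your system.
hpcsdk_advice 20.9
```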

Libraries

  • When compiling with PGI Compilers & Tools: the QD (software emulated quadruple precision arithmetic) and NCCL (>=2.7.8) libraries. (Conveniently, these libraries are part of the NVIDIA HPC-SDK.)
  • An installation of NVIDIA's CUDA Toolkit (>= 10.0): the necessary parts are already bundled into the NVIDIA HPC-SDK and PGI's Compilers & Tools, so there is no need to separately install the CUDA Toolkit if you use either of the latter compiler suites.
  • A CUDA-aware version of MPI: the OpenMPI installations that ship with the NVIDIA HPC-SDK and PGI's Compilers & Tools are CUDA-aware.

Drivers

  • You need a CUDA driver that supports at least CUDA-10.0 (see above).

Hardware

We have only tested the OpenACC GPU-port of VASP with the following NVIDIA GPUs:

  • NVIDIA datacenter GPUs: P100 (Pascal), V100 (Volta), and A100 (Ampere).
  • NVIDIA Quadro GPUs: GP100 (Pascal), and GV100 (Volta).

N.B.: Running VASP on other NVIDIA GPUs (e.g. "gaming" hardware) is technically possible but not advisable: these GPUs are not well suited since they do not offer fast double precision floating point arithmetic (FP64) performance and in general have smaller memories without error correction code (ECC) capabilities.

Building

To build the OpenACC port of VASP it is probably best to base your makefile.include file on one of the archetypical templates.

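When using the NVIDIA HPC-SDK:

  • makefile.include.linux_nv_acc
  • makefile.include.linux_nv_acc+omp+mkl (with additional OpenMP support)

or for PGI's Compilers & Tools:

  • makefile.include.linux_pgi_acc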

To adapt these to the particulars of your system (if necessary) please read the instructions on the installation of VASP.6.X.X.
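
As a rough illustration, the build steps might look as follows. This is a sketch under assumptions, not the official procedure: it assumes that the templates live in an `arch/` subdirectory of the VASP source tree and that `make std` builds the standard executable. The function name `build_vasp_acc` and the `DRY_RUN` switch are hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: copy a makefile.include template and build VASP.
# Assumptions: templates are in ./arch and "make std" is a valid target.
# Set DRY_RUN=0 to execute the commands instead of printing them.
build_vasp_acc() {
    template="${1:-makefile.include.linux_nv_acc}"
    run="echo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi
    # With run="echo" the commands are printed; with run="" they are executed.
    $run cp "arch/$template" makefile.include
    $run make std
}

# Dry run with the default NVIDIA HPC-SDK template.
build_vasp_acc
```

Passing `makefile.include.linux_pgi_acc` as the first argument selects the PGI template instead.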

Features and limitations

  • Most features of VASP have been ported to GPU using OpenACC, with the notable exception of everything involving the RPA: GW and ACFDT. This is work in progress.
  • The use of parallel FFTs of the wave functions (NCORE>1) should be avoided for performance reasons. Currently the OpenACC version will automatically switch to NCORE=1 even if otherwise specified in the INCAR file.
  • Due to the use of NCCL, the OpenACC version of VASP may only be executed using a single MPI-rank per available GPU:
Using NCCL has large performance benefits in the majority of cases. However, we are aware of the fact that for calculations on small systems it would be useful to retain the ability of having multiple MPI-ranks share a GPU, and plan to make the use of NCCL optional to remove this limitation.

Running the OpenACC version

  1. Use a single MPI-rank per GPU (currently the use of NCCL precludes the use of multiple ranks per GPU).
  2. Use OpenMP-threads in addition to MPI-ranks to leverage more of the available CPU power. The OpenACC version is currently limited to 1 MPI-rank per GPU, which means that potentially quite a bit of CPU power remains unused. Since parts of the code still run on the CPU, it can be beneficial to use multiple OpenMP-threads per MPI-rank:
    • To see how to build VASP with OpenACC and OpenMP support, have a look at the makefile.include_nv_acc+omp+mkl file.
    N.B.: here we link against Intel's MKL library for CPU-sided FFTW, BLAS, LAPACK, and ScaLAPACK calls and the Intel OpenMP runtime library (libiomp5.so). This is strongly recommended when compiling for Intel CPUs, especially when using multiple threads. To ensure that MKL uses the Intel OpenMP runtime library you need to set an environment variable, either by:
    export MKL_THREADING_LAYER=INTEL
    or by adding:
    -x MKL_THREADING_LAYER=INTEL
    as an option to your mpirun command.
    • Correct placement and pinning of MPI-ranks and OpenMP-threads onto the CPU cores can be a bit tricky, and depends on the particular flavour of MPI one uses (see Hybrid MPI/OpenMP parallelization).
  3. To achieve the best performance it is important to choose KPAR and NSIM wisely. Unfortunately, the ideal values depend on the particulars of your system, in terms of both workload and hardware, so you will have to experiment with different settings. However, as a rule of thumb one can say:
    • Set KPAR to the number of GPUs (= MPI-ranks) you are going to use. This only makes sense, though, when the number of irreducible k-points in your calculation is more or less evenly divisible by KPAR; otherwise the distribution of the work over the MPI-ranks will be strongly imbalanced. This means your options in choosing this parameter are somewhat limited.
    • NSIM determines the number of bands that are optimised simultaneously in many of the electronic solvers (e.g. RMM-DIIS and blocked-Davidson). As a rule, good performance on GPUs requires a larger value of NSIM than one would choose for CPU-sided execution.
    N.B.: For optimal CPU-sided execution of VASP one would normally experiment with different settings for NCORE as well. When running on GPUs anything different from NCORE=1 will adversely affect performance, and VASP will automatically switch to NCORE=1, even if otherwise specified in the INCAR file.
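
The rules of thumb above can be sketched in a small launch helper. Everything here is illustrative: the function names `pick_kpar` and `launch_line` are hypothetical, the executable name `vasp_std` is assumed, and the `-x` option assumes an OpenMPI-style `mpirun`. `pick_kpar` reads "evenly divisible" strictly and returns the largest value that divides both the number of irreducible k-points and the number of GPUs, so it may be more conservative than necessary. No value for NSIM is suggested, since that has to be found by experiment.

```shell
#!/bin/sh
# pick_kpar NKPTS NGPUS
# Largest KPAR dividing both the number of irreducible k-points and the
# number of GPUs (= MPI-ranks), i.e. their greatest common divisor.
pick_kpar() {
    a=$1
    b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b))
        a=$b
        b=$t
    done
    echo "$a"
}

# launch_line NGPUS NTHREADS
# Print an mpirun command with one MPI-rank per GPU and NTHREADS
# OpenMP-threads per rank (assumes OpenMPI's -x and an executable vasp_std).
launch_line() {
    echo "mpirun -np $1 -x OMP_NUM_THREADS=$2 -x MKL_THREADING_LAYER=INTEL vasp_std"
}

# Example: 6 irreducible k-points on 4 GPUs with 4 threads per rank.
echo "KPAR = $(pick_kpar 6 4)"
launch_line 4 4
```

For 6 k-points on 4 GPUs the helper proposes KPAR = 2 rather than 4, because 6 k-points cannot be spread evenly over 4 groups.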

Credits

A special thanks goes out to: Stefan Maintz, Markus Wetzstein, Alexey Romanenko, and Andreas Hehn from NVIDIA for all their help porting VASP to GPU using OpenACC!

Related Tags and Sections

Installing VASP.6.X.X, makefile.include.linux_nv_acc, makefile.include.linux_nv_acc+omp+mkl, makefile.include.linux_pgi_acc, Hybrid MPI/OpenMP parallelization

