My Community

Posted: **Tue Aug 08, 2023 7:17 am**

Dear VASP developers,

When running VASP with NCCL on one GPU using one OpenMP thread, it works fine and completes a single-point calculation within about one minute. When running the same system with multiple OpenMP threads (still on one GPU), vasp_std gets stuck almost at the beginning, and the last text in OUTCAR is:

First call to EWALD: gamma= 0.156
Maximum number of real-space cells 3x 3x 3
Maximum number of reciprocal cells 3x 3x 3

FEWALD: cpu time 0.0617: real time 0.0043

I have used a makefile.include almost identical to Makefile.include.nvhpc_ompi_mkl_omp_acc, as proposed on the wiki page:
wiki/index.php/OpenACC_GPU_port_of_VASP

I let it run much longer, but there was no progress and no further output in OUTCAR.
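
Editor's note: a hang like this, where OUTCAR simply stops growing, can be detected automatically by checking the file's last modification time. The following is a minimal Python sketch, not part of VASP. The function names and the 600-second threshold are hypothetical choices for illustration; the threshold should be well above the time the calculation normally needs between writes.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stalled(path, threshold_s=600.0, now=None):
    """True if `path` has not been written to for more than `threshold_s` seconds.

    The 600 s default is an arbitrary illustration value, not a VASP setting.
    """
    return seconds_since_last_write(path, now) > threshold_s


# Example (assumes an OUTCAR file in the current directory):
# if looks_stalled("OUTCAR"):
#     print("OUTCAR has not changed recently; the run may be hanging")
```

A silent OUTCAR does not prove a hang, since long steps can legitimately produce no output, so this is only a first indication.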

VASP version: 6.3.2
Compiler: NVHPC 21.11

I would appreciate any help!

Best regards
Ghasemi

Posted: **Tue Aug 08, 2023 11:58 am**

Dear ghasemi,

How many OMP threads did you use in this calculation?
Can you run this calculation with multiple OMP threads but without GPUs?

Posted: **Tue Aug 08, 2023 12:21 pm**

Dear Alexey,

I have tried different numbers of OMP threads, e.g. 2, 4, and 16, where the last is the maximum I can use for a single-GPU allocation. Of course, I used a single MPI process, as required when running with NCCL.
Yes. The same version of VASP with the same input files (INCAR, POSCAR, etc.) has been tested in hybrid MPI+OpenMP mode with various numbers of MPI processes and OpenMP threads. However, that binary was built with the Intel compilers.
I will test it with NVHPC without GPU and post again.
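
Editor's note: the thread-count tests described above (one MPI process, varying OpenMP threads) could be scripted roughly as follows. This is a sketch under assumptions: it presumes the `mpirun` launcher from Open MPI and an executable named `vasp_std` on the PATH, and the helper name `launch_settings` is hypothetical. Clusters that use a different launcher or scheduler would need a different command.

```python
import os


def launch_settings(n_threads, executable="vasp_std"):
    """Build the command and environment for one MPI rank with `n_threads` OpenMP threads.

    Assumes an `mpirun` launcher; the helper itself is illustrative only.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    env = dict(os.environ)                   # copy, so the caller's environment is untouched
    env["OMP_NUM_THREADS"] = str(n_threads)  # standard OpenMP thread-count variable
    cmd = ["mpirun", "-np", "1", executable]  # a single MPI process
    return cmd, env


# Example sweep over the thread counts mentioned in the post:
# import subprocess
# for n in (1, 2, 4, 16):
#     cmd, env = launch_settings(n)
#     subprocess.run(cmd, env=env, check=True)
```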

Posted: **Tue Aug 08, 2023 12:31 pm**

Thank you. It would also be a good idea to test it with a more recent version of NVHPC. NVHPC 21.11 was released in 2021.

Posted: **Wed Aug 09, 2023 7:56 am**

After rebuilding VASP with NVHPC 21.11 without OpenACC, the same system runs fine (though slowly, as expected) without a GPU, with both 1 and 16 threads.
In another test, I used NVHPC 22.5 with OpenACC, and the same system runs fine on the GPU with 1 and 16 OpenMP threads. However, I did not link this binary to HDF5, so the problem may be related either to the compiler or to the HDF5 link.

How much work is left for the CPU when running on a GPU?
How much performance gain can one expect from using more than one OpenMP thread?
The majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs, or compare NCCL with no NCCL.

Posted: **Wed Aug 09, 2023 9:55 am**

> However, I did not link this binary to HDF5, so the problem may be related either to the compiler or to the HDF5 link.

Looks like a compiler issue. I doubt that HDF5 is a problem here, but it can be easily tested.

> How much work is left for the CPU when running on a GPU?
> How much performance gain can one expect from using more than one OpenMP thread?
> The majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs, or compare NCCL with no NCCL.

We usually do our performance tests with NCCL, which can yield a performance gain of 20-30% for our standard electronic minimization calculation thanks to the asynchronous communication. However, NCCL can only handle one MPI rank per GPU, which means that on a multicore CPU only one core is being used. To improve the situation one can use multiple OpenMP threads to increase the utilization of the CPU cores. But one should keep in mind that all the heavy parts of the calculation are ported to GPUs, so the performance gain from using OpenMP threads is usually not very large.
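
Editor's note: the reasoning above can be made concrete with Amdahl's law. If only a fraction of the runtime is CPU-side work that benefits from OpenMP threads, the overall speedup is bounded no matter how many threads are added. The sketch below is an illustration, not a VASP measurement: the CPU fraction of 0.1 is a hypothetical value, perfect scaling of the CPU part is assumed, and both function names are invented for this example.

```python
def threads_per_rank(cpu_cores, gpus):
    """OpenMP threads available per MPI rank when one rank is used per GPU."""
    if gpus < 1 or cpu_cores < gpus:
        raise ValueError("need at least one GPU and at least one core per GPU")
    return cpu_cores // gpus


def amdahl_speedup(cpu_fraction, n_threads):
    """Overall speedup when only `cpu_fraction` of the runtime scales with threads.

    Amdahl's law: S = 1 / ((1 - f) + f / n), assuming ideal scaling of the
    threaded part and no change in the GPU-resident part.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / n_threads)


# Hypothetical case: 16 cores, 1 GPU, 10% of the runtime spent in CPU code.
n = threads_per_rank(16, 1)          # 16 threads for the single rank
print(round(amdahl_speedup(0.1, n), 2))  # prints 1.1
```

With these assumed numbers, 16 threads give only about a 10% overall gain, which is in line with the statement that the benefit from OpenMP threads is usually not very large.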

Posted: **Wed Aug 09, 2023 10:49 am**

Thanks for the reply.

Posted: **Thu Aug 10, 2023 8:42 pm**

> alexey.tal wrote (Wed Aug 09, 2023 9:55 am):
>
> > However, I did not link this binary to HDF5, so the problem may be related either to the compiler or to the HDF5 link.
>
> Looks like a compiler issue. I doubt that HDF5 is a problem here, but it can be easily tested.
>
> > How much work is left for the CPU when running on a GPU?
> > How much performance gain can one expect from using more than one OpenMP thread?
> > The majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs, or compare NCCL with no NCCL.
>
> We usually do our performance tests with NCCL, which can yield a performance gain of 20-30% for our standard electronic minimization calculation thanks to the asynchronous communication. However, NCCL can only handle one MPI rank per GPU, which means that on a multicore CPU only one core is being used. To improve the situation one can use multiple OpenMP threads to increase the utilization of the CPU cores. But one should keep in mind that all the heavy parts of the calculation are ported to GPUs, so the performance gain from using OpenMP threads is usually not very large.

Your reply here is super useful and instructive. Thanks a lot, Alexey.

VASP NCCL + OpenACC + OpenMP
