
VASP on multiple nodes each with multiple GPUs

Posted: Fri Sep 22, 2023 1:02 pm
by ghasemi
Hi,

Does anyone have a SLURM job script to run VASP with GPUs on multiple nodes?

With my current script, which has no particular settings related to CUDA, NVHPC, OpenACC, or NCCL, I get good scaling for VASP from 1 up to 8 GPUs, but only within one node. Running on two nodes (i.e. 16 GPUs) is slower than running on one node (8 GPUs). However, NVIDIA's page shows VASP GPU benchmarks scaling up to many nodes for a system of about 700 atoms. My system has about 500 atoms, so I would expect a speedup up to at least a few nodes.
The HPC cluster has InfiniBand.

I have also compared running on two GPUs in two ways: (i) both GPUs on one node, and (ii) two nodes with one GPU each. The latter is about 20% slower.

I wonder whether I need particular settings to run on more than one node.

Thank you in advance!
Alireza

Re: VASP on multiple nodes each with multiple GPUs

Posted: Mon Nov 20, 2023 3:21 pm
by alexey.tal
Dear Alireza,
ghasemi wrote: "Does anyone have a SLURM job script to run VASP with GPUs on multiple nodes?"
When setting up a SLURM job to run your calculation on GPUs, it is important to set the number of tasks per node equal to the number of GPUs per node. That way you can benefit from the asynchronous communication enabled by the NCCL library.
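As a minimal sketch, a job script along the following lines should illustrate the idea for two nodes with 8 GPUs each. The partition name, module names, CPU count, and walltime are placeholders that depend on your cluster and on how your VASP executable was built, so please adapt them accordingly:

    #!/bin/bash
    #SBATCH --job-name=vasp-gpu
    #SBATCH --nodes=2                 # number of nodes
    #SBATCH --ntasks-per-node=8       # one MPI rank per GPU
    #SBATCH --gres=gpu:8              # 8 GPUs per node
    #SBATCH --cpus-per-task=8         # placeholder: adjust to your CPU cores per node
    #SBATCH --time=04:00:00           # placeholder walltime
    #SBATCH --partition=gpu           # placeholder: cluster-specific partition name

    # Placeholder modules: load the same NVHPC/CUDA/MPI stack that was used to build VASP
    module load nvhpc cuda openmpi

    # Launch one MPI rank per GPU across both nodes
    srun vasp_std

Depending on your MPI installation, you may need to launch with mpirun instead of srun, but the key point is that the total number of MPI ranks matches the total number of GPUs.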

To better understand what types of calculations and tests you have done, I would need more information. Could you please provide the input and output files for your calculations (see the forum guidelines)? It would also be helpful if you could attach your makefile.include, so that I can see which toolchain and libraries you are using.