Hi,
I encountered a reproducible issue running VASP 6.4.2 on an HPC system with NVIDIA V100 32GB SXM2 GPUs and Lustre-backed project directories. The job setup is standard: the gamma-only build of VASP, compiled with the NVHPC toolkit.
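For context, the job is launched in the usual way, roughly like this (module name and launcher are placeholders, not my exact script; the 2 ranks match the log below):

    module load nvhpc
    mpirun -np 2 vasp_gam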
When INCAR is located directly on the Lustre filesystem (i.e., inside the job's working directory), VASP fails with a CUDA out-of-memory error immediately after entering the main loop, at the start of the first SCF step:
running 2 mpi-ranks, with 1 threads/rank, on 1 nodes
distrk: each k-point on 2 cores, 1 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 2 GPUs detected
vasp.6.4.2 20Jul23 (build Nov 18 2024 12:20:25) gamma-only
POSCAR found type information on POSCAR Ag
POSCAR found : 1 types and 577 ions
scaLAPACK will be used selectively (only on CPU)
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
Out of memory allocating 4040409600 bytes of device memory
Failing in Thread:1
Out of memory allocating 4040409600 bytes of device memory
total/free CUDA memory: 34079637504/2775973888
Failing in Thread:1
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 7.0, threadid=1
total/free CUDA memory: 34079637504/1972764672
However, I found that simply moving the INCAR file to the user's NFS-backed $HOME directory and symlinking it back into the Lustre job directory fully resolves the issue. I'm having trouble understanding why this works.
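For reference, the workaround amounts to something like the following, where $JOBDIR stands in for the Lustre-backed working directory (the path is a placeholder, not my actual directory):

    mv $JOBDIR/INCAR $HOME/INCAR
    ln -s $HOME/INCAR $JOBDIR/INCAR

With the symlink in place, the same job runs to completion without the device out-of-memory error.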