GPU Job Fails Unless INCAR is Symlinked from NFS Instead of Lustre

Posted: Thu Jul 17, 2025 7:30 am
by Zhiyuan Yin

Hi,

I have run into a reproducible issue running VASP 6.4.2 on an HPC system with NVIDIA V100 32GB SXM2 GPUs and Lustre-backed project directories. The job setup is standard and uses the gamma-only build of VASP compiled with the NVHPC toolkit.
When INCAR is located directly on the Lustre filesystem (i.e., inside the job’s working directory), VASP fails with a CUDA out-of-memory error at the very start of the run (in the log below it appears immediately after "entering main loop"):

running 2 mpi-ranks, with 1 threads/rank, on 1 nodes
distrk: each k-point on 2 cores, 1 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 2 GPUs detected
vasp.6.4.2 20Jul23 (build Nov 18 2024 12:20:25) gamma-only
POSCAR found type information on POSCAR Ag
POSCAR found : 1 types and 577 ions
scaLAPACK will be used selectively (only on CPU)
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
Out of memory allocating 4040409600 bytes of device memory
Failing in Thread:1
Out of memory allocating 4040409600 bytes of device memory
total/free CUDA memory: 34079637504/2775973888
Failing in Thread:1
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 7.0, threadid=1
total/free CUDA memory: 34079637504/1972764672
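
For scale, the byte counts in the log above can be converted to GiB. The following is only a minimal sketch: the three numbers are copied from the failed allocation and the first "total/free CUDA memory" line, and the variable names are illustrative, not anything VASP reports.

```python
# Byte counts copied verbatim from the log above.
GIB = 1024 ** 3

requested = 4040409600   # "Out of memory allocating ... bytes of device memory"
total = 34079637504      # first "total/free CUDA memory" line, total
free = 2775973888        # first "total/free CUDA memory" line, free

in_use = total - free          # device memory already occupied at the failure
shortfall = requested - free   # how far the request exceeds the free memory

for label, value in [
    ("requested", requested),
    ("free", free),
    ("in use", in_use),
    ("shortfall", shortfall),
]:
    print(f"{label:>10}: {value / GIB:6.2f} GiB")
```

So the failing request is larger than the free device memory reported at that moment, with most of the card already occupied when the allocation is attempted.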

However, I found that simply moving the INCAR file to the NFS-backed $HOME directory and symlinking it back into the Lustre job directory fully resolves the issue. I do not understand why this works.
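
For anyone who wants to repeat the workaround, here is a minimal sketch of the move-and-symlink step in Python. The function name relocate_and_symlink and the example paths are hypothetical, chosen only for illustration; the sketch just moves the file and leaves a symlink in its place, and it does not explain why that changes VASP's behaviour.

```python
import os
import shutil
from pathlib import Path


def relocate_and_symlink(job_dir, external_dir, name="INCAR"):
    """Move job_dir/name into external_dir and leave a symlink behind.

    job_dir      -- the job's working directory (on Lustre in this report)
    external_dir -- a directory on the other filesystem (NFS $HOME here)
    """
    src = Path(job_dir) / name
    dst = Path(external_dir) / name

    if src.is_symlink():
        return src  # already a symlink, nothing to do

    Path(external_dir).mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))   # copies across filesystems if needed
    os.symlink(dst.resolve(), src)    # job_dir/INCAR -> external_dir/INCAR
    return src
```

A call would look like relocate_and_symlink("/path/to/job", Path.home() / "vasp_inputs"), where both paths are placeholders for the actual Lustre job directory and a directory under $HOME.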


Re: GPU Job Fails Unless INCAR is Symlinked from NFS Instead of Lustre

Posted: Thu Jul 17, 2025 1:18 pm
by christopher_sheldon1

Hi Zhiyuan,

Thank you for reporting this. That is unusual behaviour. Could you upload your INCAR, POSCAR, KPOINTS, OUTCAR, and log files so that I can try to reproduce it?

Best wishes,

Chris


Re: GPU Job Fails Unless INCAR is Symlinked from NFS Instead of Lustre

Posted: Thu Jul 17, 2025 6:39 pm
by Zhiyuan Yin

Dear Christopher,

Here are my input and log files for the failed and successful runs.

Please see the attachments.