GPU Job Fails Unless INCAR is Symlinked from NFS Instead of Lustre

Zhiyuan Yin
Newbie
Posts: 2
Joined: Wed May 28, 2025 4:02 am

#1 Post by Zhiyuan Yin » Thu Jul 17, 2025 7:30 am

Hi,

I have run into a reproducible issue with VASP 6.4.2 on an HPC system with NVIDIA V100 32GB SXM2 GPUs and Lustre-backed project directories. The job setup is standard: the gamma-only build of VASP, compiled with the NVHPC toolkit.
When INCAR sits directly on the Lustre filesystem (i.e., inside the job's working directory), VASP fails with a CUDA out-of-memory error at the very start of the run, right after the log reports "entering main loop":

running 2 mpi-ranks, with 1 threads/rank, on 1 nodes
distrk: each k-point on 2 cores, 1 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 2 GPUs detected
vasp.6.4.2 20Jul23 (build Nov 18 2024 12:20:25) gamma-only
POSCAR found type information on POSCAR Ag
POSCAR found : 1 types and 577 ions
scaLAPACK will be used selectively (only on CPU)
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
Out of memory allocating 4040409600 bytes of device memory
Failing in Thread:1
Out of memory allocating 4040409600 bytes of device memory
total/free CUDA memory: 34079637504/2775973888
Failing in Thread:1
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 7.0, threadid=1
total/free CUDA memory: 34079637504/1972764672
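
To put the byte counts in the log into more readable units, here is a small Python sketch that converts them to GiB and computes how far each allocation falls short. The helper name oom_summary is made up for illustration, and pairing the i-th "allocating" line with the i-th "total/free" line is an assumption, since the output of the two MPI ranks is interleaved.

```python
import re

GIB = 1024 ** 3  # bytes per GiB

# Lines copied from the log above (output of the two MPI ranks is interleaved).
LOG = """\
Out of memory allocating 4040409600 bytes of device memory
Out of memory allocating 4040409600 bytes of device memory
total/free CUDA memory: 34079637504/2775973888
total/free CUDA memory: 34079637504/1972764672
"""


def oom_summary(log):
    """Return one dict per 'total/free' line, with all values in bytes."""
    requests = [int(n) for n in re.findall(r"allocating (\d+) bytes", log)]
    reports = re.findall(r"total/free CUDA memory: (\d+)/(\d+)", log)
    rows = []
    # Assumption: the i-th request belongs with the i-th total/free report.
    for request, (total, free) in zip(requests, reports):
        total, free = int(total), int(free)
        rows.append({
            "request": request,
            "total": total,
            "free": free,
            "not_free": total - free,
            "shortfall": request - free,
        })
    return rows


for row in oom_summary(LOG):
    print(", ".join(f"{key}={value / GIB:.2f} GiB" for key, value in row.items()))
```

On these numbers the failed request is about 3.76 GiB, while only about 2.59 GiB and 1.84 GiB were free on the two devices, so roughly 29 to 30 GiB of each card's 31.74 GiB was already not free when the allocation was attempted.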

However, moving the INCAR file to my NFS-backed $HOME directory and symlinking it back into the Lustre job directory fully resolves the issue. I do not understand why this works.
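
For anyone who wants to repeat the workaround, a minimal Python sketch of the move-and-symlink step is given here. The function name relocate_and_symlink and the example paths are hypothetical; the only steps taken from the description above are moving INCAR to an NFS-backed location and linking it back into the job directory.

```python
import shutil
from pathlib import Path


def relocate_and_symlink(job_dir, nfs_dir, name="INCAR"):
    """Move job_dir/name into nfs_dir and leave a symlink in its place.

    Returns the new (absolute) location of the file.
    """
    src = Path(job_dir) / name
    nfs_dir = Path(nfs_dir)
    if src.is_symlink():
        raise ValueError(f"{src} is already a symlink")
    if not src.is_file():
        raise FileNotFoundError(src)

    nfs_dir.mkdir(parents=True, exist_ok=True)
    dst = nfs_dir.resolve() / name
    if dst.exists():
        raise FileExistsError(dst)

    # shutil.move copies and then deletes when src and dst are on
    # different filesystems, which is the case for Lustre -> NFS.
    shutil.move(str(src), str(dst))
    # Recreate the original path as a symlink pointing at the moved file.
    src.symlink_to(dst)
    return dst


# Example call (hypothetical paths, adjust to your system):
# relocate_and_symlink("/lustre/project/run01", Path.home() / "vasp_inputs")
```

After the call, VASP still opens INCAR in the job directory, but the file contents are read from the NFS location through the link.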


christopher_sheldon1
Global Moderator
Posts: 99
Joined: Mon Mar 25, 2024 1:36 pm

Re: GPU Job Fails Unless INCAR is Symlinked from NFS Instead of Lustre

#2 Post by christopher_sheldon1 » Thu Jul 17, 2025 1:18 pm

Hi Zhiyuan,

Thank you for reporting this. That is unusual behaviour. Could you upload your INCAR, POSCAR, KPOINTS, OUTCAR, and log files so that I can try to reproduce it?

Best wishes,

Chris

