vasp 6.5.1 installation issue using NVIDIA SDK 25.3

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


Moderators: Global Moderator, Moderator

jaewook_kim
Newbie
Posts: 2
Joined: Thu Apr 17, 2025 10:44 am

vasp 6.5.1 installation issue using NVIDIA SDK 25.3

#1 Post by jaewook_kim » Wed May 14, 2025 1:40 pm

After the installation, I executed "make test":

make -C testsuite test
make[1]: Entering directory '/home/jaewook/VASP/vasp.6.5.1/testsuite'
if [ -f tools/compare_numbertable_new ] ; then \
rm tools/compare_numbertable_new ; fi
if [ -f tools/m_strings.mod ] ; then \
rm tools/m_strings.mod ; fi
cd tools ; mpif90 -acc -gpu=cc80,cc86,cuda12.8 -mp -gpu=tripcount:host -o compare_numbertable_new compare_numbertable_new.f90
./runtest --fast 2>&1 | tee testsuite.log
==================================================================
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
VASP TESTSUITE SHA:

Reference files have been generated with 4 MPI ranks.
Note that tests might fail if an other number of ranks is used!

Executables and additional INCAR tags used for this test:

VASP_TESTSUITE_EXE_STD="mpirun -np 4 /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std"
VASP_TESTSUITE_EXE_NCL="mpirun -np 4 /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_ncl"
VASP_TESTSUITE_EXE_GAM="mpirun -np 4 /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_gam"
VASP_TESTSUITE_INCAR_PREPEND=""
VASP_TESTSUITE_REFERENCE=""

Executed at: 22_31_05/14/25
==================================================================

------------------------------------------------------------------

CASE: CrS
------------------------------------------------------------------
CASE: CrS
entering run_recipe CrS
CrS step STD
------------------------------------------------------------------
CrS step STD
entering run_vasp
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[master:2354784] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[master:2354786] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[master:2354787] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[master:2354785] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
running 4 mpi-ranks, with 2 threads/rank, on 1 nodes
distrk: each k-point on 2 cores, 2 groups
distr: one band on 1 cores, 2 groups
Offloading initialized ... 1 GPUs detected
free(): double free detected in tcache 2
[master:2354784] *** Process received signal ***
free(): double free detected in tcache 2
[master:2354785] *** Process received signal ***
[master:2354785] Signal: Aborted (6)
[master:2354785] Signal code: (-6)
[master:2354784] Signal: Aborted (6)
[master:2354784] Signal code: (-6)
free(): double free detected in tcache 2
free(): double free detected in tcache 2
[master:2354787] *** Process received signal ***
[master:2354787] Signal: Aborted (6)
[master:2354787] Signal code: (-6)
[master:2354786] *** Process received signal ***
[master:2354786] Signal: Aborted (6)
[master:2354786] Signal code: (-6)
[master:2354786] [ 0] [master:2354784] [ 0] [master:2354785] [ 0] /usr/lib64/libc.so.6(+0x3ea00)[0x7fe94ae3ea00]
[master:2354785] [ 1] [master:2354787] [ 0] /usr/lib64/libc.so.6(+0x3ea00)[0x7fbbd7e3ea00]
[master:2354787] [ 1] /usr/lib64/libc.so.6(+0x3ea00)[0x7f77fdc3ea00]
[master:2354786] [ 1] /usr/lib64/libc.so.6(+0x3ea00)[0x7ff09cc3ea00]
[master:2354784] [ 1] /usr/lib64/libc.so.6(+0x8ebec)[0x7ff09cc8ebec]
[master:2354784] [ 2] /usr/lib64/libc.so.6(+0x8ebec)[0x7fe94ae8ebec]
[master:2354785] [ 2] /usr/lib64/libc.so.6(raise+0x16)[0x7fe94ae3e956]
[master:2354785] [ 3] /usr/lib64/libc.so.6(+0x8ebec)[0x7fbbd7e8ebec]
[master:2354787] [ 2] /usr/lib64/libc.so.6(raise+0x16)[0x7fbbd7e3e956]
[master:2354787] [ 3] /usr/lib64/libc.so.6(+0x8ebec)[0x7f77fdc8ebec]
[master:2354786] [ 2] /usr/lib64/libc.so.6(raise+0x16)[0x7f77fdc3e956]
[master:2354786] [ 3] /usr/lib64/libc.so.6(raise+0x16)[0x7ff09cc3e956]
[master:2354784] [ 3] /usr/lib64/libc.so.6(abort+0xcf)[0x7ff09cc287f4]
[master:2354784] [ 4] /usr/lib64/libc.so.6(abort+0xcf)[0x7fbbd7e287f4]
[master:2354787] [ 4] /usr/lib64/libc.so.6(+0x82d3e)[0x7fbbd7e82d3e]
[master:2354787] [ 5] /usr/lib64/libc.so.6(abort+0xcf)[0x7fe94ae287f4]
[master:2354785] [ 4] /usr/lib64/libc.so.6(+0x82d3e)[0x7fe94ae82d3e]
[master:2354785] [ 5] /usr/lib64/libc.so.6(abort+0xcf)[0x7f77fdc287f4]
[master:2354786] [ 4] /usr/lib64/libc.so.6(+0x82d3e)[0x7f77fdc82d3e]
[master:2354786] [ 5] /usr/lib64/libc.so.6(+0x82d3e)[0x7ff09cc82d3e]
[master:2354784] [ 5] /usr/lib64/libc.so.6(+0x9893c)[0x7ff09cc9893c]
[master:2354784] [ 6] /usr/lib64/libc.so.6(+0x9893c)[0x7fbbd7e9893c]
[master:2354787] [ 6] /usr/lib64/libc.so.6(+0x9ac86)[0x7fbbd7e9ac86]
[master:2354787] [ 7] /usr/lib64/libc.so.6(+0x9893c)[0x7fe94ae9893c]
[master:2354785] [ 6] /usr/lib64/libc.so.6(+0x9ac86)[0x7fe94ae9ac86]
[master:2354785] [ 7] /usr/lib64/libc.so.6(+0x9893c)[0x7f77fdc9893c]
[master:2354786] [ 6] /usr/lib64/libc.so.6(+0x9ac86)[0x7f77fdc9ac86]
[master:2354786] [ 7] /usr/lib64/libc.so.6(free+0x73)[0x7f77fdc9d133]
[master:2354786] [ 8] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(+0x67d95)[0x7f7847a67d95]
[master:2354786] [ 9] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(pncclCommInitRank+0x380)[0x7f7847a689e0]
[master:2354786] [10] /usr/lib64/libc.so.6(+0x9ac86)[0x7ff09cc9ac86]
[master:2354784] [ 7] /usr/lib64/libc.so.6(free+0x73)[0x7ff09cc9d133]
[master:2354784] [ 8] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(+0x67d95)[0x7ff0e6a67d95]
[master:2354784] [ 9] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(pncclCommInitRank+0x380)[0x7ff0e6a689e0]
[master:2354784] [10] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x4505c7]
[master:2354784] [11] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x510ce7]
[master:2354784] [12] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x1c380c2]
[master:2354784] [13] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x418271]
[master:2354784] [14] /usr/lib64/libc.so.6(free+0x73)[0x7fbbd7e9d133]
[master:2354787] [ 8] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(+0x67d95)[0x7fbc21c67d95]
[master:2354787] [ 9] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(pncclCommInitRank+0x380)[0x7fbc21c689e0]
[master:2354787] [10] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x4505c7]
[master:2354787] [11] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x510ce7]
[master:2354787] [12] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x1c380c2]
[master:2354787] [13] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x418271]
[master:2354787] [14] /usr/lib64/libc.so.6(+0x29510)[0x7fbbd7e29510]
[master:2354787] [15] /usr/lib64/libc.so.6(free+0x73)[0x7fe94ae9d133]
[master:2354785] [ 8] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(+0x67d95)[0x7fe994c67d95]
[master:2354785] [ 9] /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib/libnccl.so.2(pncclCommInitRank+0x380)[0x7fe994c689e0]
[master:2354785] [10] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x4505c7]
[master:2354785] [11] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x510ce7]
[master:2354785] [12] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x1c380c2]
[master:2354785] [13] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x418271]
[master:2354785] [14] /usr/lib64/libc.so.6(+0x29510)[0x7fe94ae29510]
[master:2354785] [15] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x4505c7]
[master:2354786] [11] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x510ce7]
[master:2354786] [12] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x1c380c2]
[master:2354786] [13] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x418271]
[master:2354786] [14] /usr/lib64/libc.so.6(+0x29510)[0x7f77fdc29510]
[master:2354786] [15] /usr/lib64/libc.so.6(__libc_start_main+0x89)[0x7f77fdc295c9]
[master:2354786] [16] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x414055]
[master:2354786] *** End of error message ***
/usr/lib64/libc.so.6(+0x29510)[0x7ff09cc29510]
[master:2354784] [15] /usr/lib64/libc.so.6(__libc_start_main+0x89)[0x7ff09cc295c9]
[master:2354784] [16] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x414055]
[master:2354784] *** End of error message ***
/usr/lib64/libc.so.6(__libc_start_main+0x89)[0x7fbbd7e295c9]
[master:2354787] [16] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x414055]
[master:2354787] *** End of error message ***
/usr/lib64/libc.so.6(__libc_start_main+0x89)[0x7fe94ae295c9]
[master:2354785] [16] /home/jaewook/VASP/vasp.6.5.1/testsuite/../bin/vasp_std[0x414055]
[master:2354785] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node master exited on signal 6 (Aborted).
--------------------------------------------------------------------------
exiting run_vasp
exiting run_recipe CrS
./runtest: line 512: OUTCAR: No such file or directory
ERROR: the test yields different results for the energies, please check
-----------------------------------------------------------------------
paste: energy_outcar: No such file or directory
ERROR: compare_numbertable can't find file energy_outcar
./runtest: line 634: OUTCAR: No such file or directory
ERROR: the test yields different results for the forces, please check
---------------------------------------------------------------------
cat: force: No such file or directory
ERROR: compare_numbertable can't find file force
grep: OUTCAR: No such file or directory
/home/jaewook/VASP/vasp.6.5.1/testsuite/tools/compare_numbertable_new: symbol lookup error: /home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_mpifh.so.40: undefined symbol: mpi_conversion_fn_null_
the stress tensor is:
the tensor is correct, run successful

I only built vasp_std to get this result faster.
Here is my makefile.include:

Code: Select all

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
              -DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP \
              -DACC_OFFLOAD \
              -DNVCUDA \
              -DUSENCCL

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX)  > $*$(SUFFIX)

# N.B.: you might need to change the cuda-version here
#       to one that comes with your NVIDIA-HPC SDK
CUDA_VERSION = $(shell nvcc -V | grep -E -o -m 1 "[0-9][0-9]\.[0-9]," | rev | cut -c 2- | rev)

CC          = mpicc -acc -gpu=cc80,cc86,cuda${CUDA_VERSION} -mp
FC          = mpif90 -acc -gpu=cc80,cc86,cuda${CUDA_VERSION} -mp
FCL         = mpif90 -acc -gpu=cc80,cc86,cuda${CUDA_VERSION} -mp -c++libs

FREE        = -Mfree

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = $(CC)
CFLAGS_LIB  = -O -w
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)
OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = nvc++ --no_warnings

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host
FFLAGS     += $(VASP_TARGET_CPU)

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# If the above fails, then NVROOT needs to be set manually
#NVHPC      ?= /opt/nvidia/hpc_sdk
#NVVERSION   = 21.11
#NVROOT      = $(NVHPC)/Linux_x86_64/$(NVVERSION)

## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
OFLAG_IN   = -fast -Mwarperf
SOURCE_IN  := nonlr.o

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# BLAS (mandatory)
BLAS        = -lblas

# LAPACK (mandatory)
LAPACK      = -llapack

# scaLAPACK (mandatory)
SCALAPACK   = -Mscalapack

LLIBS      += $(SCALAPACK) $(LAPACK) $(BLAS)

# FFTW (mandatory)
FFTW_ROOT  ?= /home/jaewook
LLIBS      += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS       += -I$(FFTW_ROOT)/include

# Use cusolvermp (optional)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
CPP_OPTIONS+= -DCUSOLVERMP -DCUBLASMP
LLIBS      += -cudalib=cusolvermp,cublasmp -lnvhpcwrapcal

# HDF5-support (optional but strongly recommended, and mandatory for some features)
#CPP_OPTIONS+= -DVASP_HDF5
#HDF5_ROOT  ?= /path/to/your/hdf5/installation
#LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
#INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit for the OpenACC GPU port)
#CPP_OPTIONS+= -Dsysv
#FCL        += fftlib.o
#CXX_FFTLIB  = nvc++ -mp --no_warnings -std=c++11 -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(FFTW_ROOT)/include
#LIBS       += fftlib
#LLIBS      += -ldl

# For machine learning library vaspml (experimental)
#CPP_OPTIONS += -Dlibvaspml
#CPP_OPTIONS += -DVASPML_USE_CBLAS
#CPP_OPTIONS += -DVASPML_DEBUG_LEVEL=3
#CXX_ML      = mpic++ -mp
#CXXFLAGS_ML = -O3 -std=c++17 -Wall -Wextra
#INCLUDE_ML  =

# Add -gpu=tripcount:host to compiler commands for NV HPC-SDK > 25.1
NVFORTRAN_VERSION := $(shell nvfortran --version | sed -n '2s/^nvfortran \([0-9.]*\).*/\1/p')
define greater_or_equal
$(shell printf '%s\n%s\n' '$(1)' '$(2)' | sort -V | head -n1 | grep -q '$(2)' && echo true || echo false)
endef
ifeq ($(call greater_or_equal,$(NVFORTRAN_VERSION),25.1),true)
    CC  += -gpu=tripcount:host
    FC  += -gpu=tripcount:host
endif
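As an aside, the `greater_or_equal` logic at the end of this makefile.include can be checked outside of make. Below is a minimal shell sketch of the same `sort -V` comparison (the function name mirrors the make define; only the `-qx` exact-line match is added here to avoid substring hits):

```shell
# Mirrors the greater_or_equal define above: prints "true" if $1 >= $2
# in version ordering, by checking that $2 sorts first under sort -V.
greater_or_equal() {
    printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1 | grep -qx "$2" && echo true || echo false
}

greater_or_equal 25.3 25.1   # prints "true"  -> -gpu=tripcount:host gets added
greater_or_equal 24.11 25.1  # prints "false" -> flag is not added
```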

and my nvidia_sdk info

[jaewook@master vasp.6.5.1]$ nvfortran --version

nvfortran 25.3-0 64-bit target on x86-64 Linux -tp icelake-server
NVIDIA Compilers and Tools
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[jaewook@master vasp.6.5.1]$ echo $PATH
/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/extras/qd/bin:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/mpi/bin:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/bin:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/bin:/usr/share/Modules/bin:/usr/local/cuda-12.0/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/user/sbin:/usr/sbin
[jaewook@master vasp.6.5.1]$ echo $LD_LIBRARY_PATH
/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nvshmem/lib:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/math_libs/lib64:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/lib64:/usr/local/cuda-12.0/lib64:/usr/loal/lib:/usr/local/lib64:/usr/lib:/usr/lib64:/usr/local/cuda-12.0/lib64:/home/openmpi/lib:/home/openmpi/lib:/usr/local/cuda-12.0/lib64:/home/jaewook/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/extras/qd/lib:/usr/local/cuda-12.0/lib64:/home/openmpi/lib/
[jaewook@master vasp.6.5.1]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
[jaewook@master vasp.6.5.1]$ nvidia-smi
Wed May 14 22:38:25 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A2 Off | 00000000:CA:00.0 Off | 0 |
| 0% 37C P8 6W / 60W | 9MiB / 15356MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
[jaewook@master vasp.6.5.1]$ which mpirun
~/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/mpi/bin/mpirun

I don't know how to solve this problem.
Please help me install VASP 6.5.1 with CUDA-aware MPI.


jaewook_kim
Newbie
Posts: 2
Joined: Thu Apr 17, 2025 10:44 am

Re: vasp 6.5.1 installation issue using NVIDIA SDK 25.3

#2 Post by jaewook_kim » Thu May 15, 2025 3:19 pm

Update:
I found that the OpenMPI shipped with the NVIDIA SDK assumes my system is interconnected with other compute nodes via InfiniBand.
Since my server has no InfiniBand NIC (such as a ConnectX-7), it seems I should not use the mpicc from the NVIDIA SDK.
I successfully compiled VASP 6.5.1 on the same server using the MPI compiler provided by the Intel oneAPI HPC Toolkit (for a CPU-only build).
So, after I try to install a CUDA-aware OpenMPI manually, I'll post the answer as an update.
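For reference, a CUDA-aware Open MPI build is usually configured with the `--with-cuda` flag. The sketch below is only an outline: the prefix, CUDA path, and compiler choices are assumptions for this particular setup and must be adapted to the actual system.

```shell
# Rough sketch of configuring a CUDA-aware Open MPI build against the
# HPC SDK's bundled CUDA. All paths below are assumptions; adjust them.
CUDA_HOME=$HOME/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda

./configure --prefix=$HOME/opt/openmpi-cuda \
            --with-cuda=$CUDA_HOME \
            CC=nvc CXX=nvc++ FC=nvfortran
make -j"$(nproc)" && make install
```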


jonathan_lahnsteiner2
Global Moderator
Posts: 260
Joined: Fri Jul 01, 2022 2:17 pm

Re: vasp 6.5.1 installation issue using NVIDIA SDK 25.3

#3 Post by jonathan_lahnsteiner2 » Mon May 19, 2025 6:55 am

Dear Jaewook Kim,

Maybe it is helpful to take a look at this post: forum/viewtopic.php?t=20172.
You could try setting the following environment variables to get a CUDA-aware MPI:

Code: Select all

export MPICH_GPU_SUPPORT_ENABLED=1 
export PMPI_GPU_AWARE=1

Otherwise, I fear there is no way around compiling a CUDA-aware OpenMPI yourself.
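Whether a given Open MPI installation was built with CUDA support can be checked with `ompi_info` (this query is part of standard Open MPI diagnostics):

```shell
# Prints ...:mpi_built_with_cuda_support:value:true if the Open MPI
# found in PATH is CUDA-aware, ...:value:false otherwise.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```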

All the best, Jonathan


ivr900
Newbie
Posts: 6
Joined: Wed Nov 13, 2019 10:03 pm

Re: vasp 6.5.1 installation issue using NVIDIA SDK 25.3

#4 Post by ivr900 » Tue May 27, 2025 3:07 am

Dear Jaewook,

I see many contradictions in your setup, installation and testing procedure.

  • 1. Regarding testing.
    You have just one GPU according to the 'nvidia-smi' output, but you try to run a GPU VASP job with 4 MPI ranks. Not that it is impossible, but it requires some extra effort, such as activating the NVIDIA MPS service (which needs root privileges). Without it, only one MPI rank per GPU is allowed.
    For that reason I suggest you try to run the tests with 1 MPI rank.

    To avoid messing with the 'runtest' script content, I suggest you create a custom file "nvidia-omp-ompi.conf" in the vasp.6.5.1/testsuite directory with content like the one below:

    Code: Select all

    # define the commands that run vasp_std, vasp_ncl, and vasp_gam
    #
    nranks=1
    nthrds=12 # may put number of cores per node divided by nranks
    
    mpi="-np $nranks --map-by numa:PE=$nthrds --bind-to core -x OMP_NUM_THREADS=$nthrds"
    
    # For the GNU or NVIDIA OpenMP runtime (gomp/nvomp)
    omp="-x OMP_STACKSIZE=512m -x OMP_PLACES=cores -x OMP_PROC_BIND=close"
    
    export VASP_TESTSUITE_CUDA=Y
    export VASP_TESTSUITE_EXE_STD="mpirun $mpi $omp $PWD/../bin/vasp_std"
    export VASP_TESTSUITE_EXE_NCL="mpirun $mpi $omp $PWD/../bin/vasp_ncl"
    export VASP_TESTSUITE_EXE_GAM="mpirun $mpi $omp $PWD/../bin/vasp_gam"
    

    To use this file, your runtest line must be modified as follows:

    Code: Select all

    ./runtest --fast nvidia-omp-ompi.conf 2>&1 | tee testsuite.log 
  • 2. NVIDIA driver
    I see that your NVIDIA driver is version 525.125.06. That is an obsolete version for use with the latest NVIDIA HPC SDK. You must update it to 570.

  • 3. CUDA version
    I see cuda-12.0 in your PATH and LD_LIBRARY_PATH. Though it comes after /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda in both variables, I would suggest removing cuda-12.0 from your environment setup files when NVIDIA HPC SDK 25.3 is used.
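Regarding point 1, the NVIDIA MPS service is typically started as follows. This is only a sketch: both commands require root privileges, and the exact procedure may differ between systems.

```shell
# As root: put GPU 0 into exclusive-process mode and start the MPS control
# daemon, which multiplexes several MPI ranks onto the single GPU.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

# Later, to shut MPS down again:
echo quit | nvidia-cuda-mps-control
```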

Please let us know here if the suggestions above were helpful, or if you have any questions regarding them.

Kind regards,
Ivan Rostov
NCI Australia, Canberra

