Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


samuel.d.young29.ctr
Newbie
Posts: 8
Joined: Tue May 06, 2025 6:04 pm

Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#1 Post by samuel.d.young29.ctr » Tue Mar 24, 2026 7:28 pm

Dear VASP admins and devs,

Thanks so much for the hard work on the first release of the port for AMD/Intel GPUs. We have a Cray machine with GPU nodes (four MI300A GPUs per node) on which we want to run VASP, and we are trying to build version 6.6.0 with GNU Make. We started from the "cray_omp_off" makefile.include template:

Code: Select all

# Precompiler options
CPP_OPTIONS = -DHOST=\"LinuxFTN\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -DMPI_INPLACE \
              -Dvasp6 \
              -Dtbdyn \
              -Dfock_dblbuf

# activate OpenMP and gpu offloading
CPP_OPTIONS += -D_OPENMP \
               -DOMP_OFFLOAD \
               -DCRAYHIP

CPP        = cpp --traditional -E -P -Wno-endif-labels $*$(FUFFIX) >$*$(SUFFIX) $(CPP_OPTIONS)

FC         = ftn -hnoacc -homp
FCL        = $(FC)

FREE       = -ffree -N 1023

FFLAGS     = -dC -rmo -emEb
# lower the ipa level for inlining to 0 to avoid compiler problems
FFLAGS     += -hipa0
# suppress warnings
FFLAGS     += -m 4 

# O2 recommended for optimal GPU performance, O1 significantly slower in certain
# GPU kernels
OFLAG      = -O2
OFLAG_IN   = $(OFLAG)
DEBUG      = -O0

# fine grain control over lapack, by default ftn will link libsci with the
# appropriate configuration
# LAPACK     = -L${CRAY_LIBSCI_PREFIX_DIR}/lib -lsci_cray_mpi
# LLIBS      = $(LAPACK)

# FFTW_ROOT  ?= /opt/cray/pe/fftw/3.3.8.11/x86_rome
LLIBS      += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS       = -I$(FFTW_ROOT)/include

# HIP
CLANG      = cc

# ROCM_PATH  ?= /opt/rocm
HIPCC      ?= ${ROCM_PATH}/bin/hipcc

ROCM_INCS  = -I${ROCM_PATH}/include -I${ROCM_PATH}/include/hip -I${ROCM_PATH}/include/rocblas -I${ROCM_PATH}/include/rocsolver -I${ROCM_PATH}/include/rocfft

ROCM_LIBS  = -L${ROCM_PATH}/hip/lib -lamdhip64 \
             -L${ROCM_PATH}/lib -lrocblas -lrocfft -lrocsolver -lcraymp

# using RCCL aka NCCL for direct multi-GPU communication, recommended for best
# performance
CPP_OPTIONS += -DUSENCCL
ROCM_LIBS   += -lrccl

LLIBS      += $(ROCM_LIBS)

LIBS       += HIP
LLIBS      += -LHIP -lHipInterface

#
# For what used to be vasp.5.lib
CPP_LIB    = $(CPP)
FC_LIB     = $(FC)
CC_LIB     = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB   = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS   = CC
LLIBS      += -lstdc++

# Normally no need to change this
SRCDIR     = ../../src
BINDIR     = ../../bin

# HDF5-support (optional but strongly recommended, and mandatory for some
# features)
CPP_OPTIONS+= -DVASP_HDF5
# HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# Get major version of crayftn
CRAYFTNVER=$(shell crayftn --version 2>/dev/null | grep "Version" | sed -n 's/.*Version \([0-9]\+\)\..*/\1/p')
CPP_OPTIONS += -D__DCRAYFTN_VERSION=$(CRAYFTNVER)

### special cray workarounds cce v19.0.0, remove for cce20
 # error Unsupported OpenMP construct Calls -- _cray_dv_broadcast : W_G%CPTWFP=0
 OBJECTS_O2 += rot.o
 # fexcg has to be at a higher optimization level so the kernel does not spill
 OBJECTS_O2 += fexcg.o mbj.o ldalib.o ggalib.o mggalib.o
 # error: unexpected type in TYPE_DEREF l818 (copyin_wavefun1_array)
 OBJECTS_O1 += openmp.o
 # error: unexpected type in TYPE_DEREF l724 (twoelectron4o_acc)
 OBJECTS_O1 += twoelectron4o.o
 # error: unexpected type in TYPE_DEREF l377 (calculate_local_field_fock)
 OBJECTS_O1 += local_field.o
 # for the next problem we use OBJECTS_O3 to remove omp
 FFLAGS_3 += -hnoomp
 # error: Found inner_ref/inner_def object without Fortran internal procedure  l5515
 OBJECTS_O3 += bse.o
 # error: Found inner_ref/inner_def object without Fortran internal procedure l1644
 OBJECTS_O3 += GG_base.o
 # MLFF problems with ISTART=2
 OBJECTS_O1 += ml_ff_math.o ml_ff_ff2.o
#################

On the Cray machine, we have the following libraries/frameworks loaded:

Code: Select all

Currently Loaded Modulefiles:
craype-x86-genoa
libfabric/1.22.0
craype-network-ofi
perftools-base/25.03.0
xpmem/2.11.5-1.3_g73ade43320bc
cce/19.0.0
craype/2.7.34
cray-dsmml/0.3.1
cray-mpich/8.1.32
cray-libsci/25.03.0
PrgEnv-cray/8.6.0
cray-libpals/1.6.1
cray-pals/1.6.1
bct-env/0.2
mpscp/1.3a
rocm/6.3.0
cray-hdf5/1.14.3.5
craype-accel-amd-gfx942
cray-fftw/3.3.10.10

The build process works fine, but when testing the binary on a representative system, VASP fails before entering the main loop whenever I use more than one GPU or MPI rank.

My Slurm submission script is like this:

Code: Select all

#!/bin/bash
#SBATCH --job-name=test-amdgpu-vasp
#SBATCH --account=<account_name>
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=gpu
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --exclusive
#SBATCH --time=30:00
#
#SBATCH --requeue
#SBATCH --open-mode=append

# Use 4-8 OpenMP threads, as recommended in
# https://vasp.at/wiki/GPU_ports_of_VASP#Environment_variables
export OMP_NUM_THREADS=4
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# Setting offload env vars as described in
# https://vasp.at/wiki/GPU_ports_of_VASP#Environment_variables
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_STACKSIZE=2048m

# Remove STOPCAR file so job isn't blocked
if [ -f "STOPCAR" ]; then
    rm STOPCAR
fi

# Load dynamic libraries needed by VASP.
module load cray-mpich
module load rocm/6.3.0
module load cray-hdf5/1.14.3.5
module load craype-accel-amd-gfx942
module load cray-fftw/3.3.10.10

# Ensure that stack size is unlimited.
ulimit -s unlimited

# Start VASP binary.

# This fails.
# mpirun -np 4 --bind-to core ./vasp_gam

# This fails as well.
# srun --unbuffered --cpu-bind=cores --gpu-bind=none ./vasp_gam

# And this fails as well.
mpirun -np 4 --cpu-bind=core --gpu-bind=none ./vasp_gam
# mpirun -np 2 ./vasp_gam

wait

In stdout, I see all GPUs detected and offloading initialized successfully:

Code: Select all

 running    4 mpi-ranks, with    4 threads/rank, on    1 nodes
 distrk:  each k-point on    4 cores,    1 groups
 distr:  one band on    1 cores,    4 groups
 Offloading initialized ...    4 GPUs detected
 vasp.6.6.0 06Mar2026 (build Mar 24 2026 09:52:28) gamma-only
 POSCAR found type information on POSCAR <redacted>
 POSCAR found :  7 types and    1038 ions
 Reading from existing POTCAR
 scaLAPACK will be used
 Reading from existing POTCAR
 -----------------------------------------------------------------------------
|                                                                             |
|               ----> ADVICE to this user running VASP <----                  |
|                                                                             |
|     You enforced a specific xc type in the INCAR file but a different       |
|     type was found in the POTCAR file.                                      |
|     I HOPE YOU KNOW WHAT YOU ARE DOING!                                     |
|                                                                             |
 -----------------------------------------------------------------------------

When running with a single MPI rank on a single MI300A GPU, VASP successfully enters the main loop and starts SCF cycles. But with multiple MPI ranks, the job fails with CPU and GPU core dumps, and stderr typically looks like this:

Code: Select all

Memory access fault by GPU node-4 (Agent handle: 0x1f1e6770) on address 0x14614ec00000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x2598e770) on address 0x152048e04000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x17a4b770) on address 0x145c9c001000. Reason: Unknown.
srun: error: nid-ai05: task 2: Aborted (core dumped)
srun: Terminating StepId=135072.0
slurmstepd: error: *** STEP 135072.0 ON nid-ai05 CANCELLED AT 2026-03-24T18:38:23 ***
srun: error: nid-ai05: task 0: Terminated
srun: error: nid-ai05: tasks 1,3: Aborted (core dumped)
srun: Force Terminated StepId=135072.0

The errors persist whether I use mpirun (from Cray PALS) or Slurm's srun to launch the job, and whether or not I enable RCCL via the -DUSENCCL option in the makefile.include. I am using v19.0.0 of the Cray compilers, with the necessary optimization fixes at the bottom of the makefile.

I've asked my local HPC admins for help, but I wanted to ask whether you've seen anything similar in your testing, and whether there's anything I'm doing wrong.

Thanks in advance!


ahampel
Global Moderator
Posts: 198
Joined: Tue Feb 16, 2016 11:41 am

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#2 Post by ahampel » Tue Mar 24, 2026 8:52 pm

Hi,

thank you for trying the new OpenMP offloading feature, and sorry for the problems. Let's try to iron things out. I had some trouble on MI300 systems with correct MPI rank placement. You say this only occurs when using more than 1 MPI rank, right? And it happens regardless of which input files you use?

Can you try (just to be safe, and to replicate our setup on the test server) setting this in the Slurm job as well:

Code: Select all

export MPICH_OFI_NIC_POLICY=GPU
export MPICH_OFI_RMA_STARTUP_CONNECT=1
export MPICH_OFI_STARTUP_CONNECT=1

export MPICH_ALLTOALL_SHORT_MSG=4096
export MPICH_ALLTOALL_SYNC_FREQ=24
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_IPC_THRESHOLD=8192

export NCCL_LAUNCH_ORDER_IMPLICIT=1

# RCCL / NCCL async comm
export MPICH_ASYNC_PROGRESS=1
export MPICH_GPU_USE_STREAM_TRIGGERED=1

# use the new cray-mpich IPC caching mechanism
export GTL_DISABLE_HSA_IPC_SIGNAL_CACHE=0
export GPU_MAX_HW_QUEUES=8

I expect (and hope) that this does not change the problem. Next, can you try the following command in the script:

Code: Select all

srun -n $NRANKS  --gpus=$NRANKS script/for/pin/run_vasp.sh $VASP_DIR/vasp_std

with the attached bash script to pin the MPI ranks. You only have to specify $VASP_DIR here, or remove it. The script will be correct for a Genoa node with MI300, if the configuration is similar to what I saw. Maybe make sure that you have 192 cores in total on the node. For this test it might be good to take the full node with --exclusive, as you did.
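
In essence, the attached script just maps each local MPI rank onto one GPU and the matching NUMA domain. A rough sketch of that idea (not necessarily identical to the attached file; the one-NUMA-domain-per-GPU layout is an assumption for a 192-core Genoa node with four MI300A) looks like this:

Code: Select all

#!/bin/bash
# Sketch of a rank-pinning wrapper: expose one GPU per local MPI rank and bind
# the rank to the matching NUMA domain. Assumes 4 GPUs with one NUMA domain
# each; adjust for your node layout.
export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}
exec numactl --cpunodebind=${SLURM_LOCALID} --membind=${SLURM_LOCALID} "$@"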

Let me know if this changes anything about the problem. For me this works; without pinning each rank appropriately I sometimes see errors. If the problem persists we have to dig deeper.

Best,
Alex


samuel.d.young29.ctr
Newbie
Posts: 8
Joined: Tue May 06, 2025 6:04 pm

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#3 Post by samuel.d.young29.ctr » Tue Mar 24, 2026 9:19 pm

Hi Alex,

Thanks for the quick response! Your pinning script appears to work for the non-RCCL build on our GPU nodes, and I'm seeing SCF steps complete. Mind if we keep this ticket open until I've done some more testing with the RCCL-enabled build and some performance testing?

Much thanks,
Sam


ahampel
Global Moderator
Posts: 198
Joined: Tue Feb 16, 2016 11:41 am

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#4 Post by ahampel » Tue Mar 24, 2026 9:54 pm

Hi Sam,

yes, please - I would be curious to figure out why a simple srun or mpirun does not work. I never had this problem on MI250 machines. I am wondering if there is something special about the shared memory of the MI300A. Maybe if the MPI rank ends up on the wrong NUMA node it actually cannot access the memory correctly? I am a bit puzzled by this.

Let me know how the RCCL build testing goes.

Best,
Alex


samuel.d.young29.ctr
Newbie
Newbie
Joined: Tue May 06, 2025 6:04 pm

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#5 Post by samuel.d.young29.ctr » Wed Mar 25, 2026 9:04 pm

Hi Alex,

Just wanted to provide an update. The RCCL build works as well with your pinning script, with about a 20% performance benefit compared to the non-RCCL version. (For these tests I disabled the diagnostic env vars you suggested earlier.)

Still doing some benchmarking on the exact number of OpenMP threads to use, but for my system I'm seeing the best performance at OMP_NUM_THREADS=2 and OMP_NUM_THREADS=4. Interestingly, I get a different memory error if I try to disable CPU-side OpenMP parallelization by setting OMP_NUM_THREADS=1 and #SBATCH --cpus-per-task=1:

Code: Select all

ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffe238c4d00 to 7ffe2ad0c500) overlaps present region (7ffe1a51cd00 to 7ffe2915cd00 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffca6a4a280 to 7ffcade91a80) overlaps present region (7ffc9d6a2280 to 7ffcac2e2280 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffc2387dcc0 to 7ffc2acc54c0) overlaps present region (7ffc1a4d5cc0 to 7ffc29115cc0 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffff2a94d00 to 7ffff9edc500) overlaps present region (7fffe96ecd00 to 7ffff832cd00 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
srun: error: nid-ai10: tasks 0-3: Exited with exit code 1

Practically, this isn't much of an issue since I now have a way to run VASP on all four GPUs, and even using two OpenMP threads per MPI task gets me close to the performance of the four-A100 node I've been using, which is good enough for my needs. But it's interesting that I can't run with just a single OpenMP thread.
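
For reference, the relevant parts of my job script for the two-thread runs are roughly as follows (run_vasp.sh here stands for your pinning script; the counts are just what I use on one node):

Code: Select all

#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-node=4

export OMP_NUM_THREADS=2
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_STACKSIZE=2048m
export MPICH_GPU_SUPPORT_ENABLED=1

# run_vasp.sh is the rank-pinning wrapper from earlier in this thread
srun -n 4 --gpus=4 ./run_vasp.sh ./vasp_gam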

Any other testing you want me to try on my end?

Best,
Sam


ahampel
Global Moderator
Posts: 198
Joined: Tue Feb 16, 2016 11:41 am

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#6 Post by ahampel » Thu Mar 26, 2026 4:26 pm

Dear Sam,

I had another look at the issue. Indeed, we are missing something in our code in the way we initialize HIP/ROCm: we should pin each MPI rank to exactly one GPU. We do this for the OMP runtime but not for HIP/ROCm. I created an internal issue and will work on a fix to be pushed into the next release. This means that, for now, we have to rely on Slurm giving each rank only one GPU, or on a wrapper script. In principle the wrapper script can be much simpler, something like:

Code: Select all

#!/bin/bash

# Map each local MPI rank to one GPU and expose only that GPU to ROCm.
gpu_map=(0 1 2 3)
myGPU=${gpu_map[SLURM_LOCALID]}
#myGPU=${gpu_map[PMI_RANK]}
echo ${SLURM_LOCALID} " " ${myGPU}
export ROCR_VISIBLE_DEVICES=${myGPU}
exec "$@"

or maybe even simpler. We somehow overlooked this issue. Thank you for testing!
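
For completeness: if you save this as, say, select_gpu.sh (the name is just an example) and make it executable, the launch would then look roughly like:

Code: Select all

# illustrative invocation; script name and rank count are examples
chmod +x select_gpu.sh
srun -n 4 --gpus-per-node=4 ./select_gpu.sh ./vasp_gam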

I have now tried with 1 OMP thread per MPI rank and could not find any issues. Is the problem only occurring for a specific job for you? If so, can you maybe share your input files, then I can try this as well.

Best,
Alex


ahampel
Global Moderator
Posts: 198
Joined: Tue Feb 16, 2016 11:41 am

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#7 Post by ahampel » Tue Mar 31, 2026 8:28 am

Dear Sam,

Here is a minimal patch that you can apply to src/openmp.F to fix this issue:

Code: Select all

23a24,30
> #ifdef CRAYHIP
>           SUBROUTINE HIP_SET_DEVICE(DEVID) BIND(C, NAME="hip_set_device")
>              USE iso_c_binding
>              INTEGER(c_int), VALUE :: DEVID
>           END SUBROUTINE HIP_SET_DEVICE
> #endif
>
143a151,154
> #ifdef CRAYHIP
>       ! set the HIP GPU / device to be used by this MPI rank to the same as the OMP runtime so that ROCm uses the same GPU
>       CALL HIP_SET_DEVICE(DEVICE_NUM)
> #endif

We just have to CALL HIP_SET_DEVICE(DEVICE_NUM) at line 143, after the OMP runtime device is set, and add an interface to the C function at the top. With this change, setting ROCR_VISIBLE_DEVICES for each rank is not needed anymore.

Best,
Alex


samuel.d.young29.ctr
Newbie
Posts: 8
Joined: Tue May 06, 2025 6:04 pm

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#8 Post by samuel.d.young29.ctr » Thu Apr 02, 2026 8:50 pm

Hi Alex,

I tried your patch for openmp.F. When using two or more OpenMP threads per MPI rank, VASP launches and runs fine without needing to set $ROCR_VISIBLE_DEVICES or use any wrapper script. When OMP_NUM_THREADS and Slurm --cpus-per-task are set to 1, however, I still get the same "overlapping regions" error as before:

Code: Select all

ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffc314e1b80 to 7ffc38929380) overlaps present region (7ffc28139b80 to 7ffc36d79b80 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffd53862840 to 7ffd5acaa040) overlaps present region (7ffd4a4ba840 to 7ffd590fa840 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffd8f124240 to 7ffd9656ba40) overlaps present region (7ffd85d7c240 to 7ffd949bc240 index 244) but is not contained for 'cr(:)' from fft_base.f90:665
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7fff496429c0 to 7fff50a8a1c0) overlaps present region (7fff4029a9c0 to 7fff4eeda9c0 index 244) but is not contained for 'cr(:)' from fft_base.f90:665

Not sure I patched this file correctly. Here is the actual unified-style (i.e., diff -u0) diff I made to that file:

Code: Select all

--- openmp-orig.F	2026-04-02 14:20:49.000000000 -0400
+++ openmp.F	2026-04-02 14:25:26.000000000 -0400
@@ -23,0 +24,11 @@
+#ifdef CRAYHIP
+         ! Define interface to C function HIP_SET_DEVICE(DEVICE_NUM) after the
+         ! OMP runtime device has been set. With this approach, we no longer
+         ! need to specify $ROCR_VISIBLE_DEVICES for each rank. See
+         ! https://www.vasp.at/forum/viewtopic.php?p=33311#p33311.
+         SUBROUTINE HIP_SET_DEVICE(DEVID) BIND(C, NAME="hip_set_device")
+               USE iso_c_binding
+               INTEGER(c_int), VALUE :: DEVID
+         END SUBROUTINE HIP_SET_DEVICE
+#endif
+
@@ -143,0 +155,7 @@
+
+#ifdef CRAYHIP
+      ! Set the HIP GPU/device to be used by this MPI rank to the same as the
+      ! OMP runtime so that ROCm uses the same GPU. See
+      ! https://www.vasp.at/forum/viewtopic.php?p=33311#p33311.
+      CALL HIP_SET_DEVICE(DEVICE_NUM)
+#endif

Nevertheless, still very happy that it now runs without needing manual pinning of ranks to cores. Any additional testing you'd like me to do?

Thanks,
Sam


ahampel
Global Moderator
Posts: 198
Joined: Tue Feb 16, 2016 11:41 am

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#9 Post by ahampel » Fri Apr 03, 2026 9:54 am

Hi Sam,

no, my fix should only address the GPU binding for the HIP library. The other problem is still unclear to me. How large is the system you are running? Can you try the std version of VASP?

Another user reported the same problem here: https://www.vasp.at/forum/viewtopic.php?t=20612 . We will investigate.

Best,
Alex


samuel.d.young29.ctr
Newbie
Posts: 8
Joined: Tue May 06, 2025 6:04 pm

Re: Trouble running CRAYHIP (AMD MI300A) port of VASP 6.6.0 on more than 1 MPI rank or 1 GPU

#10 Post by samuel.d.young29.ctr » Fri Apr 03, 2026 3:54 pm

Hi Alex,

This is a system of 1038 ions and 7 types. Memory usage is well within the capacity of each GPU.

Running vasp_std results in the same behavior as vasp_gam: successful launch and run for $OMP_NUM_THREADS and --cpus-per-task set to 2, and the same "overlaps present region" error for $OMP_NUM_THREADS and --cpus-per-task set to 1.

Sam

