Compiling with AOCC/AOCL OpenMPI SGRGEN error
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
My system admin and I are trying some new things with VASP. He got a new node to try out with large-cache AMD server chips (2x AMD EPYC 9684X, 96 cores each, with 1152 MB of L3 cache per CPU). We want to test how VASP simulations scale on this machine and compare it to our local cluster (2x AMD EPYC 9654, 96 cores each, with 384 MB of L3 cache per CPU). In the end, this is just a comparison of how VASP utilizes cache and what that does for its efficiency.
Besides compiling it the traditional FOSS way (GCC + OpenMPI + OpenBLAS + Netlib ScaLAPACK + FFTW, which worked fine and performed better on the larger-cache chips), we also wanted to see how the AOCC compiler and the AOCL math libraries, together with OpenMPI, would change the performance. We assume that these are better at taking advantage of the large cache. However, although the compilation reports that it finishes successfully, we get crashes when we try to run our example simulation (```VERY BAD NEWS! internal error in subroutine SGRGEN: Too many elements 49 ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- ```). Looking around, we see that there were some forum discussions on this topic before (see: https://wwww.vasp.at/forum/viewtopic.ph ... GEN#p17814, or https://wwww.vasp.at/forum/viewtopic.ph ... GEN#p17686). However, the solution links to a webpage that is no longer available (http://cms.mpi.univie.ac.at/vasp-forum/ ... GEN#p17686). As it is indicated that the issue might be caused by the MPI implementation, we are currently rechecking the compilation and the runtime environment variables. I will send the exact compiler versions and settings, the makefile.include, as well as the compile output and the output of the crashed simulation later today. I assume that these are needed to solve the issue.
On another note: we are using a small AIMD example to assess the performance. However, it uses 374 atoms and a single k-point (gamma only). Does someone have another realistic system to check the performance difference with (about two hours of runtime on 4 cores, and of course faster when we try 192 cores), so that we can check whether the cache matters? Preferably something that is not AIMD (a plain DFT calculation or a structure optimization), with more k-points, and with an INCAR that is already somewhat optimized for larger numbers of cores. I will share the results afterwards.
Regards,
Jelle Lagerweij
-
- Global Moderator
- Posts: 531
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
https://www.vasp.at/wiki/index.php/Validation_tests
After that please send important files like stdout, OUTCAR, INCAR, POSCAR, POTCAR, KPOINTS from any job that failed. Preferably from the smallest job.
Please also send your makefile.include for compilation.
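For anyone collecting these files, a minimal shell sketch follows. It bundles whichever of the requested files exist in one run directory into a single archive. The function name `bundle_vasp_case` and the idea of keeping makefile.include next to the run files are assumptions made for illustration, not part of VASP.

```shell
#!/usr/bin/env bash
# Sketch: bundle the requested files from one run directory into a tar.gz.
# bundle_vasp_case is a hypothetical helper name, not a VASP tool.
bundle_vasp_case() {
    local run_dir="$1" archive="$2"
    local wanted="INCAR POSCAR KPOINTS POTCAR OUTCAR stdout makefile.include"
    local found=()
    local f
    for f in $wanted; do
        if [ -f "$run_dir/$f" ]; then
            found+=("$f")
        else
            # Report missing files on stderr, but keep going.
            echo "note: $f not found in $run_dir" >&2
        fi
    done
    # Refuse to create an empty archive.
    [ "${#found[@]}" -gt 0 ] || { echo "nothing to bundle" >&2; return 1; }
    tar -czf "$archive" -C "$run_dir" "${found[@]}"
}
```

It could be called as `bundle_vasp_case ./runcase runcase.tar.gz`, where `./runcase` is a placeholder for the directory of the smallest failing job.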
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Thanks for your reply, and sorry that it took a while to answer. I had prepared some files that were wrong and needed to fix them. The runtime issue remains the same, though. My colleague will try to use the standard
make test
I have added the makefile.include and the compiler output in the compiling subfolder, and the stdout, OUTCAR, INCAR, POSCAR, POTCAR, and KPOINTS in the runcase subfolder. I also reran these simulations with my working installs (I have two versions: 1) gcc11 + openmpi + mkl + hdf5 and 2) gcc11 + openmpi + openblas + netlib-scalapack + fftw + hdf5). With both, they ran flawlessly, so I assume that nothing is wrong with the input files themselves. The example case is a 25-time-step AIMD simulation that should take approximately 15 minutes to run. It was already this short because we were experimenting with parallel-efficiency testing.
Additionally, before running the AOCC compiled version, we made sure that the correct environment was set up (as the machine has multiple environments available). Therefore, we used the following code to start the simulation:
. /opt/AMD/setenv_AOCC.sh
export PATH=/opt/openmpi-5.0.2-aocc/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-5.0.2-aocc/lib:/opt/amd-fftw/lib:/opt/amd-scalapack/lib/LP64:/opt/amd-blis/lib/LP64:/opt/amd-libflame/lib/LP64:$LD_LIBRARY_PATH
OMP_NUM_THREADS=1 time mpirun -np 32 /home/grepit/TestCase_Gerben/vasp.6.4.2-AOCC/bin/vasp_gam
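One way to double-check that such an environment really takes effect is to look at which shared libraries the binary resolves at run time. The sketch below filters `ldd` output and prints every library that resolves outside a list of expected path prefixes, plus any that are not found. The function name `libs_outside_prefixes` is made up for this example, and the prefixes in the usage line are simply taken from the paths above.

```shell
#!/usr/bin/env bash
# Sketch: read `ldd` output on stdin and print libraries that resolve outside
# the path prefixes given as arguments, or that are not found at all.
# libs_outside_prefixes is a hypothetical helper name.
libs_outside_prefixes() {
    awk -v prefixes="$*" '
        BEGIN { n = split(prefixes, p, " ") }
        # Lines of the form: libname => /resolved/path (address)
        $2 == "=>" && $3 ~ /^\// {
            ok = 0
            for (i = 1; i <= n; i++) if (index($3, p[i]) == 1) ok = 1
            if (!ok) print $1, $3
        }
        # Lines of the form: libname => not found
        $2 == "=>" && $3 == "not" { print $1, "NOT FOUND" }
    '
}

# Usage (system libraries will also be listed unless /lib and /usr/lib are passed too):
#   ldd /home/grepit/TestCase_Gerben/vasp.6.4.2-AOCC/bin/vasp_gam \
#       | libs_outside_prefixes /opt/openmpi-5.0.2-aocc /opt/amd- /lib /usr/lib
```

If an MPI, BLAS, or ScaLAPACK library shows up in this list, the binary is picking it up from somewhere other than the intended install, which would be worth ruling out before looking at VASP itself.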
Kind regards,
Jelle Lagerweij
-
- Global Moderator
- Posts: 531
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
So could you run any calculation (for example, the ones from the test suite), or do you get the error message every time?
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
I believe that we got this error every time. The machine was with someone else (our system administrator), but he mentioned that he got this issue in all test cases when using the aocc or the intel compilers (although both compiled successfully). He also used the standard gcc+openmpi+openblas+netlib-scalapack+fftw installation method. In that case, everything worked fine.
I am currently trying the manual provided by AMD themselves (https://www.amd.com/en/developer/zen-so ... /vasp.html). The only drawback is that VASP is licensed software and that spack uses a checksum on the compressed archive to verify that you have a correct version. This is totally fine with me, except that version 6.4.2 is not implemented in spack at this point, while my compressed VASP files are version 6.4.2 (and I rely on VASP 6.4+ features in some larger simulations). I am currently adjusting the spack installation method myself: after spack install, I use <spack edit vasp> and add version 6.4.2 with the checksum I retrieved from my official VASP 6.4.2 archive. I want to see how this works as well.
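As a rough illustration of the checksum step, the sketch below computes the SHA-256 of a local archive and prints a line in the `version(...)` form that Spack package files use. The helper name `spack_version_line` and the archive file name in the usage line are placeholders, and whether the vasp package needs anything beyond this line is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: print a Spack-style version() line for a locally obtained archive.
# spack_version_line is a hypothetical helper; the archive name is a placeholder.
spack_version_line() {
    local version="$1" tarball="$2"
    local sum
    [ -f "$tarball" ] || { echo "no such file: $tarball" >&2; return 1; }
    # sha256sum prints "<hash>  <file>"; keep only the hash.
    sum=$(sha256sum "$tarball" | awk '{print $1}')
    printf '    version("%s", sha256="%s")\n' "$version" "$sum"
}

# Usage:
#   spack_version_line 6.4.2 ./vasp.6.4.2.tgz
```

The printed line can then be pasted next to the existing version entries in the file opened by `spack edit vasp`.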
Kind regards,
Jelle Lagerweij
-
- Global Moderator
- Posts: 531
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
I hope what you wrote helps.
If possible, please first try this toolchain:
3.2.0_aocl-3.1_ompi-4.1.2, amdscalapack/3.1, amdblis/3.1
This is what we use and it is very stable.
-
- Newbie
- Posts: 9
- Joined: Mon Dec 10, 2012 7:15 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
On Mar 01, we compiled VASP 6.4.2 with OpenMPI and OpenMP on a cluster of 2x AMD EPYC 7713 nodes (128 cores per node) after loading the modules aocc/4.1.0 openmpi/4.1.6 amdblis/4.1 amdlibflame/4.1 amdscalapack/4.1 amdfftw/4.1. The compilation seemed to be successful, but calculations always stop with the following error:
| VERY BAD NEWS! internal error in subroutine SGRGEN: Too many |
| elements 49 |
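To see quickly which runs hit this error (for example across the test suite directories), something like the following sketch could be used. It searches a directory tree for files that contain the SGRGEN message. The function name `find_sgrgen_failures` is made up, and it assumes the message appears on one line, as in the box above.

```shell
#!/usr/bin/env bash
# Sketch: list all files under a directory tree that contain the SGRGEN
# internal error. find_sgrgen_failures is a hypothetical helper name.
find_sgrgen_failures() {
    local root="${1:-.}"
    # -r: recurse, -l: print only the names of matching files
    grep -rl -- 'internal error in subroutine SGRGEN' "$root" | sort
}

# Usage:
#   find_sgrgen_failures ./testsuite
```

If only some runs are listed, the failure depends on the input; if every run is listed, the build or the runtime environment is the more likely suspect.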
-
- Newbie
- Posts: 25
- Joined: Wed Jul 20, 2022 7:18 am
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
jelle_lagerweij wrote: ↑Wed Feb 21, 2024 9:48 am Dear Ferenc, I believe that we got this error every time. [...]
Hi, have you solved this problem? I have run into the same problem, but only for one specific model. The error vanishes when calculating other models.
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Small update: I have not been able to solve this issue, and neither has my system administrator. The testing machine (with the extra-large cache) is no longer available to us, and I went back to using the openmpi/openblas/netlib-scalapack/fftw3 installation with gcc11. I am still not sure what exactly the issue is. I created the toolchain mentioned by Ferenc with spack, but still had some issues while compiling, and my old installation was working fine. We were just unsure whether we were getting the most out of our compute time, and interested in how much impact the change in compiler and math libraries would have.
Kind regards,
Jelle
-
- Newbie
- Posts: 20
- Joined: Wed Nov 06, 2019 3:12 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
To whom it may concern,
I have compiled VASP 6.5.1 on a cluster of dual-node AMD EPYC 7713 64C 2GHz nodes with GCC 13.2, OpenMPI 5.0.2, and AOCL 4.1.0. AOCL 4.1.0 provides flame (LAPACK), blis (BLAS), fftw, and ScaLAPACK. I used the makefile.include.gnu_ompi_aocl arch file distributed with VASP, with minor modifications. This build passes all the tests in the VASP test suite.
I have also compiled VASP 6.5.1 with OpenMPI 5.0.7 and Zen Studio 5.0.0, which contains the AOCC compilers and AOCL libraries. I needed to append "<path to gcc libraries>/libgcc.a -lunwind" to LLIBS at the end of the makefile.include.aocc_ompi_aocl arch file. I also needed to define the variable LDFLAGS="--rtlib=compiler-rt -lunwind" to configure OpenMPI 5.0.7.
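A small sketch of how the LLIBS addition could be generated is given below. It only prints the line to append to the arch file, using the path to libgcc.a that is passed in. The helper name `llibs_addition` is invented for this example, and the `LLIBS +=` form assumes the usual make syntax of makefile.include.

```shell
#!/usr/bin/env bash
# Sketch: print the LLIBS line described above for a given libgcc.a path.
# llibs_addition is a hypothetical helper name.
llibs_addition() {
    local libgcc="$1"
    [ -f "$libgcc" ] || { echo "libgcc.a not found at: $libgcc" >&2; return 1; }
    printf 'LLIBS += %s -lunwind\n' "$libgcc"
}

# Usage (asks the installed gcc where its libgcc.a lives):
#   llibs_addition "$(gcc -print-libgcc-file-name)" >> makefile.include
```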
The AOCC-compiled vasp binaries are one percent faster than the GCC-compiled binaries, which was hardly worth the additional effort of compiling OpenMPI and VASP with AOCC.
I recommend building VASP with a recent version of GCC and using the AOCL libraries (compiled with GCC).
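For anyone who wants to repeat such a timing comparison, here is a minimal sketch. It reads the wall time from two OUTCAR files and prints by how many percent the second run took less time than the first. It assumes that each OUTCAR contains a line of the form "Elapsed time (sec): <number>". The function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compare wall times of two runs of the same job.
# Assumes OUTCAR has a line "Elapsed time (sec): <number>".
# elapsed_seconds and percent_faster are hypothetical helper names.
elapsed_seconds() {
    # Print the last "Elapsed time" value; fail if the line is missing.
    awk '/Elapsed time \(sec\):/ { t = $NF } END { if (t == "") exit 1; print t }' "$1"
}

percent_faster() {
    # Time reduction of run B relative to run A, in percent.
    local a b
    a=$(elapsed_seconds "$1") || return 1
    b=$(elapsed_seconds "$2") || return 1
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

# Usage:
#   percent_faster gcc_build/OUTCAR aocc_build/OUTCAR
```

A difference of around one percent is easily within run-to-run noise, so it may be worth repeating each run a few times before drawing conclusions.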