Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries


dantasqu
Newbie
Posts: 2
Joined: Mon May 08, 2023 11:45 pm


#1 Post by dantasqu » Wed Oct 04, 2023 5:52 am

I'm facing an issue with the 'Refit' mode of VASP's MLFF. I knew about the bug 'Incorrect MLFF fast-mode predictions for some triclinic geometries', so I updated to the latest version, but I still see the same problem.

In my current simulation the OSZICAR file stays empty, and the output file only gets as far as 'initializing machine learning' before hanging indefinitely, regardless of the runtime I set.

I've also tried reducing the number of configurations in 'ML_AB' (to half), since the current number is quite high, but this did not change the outcome.

I have also read other posts, but it seems that many people still got some results after the simulation "stopped". So I wonder whether this is yet another memory allocation problem or something else that I could troubleshoot.

I've posted the necessary files for reference on the link below (too large to attach), and any input or guidance on resolving this persistent problem would be greatly appreciated.

Link: https://drive.google.com/drive/folders/ ... sp=sharing

ferenc_karsai
Global Moderator
Posts: 422
Joined: Mon Nov 04, 2019 12:44 pm

Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries

#2 Post by ferenc_karsai » Wed Oct 25, 2023 9:15 am

So I've run your calculation on 64 cores. After a few hours I get the following:
"xxmr2d:out of memory"

This comes from inside the ScaLAPACK routines for the SVD, where data is redistributed internally. For that, helper arrays are allocated with malloc. If the size of these helper arrays (which are unfortunately 1D) exceeds 2**31, i.e. the range of a 4-byte integer, this error message appears. These arrays get smaller the more computational cores one uses, since they are distributed over the cores and each core only needs to allocate its own part.
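To make the scaling argument concrete, here is a rough sketch of the arithmetic. It assumes a dense matrix spread evenly over the MPI ranks; the real ScaLAPACK helper buffers depend on the block-cyclic layout and on routine internals, so this only illustrates the trend. The function names and the matrix dimensions are hypothetical and are not taken from the calculation discussed in this thread.

```python
# Simplified model: an evenly distributed dense matrix. The actual ScaLAPACK
# helper arrays may be sized differently; this only shows the scaling.
INT32_MAX = 2**31 - 1  # largest value a signed 4-byte integer can hold


def elements_per_rank(rows: int, cols: int, nprocs: int) -> int:
    """Upper estimate of the matrix elements held by one rank (ceiling division)."""
    total = rows * cols
    return -(-total // nprocs)


def fits_int32(rows: int, cols: int, nprocs: int) -> bool:
    """True if the per-rank element count can still be indexed with an int32."""
    return elements_per_rank(rows, cols, nprocs) <= INT32_MAX


if __name__ == "__main__":
    rows, cols = 500_000, 400_000  # hypothetical dimensions for illustration
    for nprocs in (40, 64, 128):
        n = elements_per_rank(rows, cols, nprocs)
        print(f"{nprocs:4d} ranks: {n:>14,d} elements per rank, "
              f"fits int32: {n <= INT32_MAX}")
```

With these made-up dimensions the per-rank count is above the limit on 40 and 64 ranks and below it on 128, which is the same pattern as described above.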

So I reran the calculation with 128 cores and it went through fine.

You ran on 40 cores (I saw it from the OUTCAR) which is definitely not enough, but it's strange you don't get an error.
Please try the calculation with more cores, to be safe at least with 128.

jelle_lagerweij
Newbie
Posts: 15
Joined: Fri Oct 20, 2023 1:13 pm

Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries

#3 Post by jelle_lagerweij » Mon Nov 27, 2023 8:42 am

This question and answer were very useful for me as well. I ran into the same problem when refitting an FF for a liquid-phase system. At first I was quite surprised, as I only used 650 GB of the nearly 1500 GB of memory available and still got this error message. Now I understand that the issue is caused by the size of a single array exceeding what a 4-byte integer can index, not by the total amount of memory used. I am testing the solution (using more cores) and will see how it works out. However, I must note that it would be nice if this error could be avoided by adjusting the ML algorithm, as it forces me to use more high-memory nodes than I strictly need memory-wise. Using more cores just to get shorter arrays on each core does not feel like an appropriate long-term solution ;).
Regards,
Jelle
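A small back-of-the-envelope check shows why the error can appear long before the node's memory is full. The sketch below assumes 8-byte double-precision elements; the helper names are made up for illustration.

```python
INT32_MAX = 2**31 - 1    # largest index/count a signed 4-byte integer can hold
BYTES_PER_DOUBLE = 8     # assumption: double-precision real elements


def array_gib(n_elements: int, bytes_per_element: int = BYTES_PER_DOUBLE) -> float:
    """Size in GiB of a flat array with n_elements entries."""
    return n_elements * bytes_per_element / 2**30


def overflows_int32(n_elements: int) -> bool:
    """True if the element count no longer fits in a signed 4-byte integer."""
    return n_elements > INT32_MAX


if __name__ == "__main__":
    n = 2**31  # first element count that no longer fits
    print(f"{n} doubles take {array_gib(n):.1f} GiB, overflow: {overflows_int32(n)}")
```

Under this assumption a single 1D array of only 16 GiB already exceeds the 4-byte integer range, so the limit is about the element count of one array and not about how much of the node's memory is in use.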

ferenc_karsai
Global Moderator
Posts: 422
Joined: Mon Nov 04, 2019 12:44 pm

Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries

#4 Post by ferenc_karsai » Tue Dec 12, 2023 3:14 pm

The clean fix for this will come when ScaLAPACK officially switches from integer4 to integer8. That will completely solve the problem.
Until then there is not much we can do, since we absolutely need the parallel SVD solvers from ScaLAPACK. It is also hard to know in advance when this problem will occur, so writing warnings is not easy either.

dantasqu
Newbie
Posts: 2
Joined: Mon May 08, 2023 11:45 pm

Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries

#5 Post by dantasqu » Fri Dec 15, 2023 7:11 pm

Hey everyone, thank you for the input on the problem. I've been trying to run on 128 cores as suggested, but I still have some problems with it. Could you recommend a compiler and MPI to try? I've tried builds with the two environments below:

FIRST:
module load intel/19.0.4
module load intel-mpi
module load intel-mkl
module load cuda

SECOND:
module load gcc/11.3.0
module load openmpi/4.1.4
module load hdf5/1.12.2
module load intel-oneapi/2021.3
module unload intel-oneapi-mpi/2021.3

Thanks,

ferenc_karsai
Global Moderator
Posts: 422
Joined: Mon Nov 04, 2019 12:44 pm

Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries

#6 Post by ferenc_karsai » Thu Dec 28, 2023 10:04 am

I don't think the compilers are the problem. It depends rather on the size of your calculation. If the calculation is huge, 128 cores may also not be enough. So my suggestion is to try more cores, maybe 256 or more, until the problem goes away.
If that still does not help, try this toolchain:
Intel Fortran 22.0.1 with Intel MPI 21.5.0
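The "keep increasing the cores" advice can be sketched as a simple doubling search. This is only an estimate under the same simplified assumption of an evenly distributed array; the total element count has to be supplied by the user, and the function name and the example values are hypothetical.

```python
from typing import Optional

INT32_MAX = 2**31 - 1  # largest value a signed 4-byte integer can hold


def cores_needed_by_doubling(total_elements: int,
                             start_cores: int = 64,
                             max_cores: int = 65536) -> Optional[int]:
    """Double the core count until the per-core element count fits in int32.

    Returns the first core count that fits, or None if max_cores is exceeded.
    """
    cores = start_cores
    while cores <= max_cores:
        per_core = -(-total_elements // cores)  # ceiling division
        if per_core <= INT32_MAX:
            return cores
        cores *= 2
    return None


if __name__ == "__main__":
    for total in (2 * 10**11, 5 * 10**11):  # hypothetical array sizes
        print(f"{total:.1e} elements -> {cores_needed_by_doubling(total)} cores")
```

For the two made-up sizes the search stops at 128 and 256 cores respectively. In practice the actual threshold has to be found by running the job, since the real helper array sizes are not known in advance.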
