MLFF Getting Stuck in Learning

#1 Post by **burakgurlek** » Thu Oct 19, 2023 12:00 pm

Dear all,

I am testing MLFF, which works well, and have collected 1183 structures. However, when I continue learning for an additional 3000 steps with POTIM = 0.5 fs, the run gets stuck at the 3000th step. The algorithm does not trigger any learning during the first 2999 steps, which take 1.5 h. The remaining 22.5 h are spent in the last step, and the job is then cancelled when it reaches the time limit of the cluster.

I suspect that the number of reference configurations (~10000) is too large to fit, but I do not know exactly why the last step could not be completed in 22.5 h. The job has 750 GB x 4 of memory available. I would appreciate any help. The files are attached.

Regards,
Burak

#2 Post by **andreas.singraber** » Thu Oct 19, 2023 2:05 pm

Dear Burak,

indeed it seems there is an issue here. The initial learning in step 0 was successful, so we can assume that there was enough memory in the beginning. During the MD run two additional configurations were collected in the "threshold" steps and kept as candidates for fitting. Because the buffer for new configurations (by default ML_MCONF_NEW = 5) was not filled before the very end of the trajectory, training was only triggered at the end. Usually, if many configurations are added during MD, there can be memory issues at the training step (lazy allocations), but in your case only very little extra space was used. So we would expect the final training to work correctly as well.
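The triggering behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not VASP's actual implementation: the function name `simulate_training_triggers` and the two threshold step numbers are hypothetical, and the only rules modelled are "train when the candidate buffer is full" and "train at the last step if any candidates are pending".

```python
def simulate_training_triggers(n_steps, threshold_steps, buffer_size=5):
    """Return the MD steps at which this toy model would trigger training.

    Assumption (not VASP code): each "threshold" step adds one candidate
    configuration; training runs when the buffer holds `buffer_size`
    candidates, or at the final step if any candidates are still pending.
    """
    thresholds = set(threshold_steps)
    pending = 0           # candidates waiting in the buffer
    training_steps = []
    for step in range(1, n_steps + 1):
        if step in thresholds:
            pending += 1
        buffer_full = pending >= buffer_size
        last_step = step == n_steps
        if pending > 0 and (buffer_full or last_step):
            training_steps.append(step)
            pending = 0   # buffer is emptied after training
    return training_steps


# Two candidates (hypothetical step numbers) never fill a buffer of 5,
# so the only training happens at the very last step of the 3000-step run.
print(simulate_training_triggers(3000, [812, 2140]))
```

With only two candidates in 3000 steps, the model reproduces the observed pattern: no training during the trajectory and a single training pass at the end.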

To further investigate this issue please send us all the files we would need to start the training, i.e., additionally the POSCAR, KPOINTS and POTCAR files. Also, please add the OUTCAR, OSZICAR and job submission script as listed in our forum posting guidelines. If possible, try to run the exact same simulation again so we can be sure it is reproducible on your side.

All the best,
Andreas Singraber and Ferenc Karsai

#3 Post by **burakgurlek** » Fri Oct 20, 2023 12:21 pm

Dear Andreas and Ferenc,

thanks for the answer, and sorry for the incomplete file set. You can find the requested files via

https://www.dropbox.com/scl/fo/ubdrs42b ... zjycq&dl=0

This is a continuing learning run, and the problematic run was the 25th, corresponding to step 250000, so it can take time to reproduce everything. However, I ran this problematic step more than once and the result was the same.

Regards,
Burak

#4 Post by **burakgurlek** » Thu Dec 14, 2023 10:26 am

Dear Andreas and Ferenc,

I would like to follow up on this issue. Do you have any update? I still have this problem.

Best Regards,
Burak

#5 Post by **ferenc_karsai** » Thu Dec 28, 2023 10:09 am

I had to suddenly take over for Andreas.

I cannot access the uploaded files. Please always upload the small files here so that they can always be downloaded. Larger files can additionally be uploaded to an external platform.

#6 Post by **burakgurlek** » Fri Dec 29, 2023 1:06 am

Sorry, my Dropbox was out of sync. Here are the files:

https://www.dropbox.com/scl/fo/ubdrs42b ... dy34y&dl=0

Regards,
Burak

#7 Post by **ferenc_karsai** » Mon Jan 08, 2024 9:20 am

So I took the ML_AB, POSCAR, and INCAR that you uploaded and set NSW=3000 in the INCAR to continue training for an additional 3000 steps.
As in your case, learning was only done in the 3000th step, but for me everything works fine. On 64 cores of a newer AMD Zen processor it took 415 seconds.

So it must be a problem with the VASP installation on your side. I see you are using VASP.6.4.1.
Please download the latest version, VASP.6.4.2, and maybe try different compilers.

Also, for the moment, try to run on the same number of cores I used, which is 64, and use the same INCAR file that I used here:

Code:

SYSTEM = Naphtalene
ISYM   = 0        ! no symmetry imposed

! ab initio
PREC   = A
IVDW   = 2
ALGO = FAST

ISMEAR = 0
SIGMA  = 0.04  ! smearing in eV

ENCUT  = 1000
EDIFF  = 1e-6
NBANDS = 320

LWAVE  = F
LCHARG = F
LREAL  = F

! MD
IBRION = 0        ! MD (treat ionic degrees of freedom)
NSW    = 3000    ! no of ionic steps
POTIM  = 0.5      ! MD time step in fs

MDALGO = 4
NHC_NCHAINS = 4
TEBEG  = 295              ! temperature

ISIF = 2        ! update positions, no cell shape and volume

! machine learning
ML_LMLFF  = T
ML_MODE = train
ML_WTSIF  = 2
ML_IALGO_LINREG=1
ML_SION1=0.3
ML_MRB2=12


# LPLANE = .TRUE. ! if NGZ = 3*(number of cores)/NPAR = 3*NCORE
NCORE  = 4
KPAR = 2
ML_CTIFOR = 1.94395421E-02

#8 Post by **burakgurlek** » Thu Feb 15, 2024 12:28 am

Dear Ferenc,

Thanks for your help. I followed your steps, and it indeed initially worked using 64 CPUs, NCORE=4 and KPAR=2. I then changed the number of cores to 72, with NCORE=12 and KPAR=2, to speed up the calculations and continue the learning. This worked for a 30 ps learning run (10 x 3 ps). However, I got the same hanging problem again in the next 3 ps continuation run. Even continuing from the last ML_AB for a single time step got stuck. I was able to run this last step successfully when I used your suggested setup of 64 CPUs, NCORE=4 and KPAR=2.
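For reference, the two parallel setups mentioned here can be compared with a short sketch. It assumes the usual VASP decomposition, in which the MPI ranks are first split into KPAR k-point groups and each group is then divided into band groups of NCORE ranks; the helper name `parallel_layout` is hypothetical and the script only does the arithmetic, it does not query VASP.

```python
def parallel_layout(total_cores, ncore, kpar):
    """Return ranks per k-point group and the number of band groups.

    Assumes: ranks_per_kgroup = total_cores / KPAR and
             band_groups      = ranks_per_kgroup / NCORE.
    Raises ValueError if the core count does not divide evenly.
    """
    if total_cores % kpar != 0:
        raise ValueError("total_cores must be divisible by KPAR")
    ranks_per_kgroup = total_cores // kpar
    if ranks_per_kgroup % ncore != 0:
        raise ValueError("ranks per k-point group must be divisible by NCORE")
    return {
        "ranks_per_kgroup": ranks_per_kgroup,
        "band_groups": ranks_per_kgroup // ncore,
    }


# The two setups from this thread, both with KPAR=2.
for cores, ncore in [(64, 4), (72, 12)]:
    print(cores, "cores, NCORE =", ncore, "->", parallel_layout(cores, ncore, kpar=2))
```

Under these assumptions the 64-core run uses 8 band groups per k-point group and the 72-core run uses 3, so the two setups distribute the work quite differently.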

I am a bit puzzled here, as changing the number of cores worked at first but then the run got stuck again. If it were a compiler issue, I would expect consistent behavior. Do you have an idea why this problem can arise when using more than one node and suddenly appear at certain stages? I have also observed the same problem in other simulations. I also tried version 6.4.2, and it did not resolve the issue. Moreover, with 64 cores NBANDS is updated, which may change something.

You can reach the files via the links:
The stuck step: https://www.dropbox.com/scl/fo/nfekxt5c ... z6nr8&dl=0
The single step continuing from the last stuck step: https://www.dropbox.com/scl/fo/qh2e7f8u ... z4b6j&dl=0
Rerun the stuck step with 64 cores: https://www.dropbox.com/scl/fo/7joq3nu4 ... qyldt&dl=0

Regards,
Burak

#9 Post by **ferenc_karsai** » Fri Feb 16, 2024 8:35 am

Ok, as I understand from your post, everything that is on one node works, but as soon as you go to multiple nodes it hangs.

Do you use "-Duse_shmem" alone or do you also use "-Dsysv" in your compilation?

#10 Post by **burakgurlek** » Fri Feb 16, 2024 10:39 am

Dear Ferenc,

I have not compiled it myself. Could you let me know how I can check this?

I also tried two nodes with 64 cores each, and so far it works. Moreover, version 6.4.2 also seems to work so far, but it may still fail, as I have observed before.

Regards,
Burak

#11 Post by **burakgurlek** » Fri Feb 16, 2024 10:44 am

Just one quick update: version 6.4.2 also got stuck on another run.

Regards,

#12 Post by **ferenc_karsai** » Thu Feb 22, 2024 3:51 pm

Ok, I've run your calculation with 64 and 72 cores and it finishes fine. I ran it with the AOCC compiler using System V shared memory (-Duse_shmem and -Dsysv). I'm not going to test all toolchains now because the calculation is quite time consuming.
Could you find out the toolchain with which it fails for you? It could be a problem with your compilation, but it could also be a bug that shows up only with a specific compiler (I have seen that in the past).

Also, as I already wrote, please ask whether -Dsysv was used in the compilation. Try compiling with the opposite setting: if -Dsysv was used, recompile without it and rerun the calculation, and vice versa. Shared memory can sometimes be a source of errors. You can't run without it, though, because the job is too big to fit into memory without it on so many cores.
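If the makefile.include used for the build is available (for example from whoever installed VASP on the cluster), the two flags can be looked up with a small script. This is a minimal sketch: the helper name `find_cpp_flags` and the example text are hypothetical, and it only scans text for `-D` options outside of `#` comments.

```python
import re


def find_cpp_flags(makefile_text, flags=("use_shmem", "sysv")):
    """Report which -D<flag> options appear outside of '#' comments."""
    found = {flag: False for flag in flags}
    for line in makefile_text.splitlines():
        code = line.split("#", 1)[0]   # drop everything after a comment sign
        for flag in flags:
            if re.search(r"-D" + re.escape(flag) + r"\b", code):
                found[flag] = True
    return found


# Hypothetical excerpt: -Duse_shmem is active, -Dsysv is commented out.
example = "CPP_OPTIONS = -DMPI -Duse_shmem\n# CPP_OPTIONS += -Dsysv\n"
print(find_cpp_flags(example))
```

For a real build one would pass the file contents instead, e.g. `find_cpp_flags(open("makefile.include").read())`, where the path is an assumption about where the build files are kept.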
