ML Restart Initialisation stuck when more than one process per node

reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

#1 Post by reach2sayan » Mon Nov 14, 2022 9:20 pm

Hi,


The issue arises when I restart an ML training run with a new species added.
That is, I use ML_ISTART = 1 and my ML_AB contains Zr and Cu, but the new POSCAR that I am starting to train on contains Zr, Cu and Al. In that case, in the OpenMP version,
setting
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

the initialisation of the machine learning seems to get stuck. This does not happen with

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
That is, it gets stuck only if I request more than one process per node.

I'm simulating liquid structures, so I have fairly high ML_MB and ML_MCONF. I also have ML_LBAND_DISCARD = True. But the issue occurs irrespective of these settings. My old ML_AB does contain a large number of structures, though.

I have attached the ML_AB, the INCAR and an example job submission script. I would appreciate any suggestions on the INCAR, especially on whether ML_MB = 6000 is excessive even for a liquid.

Thank you

henrique_miranda
Global Moderator
Posts: 414
Joined: Mon Nov 04, 2019 12:41 pm

#2 Post by henrique_miranda » Mon Nov 21, 2022 4:02 pm

OK, we had a look at your input files, but we need a bit more information:
Could you please share the OUTCAR and ML_LOGFILE?
Did you compile VASP with support for shared memory?
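
One way to check the shared-memory question is to look for the `-Duse_shmem` precompiler flag in the makefile.include that was used for the build. The snippet below is only a minimal sketch: the helper name `has_shmem_flag` is made up for illustration, and the default path assumes the file sits in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a makefile.include enables -Duse_shmem.
# Lines that start with '#' are makefile comments and are ignored.
has_shmem_flag() {
    grep -v '^[[:space:]]*#' "$1" | grep -q -e '-Duse_shmem'
}

# The path is an assumption; point it at the file used for your build.
makefile="${1:-makefile.include}"

if [ -f "$makefile" ]; then
    if has_shmem_flag "$makefile"; then
        echo "use_shmem: enabled in $makefile"
    else
        echo "use_shmem: not found in $makefile"
    fi
else
    echo "no $makefile here; pass the path as the first argument"
fi
```

This only inspects the build configuration. It does not show whether the installed binary was actually built from that file.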

A few suggestions that you might try:
1. There is no support for OpenMP in the machine learning part of the code, so using it might not lead to a great speedup.
2. You might reduce ML_MCONF to, for example, 1500, which would significantly reduce the memory usage of your calculation.
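
Following suggestion 1, a job script that keeps one OpenMP thread per MPI rank could look like the sketch below. It is not a confirmed fix for the hang. The binary name `vasp_std`, the `srun` launcher and the `VASP_CMD` variable are assumptions to be replaced with whatever the local installation uses.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # two MPI ranks on the node
#SBATCH --cpus-per-task=1     # one core per rank, so no extra OpenMP threads

# Take the thread count from the SLURM allocation; default to 1 outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Assumed launcher and binary name; adjust to the local installation.
VASP_CMD="${VASP_CMD:-srun vasp_std}"

echo "ranks per node: ${SLURM_NTASKS_PER_NODE:-unknown}, threads per rank: ${OMP_NUM_THREADS}"

# Launch only where the launcher exists, so the script can be dry-run elsewhere.
if command -v "${VASP_CMD%% *}" >/dev/null 2>&1; then
    $VASP_CMD
else
    echo "launcher '${VASP_CMD%% *}' not found; skipping launch"
fi
```

Printing the rank and thread counts at the top of the job output makes it easier to compare a run that hangs with one that does not.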

reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

#3 Post by reach2sayan » Tue Dec 27, 2022 11:01 pm

Hi,

Sorry for the delay. I needed some preliminary data quickly and hence kept running with a single task per node. But I think it's time to fix the issue. I have a feeling that the problem is not with the ML but with OpenMP itself, since the same issue persists without ML. I'm attaching the OUTCAR and stdout (vasp.out) for the single-task-per-node case as well as for the two-tasks-per-node test.

The attachment contains the following main files:
OUTCAR.2task - the failed OUTCAR; it always gets stuck at that last line
OUTCAR.1task - I soft-stopped this run after 3 ionic steps
The corresponding job scripts and stdout files are named *.2task and *.single
Then there is the makefile.include. The INCAR and POSCAR are the same across all runs. They are the same as before, except that I removed the ML tags, reduced EDIFF and increased KPAR, all for quicker runs (I also got more nodes to match KPAR).

Best
Sayan

PS. I don't think shared memory is turned on. I will make sure to ask the sysadmin specifically for this at the next recompile.
Side note: could you briefly explain the difference between ML_MB and ML_MCONF? I don't quite understand which items each tag sets the limit for.

reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

#4 Post by reach2sayan » Sat Feb 04, 2023 10:29 pm

After lots of trials, I finally managed to make it work.

ferenc_karsai
Global Moderator
Posts: 422
Joined: Mon Nov 04, 2019 12:44 pm

#5 Post by ferenc_karsai » Sun Feb 05, 2023 8:27 am

Could you please elaborate on what the problem was and how you solved it?
