ML Restart Initialisation stuck when more than one process per node

Posted: Mon Nov 14, 2022 9:20 pm
by reach2sayan
Hi,


The issue I face arises when I restart an ML training with a new species,
i.e. I use ML_ISTART = 1 and my ML_AB has Zr and Cu, but the new POSCAR that I'm starting to train on has Zr, Cu and Al. In that case, in the OpenMP version,
when setting
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

the initialisation of the machine learning seems to get stuck. This does not happen with

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
That is, it gets stuck only if I declare more than one process per node.

I'm training on liquid structures, so I have a high-ish ML_MB and ML_MCONF. I also have ML_LBAND_DISCARD = True, but the issue happens irrespective of that setting. My old ML_AB has a large number of structures, though.

I have attached the ML_AB, INCAR and an example job submission script here. I would appreciate any suggestions on the INCAR, especially on whether ML_MB = 6000 is too diabolical even for a liquid.

Thank you

Re: ML Restart Initialisation stuck when more than one process per node

Posted: Mon Nov 21, 2022 4:02 pm
by henrique_miranda
Ok, we had a look at your input files but we need a bit more information:
Could you please share the OUTCAR and ML_LOGFILE?
Did you compile VASP with support for shared memory?

A few suggestions that you might try:
1. There is no support for OpenMP in the machine learning part of the code, so using it might not lead to a great speedup.
2. You might reduce ML_MCONF to, for example, 1500 and significantly reduce the memory usage of your calculation.
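
For suggestion 1, a minimal sketch of a job script that keeps two MPI ranks per node but gives each rank a single OpenMP thread, so that a hybrid build runs effectively MPI-only. The executable name vasp_std, the VASP_BIN variable and the srun launcher are assumptions about the cluster, not taken from the attached script. The launch line is only printed, so the snippet can be tried outside Slurm. This is not a diagnosis of the hang, just a layout to test against.

```shell
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1

# One OpenMP thread per MPI rank unless Slurm reports a different allocation.
threads_per_rank() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(threads_per_rank)"

# VASP_BIN is a hypothetical variable; vasp_std is an assumed executable name.
launch_cmd() {
    echo "srun --cpus-per-task=${OMP_NUM_THREADS} ${VASP_BIN:-vasp_std}"
}

# Print the launch line instead of executing it.
launch_cmd
```

On the cluster, the printed command would be run directly in place of the final line.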

Re: ML Restart Initialisation stuck when more than one process per node

Posted: Tue Dec 27, 2022 11:01 pm
by reach2sayan
Hi,

Sorry for the delay. I needed some preliminary data quickly and hence kept running with a single task per node. But I think it's time to fix the issue. I have a feeling that the problem is not in the ML part but in the OpenMP setup itself, since the same issue persists without ML. I'm attaching the OUTCAR and stdout (vasp.out) for the single-task-per-node case as well as for the two-tasks-per-node test.

The attachment contains the following main files:
OUTCAR.2task - failed OUTCAR, it always gets stuck at that last line
OUTCAR.1task - I soft stopped after 3 ionic steps
The corresponding job script and stdout are named as *.2task and *.single
Then there is the makefile.include. The INCAR and POSCAR are consistent across all runs. They are the same as before, but I removed the ML tags, reduced EDIFF and increased KPAR, all for quicker runs (I also got more nodes to match KPAR).
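
A rough sketch of how the match between node count and KPAR can be checked before submitting, assuming that the total number of MPI ranks should be a multiple of KPAR. The function name and the example values (4 nodes, 2 tasks per node, KPAR = 4) are hypothetical and not taken from the attached files.

```shell
#!/bin/sh
# Succeeds when nodes * tasks_per_node splits evenly into KPAR groups.
# Arguments: $1 = nodes, $2 = tasks per node, $3 = KPAR.
ranks_match_kpar() {
    [ "$3" -gt 0 ] && [ $(( $1 * $2 % $3 )) -eq 0 ]
}

# Hypothetical example values: 4 nodes, 2 tasks per node, KPAR = 4.
if ranks_match_kpar 4 2 4; then
    echo "rank count is a multiple of KPAR"
else
    echo "rank count is not a multiple of KPAR"
fi
```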

Best
Sayan

PS. I don't think shared memory is turned on. I will make sure to ask the sys admin specifically for this on the next re-compile.
Side note: Could you quickly explain the difference between ML_MB and ML_MCONF? I don't exactly understand the difference between the items that each tag limits.

Re: ML Restart Initialisation stuck when more than one process per node

Posted: Sat Feb 04, 2023 10:29 pm
by reach2sayan
After lots of trials, I finally managed to make it work.

Re: ML Restart Initialisation stuck when more than one process per node

Posted: Sun Feb 05, 2023 8:27 am
by ferenc_karsai
Could you please elaborate on what the problem was and how you solved it?