Initialization of design matrix failed

Posted: Sun Feb 04, 2024 12:07 am
by jie_yao2
Hi VASP team,

I hope this message finds you well. I am currently training an MLFF on ab initio data.

When I use a previous ML_ABN file (renamed to ML_AB) to restart, VASP reports: ERROR, First Initialization of design matrix (FFM%FMAT) failed.

With the same input files, the job starts without problems on the same type of GPU at a supercomputer center.

It may be due to a different installation. Why does the design matrix initialization fail on my local GPU? What could the problem be, and how can I solve it?
(The files are attached.)

Thank you very much for your time,
Jie

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 9:32 am
by manuel_engel1
Hi Jie,

I'll try to assist you with your problem, but first, I would require a bit more information. Are you using the same version of VASP in both cases? What kind of GPUs are you running on? Could you please also attach the OUTCAR files and standard output of the two runs?

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 9:53 am
by jie_yao2
Hi Manuel,

Thank you for your reply.

Both VASP installations are version 6.4.2. Both GPUs are A100s.

The OUTCAR1 from the local GPU is attached. The OUTCAR from the supercomputer center is too large to upload, so
I extracted the first 100,000 lines and named it OUTCAR2.

Jie

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 10:07 am
by manuel_engel1
Perfect, thanks. I will look into it.

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 11:55 am
by manuel_engel1
I consulted with our machine-learning experts. It is likely that you run out of memory on your local machine. The error message you encounter is generated from a failed allocation statement in the code.

The ML_LOGFILE contains information regarding the memory requirements of the calculation. Could you please also attach this file? How much memory do you have available on your local and on the remote machine?
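
As a rough sketch (not an official VASP tool), the memory-related lines of the ML_LOGFILE can be pulled out with a few lines of Python. The function name `memory_lines` is hypothetical, and the plain keyword match is an assumption about the wording of the file, not a documented format.

```python
import re
from pathlib import Path
from typing import List


def memory_lines(log_text: str) -> List[str]:
    """Return every line of the log text that mentions 'memory' (case-insensitive)."""
    pattern = re.compile(r"memory", re.IGNORECASE)
    return [line.rstrip() for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    # Assumption: ML_LOGFILE sits in the current working directory.
    log_path = Path("ML_LOGFILE")
    if log_path.exists():
        for line in memory_lines(log_path.read_text(errors="replace")):
            print(line)
    else:
        print("ML_LOGFILE not found in the current directory")
```

Comparing the printed values with the memory actually available per process on each machine is one way to see whether a failed allocation is plausible.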

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 12:04 pm
by jie_yao2
Hi Manuel,

It should not be due to memory. I tried another local machine with less memory and it runs well.
The local machine has 80 GB, the same as the remote machine.

The file is attached.

Jie

Re: Initialization of design matrix failed

Posted: Mon Feb 05, 2024 10:16 pm
by jie_yao2
Hi Manuel,

I am wondering whether this is due to the different versions of the HPC SDK and CUDA.

The local A100 GPU runs with NVHPC 23.1 and CUDA 12.0, while the other machines (which work) use CUDA 11.x.

Could you check whether this job runs well with NVHPC 23.1 and CUDA 12.0 on VASP 6.4.2? That might
give a clue about where to look.

Thanks,
Jie

Re: Initialization of design matrix failed

Posted: Tue Feb 06, 2024 9:54 am
by manuel_engel1
Unfortunately, the best advice I can give you is to not use the GPU version of VASP to run the machine-learning code. The ML code does not benefit from GPU parallelization and is, in fact, untested when running VASP on GPU. The error you encounter might be directly related to this. Could you please try to run the code on CPU only and see if the error persists?

Re: Initialization of design matrix failed

Posted: Tue Feb 06, 2024 12:09 pm
by jie_yao2
Hi Manuel,

Do you mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?
(Is using multi-core CPUs more efficient than a GPU when running ML_MODE = select, refit, and production runs?
Any recommendations for efficiency in each ML_MODE stage?)

I tried the CPU version of VASP on CPU only, and the previous error disappeared. However, the GPU is much faster for pure ab initio calculations.

Sorry, one more question about merging different ML_AB files. The VASP wiki page https://www.vasp.at/wiki/index.php/ML_AB
strongly advises grouping structures with the same number of elements and atoms per element in the training data.
Regarding grouping structures, does this mean that for the combined ML_AB I always use one modified header specification and then simply place, for example, the 48-atom structures as configurations 1 to 10, followed by the 50-atom structures as configurations 11 to 20, for a total of 20 structures (Configuration num. 1 to Configuration num. 20)? Is nothing else necessary?

Thanks a lot for the help,
Jie

Re: Initialization of design matrix failed

Posted: Tue Feb 06, 2024 1:07 pm
by manuel_engel1
No worries, I hope I can clear things up.

"Do you mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?"
Yes.

"Is using multi-core CPUs more efficient than a GPU when running ML_MODE = select, refit, and production runs? Any recommendations for efficiency in each ML_MODE stage?"
The ML code does not use GPU parallelization. It is currently a CPU-only code.

"However, the GPU is much faster for pure ab initio calculations."
That is true. Unfortunately, you are currently restricted to CPU with ML in VASP. It also seems that running ML calculations with a GPU involved will produce errors, so I advise against it.
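
One way to follow this advice in a job script is to pick the executable based on whether the INCAR switches on the ML code. The following is only a minimal sketch: the executable names `vasp_std_cpu` and `vasp_std_gpu` and both function names are hypothetical placeholders for whatever builds exist on your system, and the handling of logical values is a simplification of how VASP reads the INCAR.

```python
import re


def ml_enabled(incar_text):
    """Return True if the INCAR text sets ML_LMLFF to a true value."""
    enabled = False
    for raw_line in incar_text.splitlines():
        # Drop trailing comments introduced by '#' or '!'.
        line = re.split(r"[#!]", raw_line, maxsplit=1)[0]
        # Several tags may share one line, separated by ';'.
        for statement in line.split(";"):
            if "=" not in statement:
                continue
            tag, value = statement.split("=", 1)
            if tag.strip().upper() == "ML_LMLFF":
                # Simplification: any value starting with T or .T counts as true.
                enabled = value.strip().upper().lstrip(".").startswith("T")
    return enabled


def choose_executable(incar_text, cpu_exe="vasp_std_cpu", gpu_exe="vasp_std_gpu"):
    """Use the CPU build whenever the ML code is active, otherwise the GPU build."""
    return cpu_exe if ml_enabled(incar_text) else gpu_exe
```

For example, `choose_executable(open("INCAR").read())` would select the CPU build for the training and refit runs, while pure ab initio runs without ML_LMLFF would still go to the GPU build.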

For the additional question regarding the merging of ML_AB files, please open a new topic in the forum.