GW0: problems with CALCULATE_XI_REAL and memory insufficiency

pascal_boulet1
Newbie
Posts: 27
Joined: Thu Nov 14, 2019 7:38 pm

GW0: problems with CALCULATE_XI_REAL and memory insufficiency

#1 Post by pascal_boulet1 » Thu Apr 18, 2024 12:57 pm

Dear all,

I am using VASP 6.4.2 and trying to calculate the dielectric function with GW0. I don’t think my system is too big for this kind of calculation: 79 occupied orbitals and 10 k-points (Gamma-centered 4x4x1 grid).

I have exactly diagonalized the Hamiltonian with 128 bands. Now I am trying to calculate the dielectric function, following the tutorial on Si.

Actually, I am facing two problems: one is the message “CALCULATE_XI_REAL: KPAR>1 not implemented, sorry.”; the other is insufficient memory.

The INCAR is the following:
ALGO = EVGW0R
LREAL = auto
LOPTICS = .TRUE.
LSPECTRAL = .TRUE.
LPEAD = .TRUE.
NOMEGA = 12
NBANDS = 128
NELMGW = 4
ISMEAR = 0
SIGMA = 0.05
EDIFF = 1e-8
KPAR = 8

With this input I get “CALCULATE_XI_REAL: KPAR>1 not implemented, sorry.” and the job stops. Although I understand the message, I am not sure which keyword it relates to. Could you please help me with this?

Now, if I switch to ALGO = EVGW0, the crash disappears. However, with ALGO = EVGW0 I run into a memory shortage instead.

I am using an HPC supercomputer with 8 nodes, 2 processors per node, 64 cores per processor, and 256 GB RAM per node. So that is 2 GB/core, and I should have over 2 TB of RAM for the job.
I am using MPI+OpenMP: 256 MPI ranks with 4 threads per rank. Using pure MPI leads to the same result.

In the OUTCAR I have the following information:
running 256 mpi-ranks, with 4 threads/rank, on 8 nodes
distrk: each k-point on 32 cores, 8 groups
distr: one band on NCORE= 1 cores, 32 groups

total amount of memory used by VASP MPI-rank0 72479. kBytes
available memory per node: 6.58 GB, setting MAXMEM to 6742
...
OPTICS: cpu time 18.2553: real time 4.9331
...
files read and symmetry switched off, memory is now:
total amount of memory used by VASP MPI-rank0 116900. kBytes
...
| This job will probably crash, due to insufficient memory available. |
| Available memory per mpi rank: 6742 MB, required memory: 6841 MB. |

min. memory requirement per mpi rank 6841.5 MB, per node 218927.6 MB

Nothing more.

Note: the command “cat /proc/meminfo | grep MemAvailable” gives “252465220 kB”, but in the log file I get:
available memory per node: 6.58 GB, setting MAXMEM to 6742

The figures in the OUTCAR and the log file look contradictory to what I get from meminfo.

Is there something wrong in my setting or something I misunderstand regarding the memory management?

Thank you for your help and time,
Pascal

michael_wolloch
Global Moderator
Posts: 64
Joined: Tue Oct 17, 2023 10:17 am

Re: GW0: problems with CALCULATE_XI_REAL and memory insufficiency

#2 Post by michael_wolloch » Thu Apr 18, 2024 1:53 pm

Hi Pascal,

please provide a minimal reproducible example of your problem when you post on the forum.

From what I can see from your post, you are having trouble with KPAR.
You set

Code: Select all

KPAR = 8
and the error you receive warns you that KPAR>1 is not implemented:

Code: Select all

CALCULATE_XI_REAL: KPAR>1 not implemented, sorry.
The solution is to set KPAR = 1, which is also the default. However, the cubic-scaling GW routines require more memory than the old GW routines, so while this error will disappear, the memory problem may persist. In that case, you should switch to the old GW routines, but still get rid of KPAR, or at least lower it (to 4 or 2).
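For reference, a minimal INCAR along these lines might look like the following (a sketch based on the INCAR posted above; with KPAR simply omitted, it takes its default of 1):

```
ALGO   = EVGW0R    ! keep the cubic-scaling GW0 routines
NBANDS = 128
NOMEGA = 12
NELMGW = 4
ISMEAR = 0
SIGMA  = 0.05
! KPAR omitted -> defaults to 1
```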

If you use KPAR, memory requirements increase. You set KPAR=8, so, e.g., 16 k-points would be split into 8 groups of cores that work on 2 k-points each. However, every group works on all orbitals and has to keep a copy of all of them in memory, so your memory requirement grows by roughly a factor of KPAR! By setting KPAR=8 you effectively negate the additional memory you gained from adding compute nodes, because you have to store 8 times as much data.
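As a back-of-the-envelope illustration of this scaling (numbers taken from this thread), the effective amount of unique memory shrinks by a factor of KPAR:

```shell
# 8 nodes x 256 GB = 2048 GB in total; with KPAR=8 every k-point group
# keeps its own copy of all orbitals, so the replicated data can only
# be as large as total / KPAR.
total_gb=$(( 8 * 256 ))
kpar=8
echo $(( total_gb / kpar ))   # prints 256 -> as if you ran on a single node
```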

If your problem is memory per core, you can increase it by decreasing the number of cores you use, e.g. fill every node with only 64 MPI ranks, but make sure they are evenly distributed over both sockets. You then have 4 GB/core instead of 2. This also increases memory bandwidth per core, which is often a bottleneck for VASP on very high core-count machines.
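With Slurm, such a half-filled, socket-balanced placement could be sketched as the job-script fragment below (a configuration sketch only; the exact options depend on how your cluster is set up, and vasp_std is a placeholder for your binary):

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64     # half-fill the 128-core nodes -> 4 GB/core
#SBATCH --ntasks-per-socket=32   # spread ranks evenly over both sockets
#SBATCH --cpus-per-task=2        # leave room for 2 OpenMP threads per rank

export OMP_NUM_THREADS=2
srun vasp_std
```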

Make sure to read up on GW calculations here!

Let me know if this helps,
Michael

pascal_boulet1
Newbie
Posts: 27
Joined: Thu Nov 14, 2019 7:38 pm

Re: GW0: problems with CALCULATE_XI_REAL and memory insufficiency

#3 Post by pascal_boulet1 » Mon May 06, 2024 11:32 am

Hello,

Thank you for the hints. I have spent quite some time trying to run the GW0 calculation.

As you said, since I have few k-points, I can set KPAR to 1.

But there is no way! If I select, e.g., 512 cores with 128 orbitals, I have to set NCORE = #cores/#orbitals. If I do not, NCORE defaults to 1, and in that case VASP changes the number of orbitals and then does not read WAVEDER, since that file was created for 128 orbitals.
But if I do set NCORE = #cores/#orbitals, the job fails because VASP complains about a change in the number of k-points between the "DFT" and the "HF" calculations.
And as a workaround, VASP suggests setting NCORE to 1!

The snake bites its tail!

So the number of bands depends on the number of parallel cores used. Is there a way to force VASP to use exactly the number of bands stored in WAVEDER?

Otherwise, I have tried what you suggested: I ran the job on various numbers of nodes (1 node = 128 cores) while setting the number of MPI ranks to 256, which is the number of bands. I reserved full nodes to have all of their memory. Still, even with 16 nodes, the job fails with an out-of-memory message.

I could try with fewer bands... but the website says we should use as many orbitals as we can (NBANDS = maximum number of plane-waves). In my case that is 17000+ plane-waves, so unfeasible.

You can have a look at the files in the archive I have attached. Maybe you can have some ideas.

Thank you,
Best regards,
Pascal

michael_wolloch
Global Moderator
Posts: 64
Joined: Tue Oct 17, 2023 10:17 am

Re: GW0: problems with CALCULATE_XI_REAL and memory insufficiency

#4 Post by michael_wolloch » Mon May 06, 2024 1:42 pm

Dear Pascal,

I am a bit confused by your attached files and the information in your post. In none of your INCAR files is NBANDS set, which corresponds to the single-step GW procedure. However, you mention problems with reading the WAVEDER file, which points to the traditional multi-step approach. What are you trying to do? If you are doing a DFT calculation first, you can set NBANDS there, and also in the GW step. Of course the KPOINTS file, which you did not include in your archive, must also be the same.

From your OUTCAR-4826388 it seems that you are running the single-step GW procedure, which indeed gives you 17000+ orbitals.

So maybe it is enough to simply redo the DFT step(s) of your calculation with an adequate number of bands (e.g. 512 or 1024, since fewer than 100 should be occupied for your system), recalculate WAVEDER and WAVECAR, copy them over, and then set the same NBANDS in your GW0 INCAR.
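The resulting two-step workflow could be sketched like this (a sketch only; the directory names and the 512-band choice are illustrative):

```shell
# Step 1: DFT with exact diagonalization and a fixed number of bands
#   INCAR: ALGO = Exact ; NBANDS = 512 ; LOPTICS = .TRUE.  (writes WAVEDER)
cd dft_step && srun vasp_std

# Step 2: GW0 restart from those orbitals, with the same NBANDS
cp WAVECAR WAVEDER ../gw_step/
#   INCAR: ALGO = EVGW0R ; NBANDS = 512 ; NOMEGA = 12
cd ../gw_step && srun vasp_std
```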

In the calculation on 16 nodes with 16 MPI ranks per node (OUTCAR-4827224) you get further along than in the others, but the memory is still insufficient, since you would need about 19 GB per rank, or ~300 GB per node:

Code: Select all

 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     This job will probably crash, due to insufficient memory available.     |
|     Available memory per mpi rank: 4271 MB, required memory: 18814 MB.      |
|     Reducing NTAUPAR or using more computing nodes might solve this         |
|     problem.                                                                |
|                                                                             |
 -----------------------------------------------------------------------------
You said that you have 256 GB available per node, but the information in the warning above indicates more like ~70 GB (16 ranks × 4271 MB per rank). Is it possible that something else is running on the node, or that you are somehow limiting the amount of accessible memory? Maybe you are not using all available CPU sockets? In your "Janus_GW0_256-4827224.o" file, there is a line in the execution summary:

Code: Select all

Limits    : time = 1-00:00:00 , memory/job = 1775 Mo
Could this mean that your Slurm configuration limits the memory per job in some way? Please talk to your cluster administrators if that is the case.
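If the scheduler is indeed capping memory, explicitly requesting the full node memory in the job script may already help (a configuration sketch; in Slurm, --mem=0 requests all memory available on each node, but check with your administrators whether your partition allows it):

```shell
#SBATCH --exclusive   # reserve the nodes entirely for this job
#SBATCH --mem=0       # request all of each node's memory
```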

You could also fall back to the quartic-scaling routines if you want to keep the single-step procedure, since they use significantly less memory. This will probably still not be enough, however, if your jobs cannot utilize the full memory of your nodes.

I hope that helps, please report back if you get this to work or if you have gathered more information,
Cheers, Michael
