scaling

martonak
Newbie
Posts: 11
Joined: Tue Mar 18, 2008 2:31 pm
License Nr.: 788

scaling

#1 Post by martonak » Wed Jun 10, 2009 3:05 pm

I am running MD calculations with VASP and I am surprised by the scaling. VASP is running on a cluster of blades, each containing 2 quad-core AMD Barcelona CPUs, connected with Infiniband. For a system of about 100 atoms (e.g. 32 CO2 molecules, 512 electrons) I get a speedup of almost 2 when going from 1 to 2 blades (8 to 16 cores). However, when going further from 2 to 4 blades (16 to 32 cores) I get a speedup of barely 1.3. Is this normal behaviour? I ran tests of CPMD on a 32-water-molecule problem (256 electrons) on the same cluster and it scales well up to 64 cores. Is there a way to get VASP to scale to at least 32 cores?

I compiled VASP with ifort version 11, used -DMPI_BLOCK=16000 -DPROC_GROUP=4 -DscaLAPACK, and linked to the Intel MKL libraries (including the FFTW from MKL). I also tried -Duse_collective and the result is practically the same.
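For reference, one consistent way to measure the speedup is to compare the per-ionic-step "LOOP+" timings in the OUTCAR across core counts. A minimal sketch (assuming OpenMPI's mpirun and a vasp binary one directory up; adjust the launch line to your queueing system):

Code:

    #!/bin/bash
    # run the same MD job on 8, 16 and 32 cores and compare timings
    for NCORES in 8 16 32; do
        mkdir -p run_${NCORES}
        cp INCAR KPOINTS POSCAR POTCAR run_${NCORES}/
        ( cd run_${NCORES} && mpirun -np ${NCORES} ../vasp > stdout )
        # the "LOOP+" lines in OUTCAR report the time per ionic step
        grep 'LOOP+' run_${NCORES}/OUTCAR | tail -1
    done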

Thanks in advance.

Roman

metosa
Newbie
Posts: 46
Joined: Sun Jul 22, 2007 2:32 pm
License Nr.: 5-254
Location: Bilkent University, Ankara, TURKEY

scaling

#2 Post by metosa » Wed Jun 10, 2009 5:31 pm

The NPAR and NSIM tags greatly affect the scaling. You should try all combinations of NPAR = 1, 2, 4, 8 and NSIM = 1, 2, 4, 8, for both a small and a large problem.
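Such a sweep is easy to script. A sketch (assuming NPAR and NSIM lines already exist in the INCAR and a 16-core OpenMPI run; adjust the launch line to your setup):

Code:

    #!/bin/bash
    # try every NPAR/NSIM combination and record the last ionic-step time
    for NPAR in 1 2 4 8; do
      for NSIM in 1 2 4 8; do
        sed -i -e "s/^NPAR.*/NPAR = ${NPAR}/" \
               -e "s/^NSIM.*/NSIM = ${NSIM}/" INCAR
        mpirun -np 16 ./vasp > stdout.npar${NPAR}.nsim${NSIM}
        echo -n "NPAR=${NPAR} NSIM=${NSIM}: "
        grep 'LOOP+' OUTCAR | tail -1
      done
    done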

martonak
Newbie
Posts: 11
Joined: Tue Mar 18, 2008 2:31 pm
License Nr.: 788

scaling

#3 Post by martonak » Thu Jun 11, 2009 6:59 am

Sure, I had tried all possibilities for NPAR and found the best value to be equal to the number of blades, so for 16 cores I used NPAR = 2 and for 32 cores NPAR = 4. For NSIM I did not systematically explore all possibilities, but there seemed to be only a small effect (a few %). I am currently using NSIM = 8. I will try the combinations to see whether there could be any improvement.

d-farrell2

scaling

#4 Post by d-farrell2 » Thu Jun 18, 2009 5:40 pm

I have found the scaling of VASP 4.6.34 to be rather poor in general (in my experience, efficiency is very low by the time you reach 16 procs on most architectures), though NSIM and NPAR can be tweaked for better performance (this takes some exploring, as it is problem dependent). Optimized libraries can also help, but don't expect any miracles.

alex
Hero Member
Posts: 568
Joined: Tue Nov 16, 2004 2:21 pm
License Nr.: 5-67
Location: Germany

scaling

#5 Post by alex » Fri Jun 19, 2009 12:44 pm

Hi,

some comments:

In your CPMD vs. VASP test you changed two parameters when moving from one test to the other. Do you really expect anything meaningful from that? This is rather poor science, sorry to say. So run the water box with VASP as well...

Think about the system size: you are practically running 1 CO2 per core, and each one needs communication with the other 31. I think you have simply reached the sensible number of cores for your system.

One other thing about CPMD:
If I remember correctly, the water test comes with the package. I would also assume that Troullier-Martins pseudopotentials are used; they normally require high cutoff energies. So the numerical problem is probably larger than the VASP/CO2 one. Hence the good scaling...?!

Cheers

Alex

martonak
Newbie
Posts: 11
Joined: Tue Mar 18, 2008 2:31 pm
License Nr.: 788

scaling

#6 Post by martonak » Mon Jun 22, 2009 9:10 am

My question was meant neither as science nor as a comparison of VASP and CPMD. Even though I have read various reports on the net claiming that VASP scales to 64 and even more cores for systems containing even fewer than 100 atoms, I just can't get VASP to scale with reasonable efficiency beyond 16 cores for such systems. If the argument about 32 cores being a scaling limit for a system consisting of 32 molecules were valid, CPMD would not be able to scale to 64 cores for 32 water molecules either. Of course, CPMD uses MT pseudopotentials with a much higher cutoff: the water benchmark uses a cutoff of 100 Ry. On the other hand, 32 CO2 molecules have 512 electrons, while 32 H2O molecules have only 256.

The question was simply meant to find out, for a typical problem of about 100 atoms with a standard cutoff appropriate for VASP, running on a cluster with recent CPUs and an Infiniband network, up to what number of cores one can typically expect scaling with reasonable efficiency (Gamma-point-only version and ALGO=fast). 16? 32? 64? 128? ...?

alex
Hero Member
Posts: 568
Joined: Tue Nov 16, 2004 2:21 pm
License Nr.: 5-67
Location: Germany

scaling

#7 Post by alex » Mon Jun 22, 2009 11:15 am

Hi,

regarding the scaling reports on the net: do they provide the parameters, like the k-point mesh, cutoff, etc.?

Other question: are you sure your MPI implementation is actually using the Infiniband interconnect?
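One quick way to check (a sketch, assuming OpenMPI; other MPI implementations have their own switches):

Code:

    # check that the Infiniband port is up at all
    ibstat            # or: ibv_devinfo

    # with OpenMPI, pin the transport explicitly and compare wall times
    mpirun --mca btl openib,self,sm -np 16 ./vasp   # over Infiniband
    mpirun --mca btl tcp,self -np 16 ./vasp         # over plain ethernet
    # if both runs take about the same time, the job was never using IB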

cheers

alex

pkroll
Newbie
Posts: 28
Joined: Tue Jun 14, 2005 2:48 pm

scaling

#8 Post by pkroll » Tue Jun 23, 2009 1:56 am

In my experience on various platforms (many...) and many models with 200-1000 atoms, the scaling of VASP is OK up to 32 processors (still at 70-85% efficiency) and drops significantly thereafter.
I had good scaling on POWER6 and Sun SPARC (both with Myrinet). I typically get worse scaling with Intel quad-cores...

On Intel quad-cores the processor itself can become the bottleneck if the communication between the cores gets too high.

A simple way to check this (i.e. whether the processor-internal communication of the Intel quad-cores is the bottleneck in a computation) is to run the job over the Infiniband network with 1 task per processor (not per core!). Then increase it to 2 tasks per processor, and finally to 4. On my cluster (Altix SE with Infiniband), running through the network on 8 different procs on 4 boards is faster than running on one board with its two quad-core procs.
[This may call for getting some pricey quad-cores with high interconnect speed.]
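A concrete way to run that experiment, as a sketch (assuming OpenMPI's -npernode option and the dual quad-core blades from the original post; the core count stays fixed at 16 while the tasks spread over more or fewer blades):

Code:

    # 1 task per processor (2 per blade) -> 16 tasks need 8 blades
    mpirun -npernode 2 -np 16 ./vasp
    # 2 tasks per processor (4 per blade) -> 4 blades
    mpirun -npernode 4 -np 16 ./vasp
    # fully packed (8 per blade) -> 2 blades
    mpirun -npernode 8 -np 16 ./vasp
    # if the sparser runs are faster, the bottleneck is memory bandwidth
    # inside the processor, not the network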

So, your network may be fine and VASP may have reasonable scaling (could be better though ^^); it's not unlikely that your Intel procs (and mine as well) are the actual bottleneck...
