Dear,
I figured out what the issue was. Thanks to your help, and some further reading/testing, I found that the problem was that my VASP build never uses more than one physical CPU core per task, so the extra CPUs allocated to each task simply sat idle.
Basically, when I was requesting 48 tasks with 4 CPUs per task, each task was only driving 1 of its 4 allocated CPUs, so I was using only 25% of my nodes (48 of the 192 physical cores), or 12.5% if we count all 384 logical CPUs per node (192 physical + 192 hyperthreaded).
This was confirmed when I checked the load on the individual CPUs: only 1 out of every 4 consecutive CPUs among the first 192 (CPUs 0-191, the physical cores) was busy, while none of the hyperthreaded CPUs (CPUs 192-383) were used at all. Note that here I am describing the utilization of each individual node.
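For reference, this is roughly what I believe my misconfigured request amounted to (the exact #SBATCH spelling is from memory, and vasp_std stands in for the actual binary):

    #SBATCH --nodes=8
    #SBATCH --ntasks-per-node=48
    #SBATCH --cpus-per-task=4
    srun --hint=nomultithread vasp_std

Each MPI rank gets 4 CPUs, but a rank only ever runs on one core, so 3 of every 4 allocated cores idle: 48/192 = 25% per node.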
As such, I suspect that when I used 192 tasks, the efficiency got worse due to over-parallelization. Essentially, when I was requesting 8 nodes earlier (tests 0-2), I was realistically using only 25% of each node, i.e. about 2 nodes' worth of cores. But once I fixed the issue, while still requesting 8 nodes, the workload may simply have been spread too thin.
I guess my next step will be to stick to 192 tasks per node but lower the number of requested nodes (see the sketch below). Additionally, I suppose the value of NCORE, which I previously found to be best at 24, needs to be revisited? Doesn't this mean that NCORE was effectively acting as 24/4 = 6?
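Concretely, something along these lines is what I have in mind for the next round (2 nodes is just an example starting point, and the NCORE value would be re-scanned):

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=192
    #SBATCH --cpus-per-task=1
    srun --hint=nomultithread vasp_std

    # INCAR
    NCORE = 24   ! previous optimum; to be re-tested now that all 192 ranks per node are real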
I would like to hear your thoughts on what I found; does it make sense? Additionally, is hyperthreading worth testing? I can ask the IT team to install a version of VASP that supports multithreading (OpenMP), although I suspect the multithreading may not have worked so far because I added --hint=nomultithread to my srun command.
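In case it helps the discussion, this is the kind of variation I would try for the hyperthreading test, assuming an OpenMP-enabled build (all values illustrative):

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=192
    #SBATCH --cpus-per-task=2
    export OMP_NUM_THREADS=2
    srun --hint=multithread vasp_std

i.e. one MPI rank per physical core, with each rank's second OpenMP thread placed on that core's hyperthread sibling.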