Hardware Configuration Advice for DFT and GW Calculations with VASP.

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Hardware Configuration Advice for DFT and GW Calculations with VASP.

#1 Post by hszhao.cn@gmail.com » Mon Apr 08, 2024 6:51 am

Hello everyone,

I'm in the process of setting up a new computing environment specifically optimized for running VASP for both DFT and GW calculations. Considering the computational intensity and memory demands of these calculations, I am evaluating different hardware configurations and seeking advice from the community to make an informed decision.

I have the option to set up either two independent single-socket systems each powered by an EPYC 9654 processor or one dual-socket system with two EPYC 9654 processors. The dual-socket system would theoretically offer a tightly integrated environment with shared memory benefits, crucial for large-scale or memory-intensive tasks. On the other hand, two separate single-socket systems could provide greater flexibility, especially for running multiple independent tasks with lower individual resource requirements.

Given the high core count (192 cores in total for the dual-socket setup) and the nature of DFT and GW calculations, I am also pondering the appropriate amount of RAM. While I understand that memory needs can vary widely based on the specifics of the calculations, I am leaning towards configuring 512GB to 2TB of RAM, especially to accommodate the demands of GW calculations and other complex computations.

Here are my specific questions for the community:

1. Between a dual-socket EPYC 9654 system and two single-socket EPYC 9654 systems, which configuration would you recommend for the described use case? Please consider factors like parallel computing efficiency, cost-effectiveness, management simplicity, and scalability.
2. For DFT and particularly GW calculations using VASP, what would be an ideal memory configuration, considering the balance between performance and budget? Would 512GB be a starting point, or is it advisable to go straight for 1TB or more?
3. Are there any additional considerations or tips you would recommend for setting up a hardware environment optimized for VASP, especially for handling large-scale and memory-intensive calculations?

I appreciate any insights, experiences, or advice you can share. Your expertise will be invaluable in helping me configure a system that not only meets our current needs but is also scalable and cost-effective for future demands.

Thank you in advance for your help!

Best regards,
Zhao

michael_wolloch
Global Moderator
Posts: 58
Joined: Tue Oct 17, 2023 10:17 am

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#2 Post by michael_wolloch » Mon Apr 08, 2024 3:41 pm

Dear Zhao,

I am happy to make a couple of points to get a discussion going, and I hope some users will share their experiences as well.

1) A very important issue that you do not mention is the interconnect between the machines. If you have an InfiniBand connection (or 100GbE, Omni-Path, or another very fast interconnect) between the two single-socket machines, the situation is different than if those machines are "only" connected through Gigabit Ethernet. I will assume the latter for the rest of the post.

2) A second important issue is memory bandwidth per core. This is unfortunately a bottleneck for all modern chips with very high core counts, independent of vendor and architecture. How many sockets per node are used is usually not relevant, since each CPU brings the same number of memory channels. While testing on a dual-socket node with two 64c EPYC Milan (3rd gen) CPUs, we noticed that DFT stopped scaling well at 64 MPI ranks distributed over both sockets. 4th gen has 12 instead of 8 memory channels, and they run at higher speeds (up to 4800 MT/s compared to 3200 MT/s), so the maximal memory bandwidth per socket should increase from ~205 GB/s to up to ~460 GB/s (these numbers are from the AMD website, not our testing, and are valid only with the fastest memory modules!). For the 96c CPUs you mention, this would be 4.8 GB/s per core, compared to 3.2 GB/s per core for a 64c EPYC Milan. Thus, for a machine specifically for VASP, you might not want to go with the 96-core chip. It should be better to use a smaller core count per socket (e.g. 64) and 2 sockets on 4th gen to get decent scaling up to the maximum number of cores available on the node. That way you avoid having CPU cores that you cannot utilize due to memory bandwidth limitations. (Please note that we have not run any tests on 4th gen EPYC!)
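Spelled out, those per-core numbers follow from channels × transfer rate × 8 bytes per transfer:

Code: Select all

# 4th gen EPYC (DDR5-4800, 12 channels per socket):
#   12 channels * 4800 MT/s * 8 B = 460.8 GB/s per socket
#   460.8 GB/s / 96 cores ~ 4.8 GB/s per core
# 3rd gen EPYC Milan (DDR4-3200, 8 channels per socket):
#   8 channels * 3200 MT/s * 8 B = 204.8 GB/s per socket
#   204.8 GB/s / 64 cores = 3.2 GB/s per core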

3) Regarding memory size: buy as much as you can afford, while making sure to utilize all available memory channels (see point 2). More memory gives you greater flexibility for very large systems and for faster RPA/low-scaling GW calculations, where memory requirements can be significant. Please read the sections on Caveats in the practical guide for GW calculations on the wiki. Also note that NTAUPAR is selected automatically based on MAXMEM, so while your GW calculations might run with 512GB of memory, they could run faster with more. For DFT, the amount of memory is less important once you pass a certain threshold, which should already be reached at 512GB for most calculations.
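To illustrate the MAXMEM/NTAUPAR interplay, a low-scaling GW INCAR fragment might contain something like the following (the tags are real, but the values are placeholders you must adapt to your system):

Code: Select all

ALGO   = G0W0R    ! low-scaling G0W0 on the imaginary time/frequency axis
NOMEGA = 12       ! number of imaginary frequency/time points
MAXMEM = 4000     ! available memory per MPI rank in MB
# if NTAUPAR is not set, VASP selects it based on MAXMEM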

4) Regarding cost-effectiveness and management simplicity, I cannot make a qualified comment. That depends very much on your local vendors and your experience with similar systems, I suppose.

Please understand that the VASP company cannot recommend any specific hardware and can only assist with questions regarding general hardware requirements. I would invite anyone in the community to participate in this discussion, especially sysadmins and HPC specialists!

All the best, Michael

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#3 Post by hszhao.cn@gmail.com » Tue Apr 09, 2024 8:45 am

Dear Michael,

Thank you very much for your valuable experience sharing and comprehensive, thorough, and insightful analysis.

Regards,
Zhao

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#4 Post by hszhao.cn@gmail.com » Tue Apr 09, 2024 2:08 pm

Here is the combined performance score of the EPYC 9554 vs. the EPYC 9654. It seems that the 9554 has a higher price/performance ratio.

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#5 Post by hszhao.cn@gmail.com » Tue Apr 09, 2024 2:11 pm

Another question: if I run multiple VASP jobs for benchmarking, would the results be more objective and accurate if the jobs were run through a queue scheduling system like Slurm?

michael_wolloch
Global Moderator
Posts: 58
Joined: Tue Oct 17, 2023 10:17 am

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#6 Post by michael_wolloch » Tue Apr 09, 2024 2:34 pm

Dear Zhao,

I am not sure if the link above and the combined performance score have any relevance for running VASP.

Correct, reliable, and reproducible benchmarking is, unfortunately, far from easy. However, as long as you can be sure that the environment you are running your benchmark on is the same each time, I see no problem using Slurm or another scheduler.

A couple of small, and not exhaustive, notes on benchmarking:

1) Try to benchmark a system very close to the one you want to run in production using a static calculation (I assume you are benchmarking electronic minimization).
2) Limit the number of electronic steps with NELM and set EDIFF such that convergence is NOT reached within that number of steps (see the INCAR sketch below).
3) Think carefully about where you put your MPI ranks and OpenMP threads so that you utilize the resources efficiently. Think about sockets, NUMA nodes, shared L3 cache, etc. This is especially important if you do not fully load your node (e.g. running 24 MPI ranks on a node with two 64c CPUs).
4) Pin your processes to the cores they start on, otherwise they might jump around.
5) Control CPU clock speed when investigating scalability. Your CPU might boost higher when you use fewer cores, which is probably not what you want to benchmark (see the note below).
6) Make sure to print out all necessary environment variables that could affect the run.
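For point 2, a minimal INCAR sketch could look like this (the values are illustrative, not recommendations):

Code: Select all

SYSTEM = benchmark
NSW    = 0        ! static calculation, no ionic steps
NELM   = 10       ! fixed number of electronic steps
EDIFF  = 1E-10    ! tight enough that convergence is NOT reached within NELM
LWAVE  = .FALSE.  ! do not write WAVECAR
LCHARG = .FALSE.  ! do not write CHGCAR
For point 5, on Linux the clock can usually be fixed with the cpupower tool, e.g. cpupower frequency-set -d 2.4GHz -u 2.4GHz (requires root; the available limits depend on your CPU).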

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#7 Post by hszhao.cn@gmail.com » Wed Apr 10, 2024 2:21 am

4) Pin your processes to the cores they start on, otherwise they might jump around.
Thank you, Michael, for emphasizing the importance of process pinning and of controlling CPU clock speeds in scalability investigations. Below are some further notes on these topics.

In my case, I use Intel MPI without a queue management system like Slurm installed on the testing machine, so I think the following method should do the trick:

As you rightly pointed out, pinning processes to specific cores is crucial to prevent them from jumping around, which can significantly hurt performance through increased context switching and cache misses. For VASP runs with Intel MPI, this can be managed with environment variables passed via the -genv option of mpirun. To pin each process to a specific core, I can use:

I_MPI_PIN=1 to enable process pinning.
I_MPI_PIN_DOMAIN=core to specify pinning to cores, with alternatives like socket or numa for coarser domains.

Example: run VASP with 4 processes, each pinned to its own core.

Code: Select all

module load vasp
mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core -np 4 vasp_std
For more nuanced control, such as leaving cores free or targeting specific cores for performance reasons, Intel MPI offers

Code: Select all

I_MPI_PIN_PROCESSOR_LIST
to specify the exact cores for pinning.

Example: Pinning 4 processes to the first four cores.

Code: Select all

module load vasp
mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core -genv I_MPI_PIN_PROCESSOR_LIST 0,1,2,3 -np 4 vasp_std
When multiple nodes are involved, it is still more convenient to use a queue system for such tests.
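For reference, a minimal Slurm batch script for such a test might look like this (the module name and resources are assumptions for my setup):

Code: Select all

#!/bin/bash
#SBATCH --job-name=vasp-bench
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1

module load vasp                  # assumes a vasp module is available
srun --cpu-bind=cores vasp_std    # let Slurm handle the core binding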

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#8 Post by hszhao.cn@gmail.com » Wed Apr 10, 2024 11:41 am

mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core -np 4 vasp_std
Based on my testing, the above command roughly halves running efficiency, i.e. it doubles the time required. The following setting does not have this problem:

Code: Select all

mpirun -np <number_of_processes> -bind-to core:<list_of_cores> ./your_mpi_application
In my case, this corresponds to the following setting:

Code: Select all

mpirun -bind-to core -np 4 vasp_std

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#9 Post by hszhao.cn@gmail.com » Wed Apr 10, 2024 1:13 pm

2) A second important issue is memory bandwidth per core. […]
It seems that you're correct. Based on the VASP test example here, the performance comparison between dual EPYC 9654 and dual EPYC 9554 systems, both using 24 × 32 GB Samsung 4800 MT/s memory modules, shows that the dual 9554 has better scalability and cost-effectiveness.

michael_wolloch
Global Moderator
Posts: 58
Joined: Tue Oct 17, 2023 10:17 am

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#10 Post by michael_wolloch » Wed Apr 10, 2024 2:23 pm

Note that the link above does not show a comparison between the two EPYC chips. There are some execution times shown for different builds on mainstream consumer hardware, but no real benchmarks.

By the VASP license agreement, it is prohibited to publish benchmarks without prior approval from the VASP company. You are free to run benchmarks for your personal use, of course, but as stated previously this is not trivial.
Based on my testing, the above command roughly halves running efficiency, i.e. it doubles the time required. The following setting does not have this problem
Indeed, just pinning processes can also degrade performance. E.g., if you run with 16 MPI ranks and put them all on a single CPU in a dual-socket system with two 16-core CPUs, you might expect increased performance due to avoiding socket-to-socket communication. Instead, you will leave half of your available memory bandwidth on the table, and in most instances it will be much faster to put 8 ranks on each CPU. If you consider NUMA domains and shared cache as well, the situation gets even more complex.
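With Open MPI, for example, a mapping like the following should spread 16 ranks evenly over both sockets and report where they end up (an untested sketch, adapt it to your setup):

Code: Select all

# 8 ranks per socket on a dual-socket node, each bound to a core
mpirun --map-by ppr:8:socket --bind-to core --report-bindings -np 16 vasp_std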

In short, it is not trivial to place your processes on the hardware, especially on large CPUs and when you do not fully populate all cores. Multi-node setups are, unfortunately, even more complicated.

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#11 Post by hszhao.cn@gmail.com » Wed Apr 10, 2024 11:02 pm

hszhao.cn@gmail.com wrote: Wed Apr 10, 2024 11:41 am
mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core -np 4 vasp_std
Based on my testing, the above command roughly halves running efficiency. The following setting does not have this problem:

Code: Select all

mpirun -bind-to core -np 4 vasp_std
What confuses me is: why does -bind-to core not lead to a significant reduction in computational efficiency compared to -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core?

hszhao.cn@gmail.com
Full Member
Posts: 139
Joined: Tue Oct 13, 2020 11:32 pm

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#12 Post by hszhao.cn@gmail.com » Wed Apr 10, 2024 11:22 pm

Another question: for programs like VASP, when is it beneficial to use the virtual cores provided by hyper-threading, and when does it actually reduce efficiency?

michael_wolloch
Global Moderator
Posts: 58
Joined: Tue Oct 17, 2023 10:17 am

Re: Hardware Configuration Advice for DFT and GW Calculations with VASP.

#13 Post by michael_wolloch » Thu Apr 11, 2024 7:03 am

Dear Zhao,

this thread has drifted a bit far from the original question for my taste. If you want to discuss benchmarking and the intricacies of process pinning, I would suggest starting a new topic in the "users for users" section.
What confuses me is: why does -bind-to core not lead to a significant reduction in computational efficiency compared to -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core?
You are mixing Open MPI and Intel MPI command-line arguments here. Without going into detail, it is important to know where the processes end up. Use -genv I_MPI_DEBUG 4 for Intel MPI and --report-bindings for Open MPI to check.
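For example (a sketch; adapt the rank count to your run):

Code: Select all

# Intel MPI: print the pinning table at startup
mpirun -genv I_MPI_DEBUG 4 -np 4 vasp_std

# Open MPI: show the core binding of each rank
mpirun --report-bindings -np 4 vasp_std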
