Dear,
I figured out what the issue was. Thanks to your help, and some further reading/testing, I found that the problem was that my VASP build never uses more than one physical CPU core per task, so the extra CPUs allocated to each task simply sat idle.
Basically, when I was requesting 48 tasks with 4 CPUs per task, each task was only driving 1 of its 4 allocated CPUs, so I was using only 25% of my nodes (48 of the 192 physical cores), or 12.5% if we count all 384 logical CPUs per node (192 physical + 192 hyperthreaded).
This was confirmed when I checked the load on the individual CPUs: only 1 out of every 4 consecutive CPUs among the first 192 (CPUs 0-191, the physical cores) was busy, while none of the hyperthreaded CPUs (CPUs 192-383) were used at all. Note that here I am describing the utilization of each individual node.
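For reference, this is roughly what I believe my misconfigured request amounted to (the exact #SBATCH spelling is from memory, and vasp_std stands in for the actual binary):

    #SBATCH --nodes=8
    #SBATCH --ntasks-per-node=48
    #SBATCH --cpus-per-task=4
    srun --hint=nomultithread vasp_std

Each MPI rank gets 4 CPUs, but a rank only ever runs on one core, so 3 of every 4 allocated cores idle: 48/192 = 25% per node.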
As such, I suspect that when I used 192 tasks, the efficiency got worse due to over-parallelization. Essentially, when I was requesting 8 nodes earlier (tests 0-2), I was realistically using only 25% of each node, i.e. about 2 nodes' worth of cores. But once I fixed the issue, while still requesting 8 nodes, the workload may simply have been spread too thin.
I guess my next step will be to stick to 192 tasks per node but lower the number of requested nodes (see the sketch below). Additionally, I suppose the value of NCORE, which I previously found to be best at 24, needs to be revisited? Doesn't this mean that NCORE was effectively acting as 24/4 = 6?
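Concretely, something along these lines is what I have in mind for the next round (2 nodes is just an example starting point, and the NCORE value would be re-scanned):

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=192
    #SBATCH --cpus-per-task=1
    srun --hint=nomultithread vasp_std

    # INCAR
    NCORE = 24   ! previous optimum; to be re-tested now that all 192 ranks per node are real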
I would like to hear your thoughts on what I found; does it make sense? Additionally, is hyperthreading worth testing? I can ask the IT team to install a version of VASP that supports multithreading (OpenMP), although I suspect the multithreading may not have worked so far because I added --hint=nomultithread to my srun command.
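In case it helps the discussion, this is the kind of variation I would try for the hyperthreading test, assuming an OpenMP-enabled build (all values illustrative):

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=192
    #SBATCH --cpus-per-task=2
    export OMP_NUM_THREADS=2
    srun --hint=multithread vasp_std

i.e. one MPI rank per physical core, with each rank's second OpenMP thread placed on that core's hyperthread sibling.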