makefile.include for intel compilers uses multithreaded mkl

Questions regarding compilation of VASP on various platforms.

Moderators: Global Moderator, Moderator

fish
Newbie
Posts: 12
Joined: Tue Jun 14, 2005 1:13 pm
License Nr.: 198
Location: Argonne National Lab

makefile.include for intel compilers uses multithreaded mkl

#1 Post by fish » Thu Jun 09, 2016 5:06 pm

To everyone building VASP with the Intel compilers and MKL libraries,

The makefile.include provided with the vasp.5.4.1 distribution uses the "-mkl" option for the linker (the FCL variable). Note that under v16 of the Intel compilers this links against the multithreaded MKL libraries. This can lead to poor performance if the number of OpenMP threads generated by multithreaded MKL is greater than the number of cores on your node.

I would recommend using the -mkl=sequential option, or dropping the -mkl option entirely and explicitly listing the required MKL libraries in the BLAS variable of makefile.include.
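For reference, a minimal sketch of the explicit-linking alternative (the library names are the standard sequential LP64 MKL set; paths and exact names may differ on your installation, so check them against your MKL version):

```make
# Link the sequential MKL libraries explicitly instead of relying on plain -mkl
MKL_PATH = $(MKLROOT)/lib/intel64
BLAS     = -L$(MKL_PATH) -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
LAPACK   =
FCL      = mpiifort -lstdc++
```

With this variant the FCL line carries no -mkl flag at all, so the threading behaviour is fixed at link time regardless of compiler defaults.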

I have attached a makefile.include with this change as an example.

John J. Low
Argonne National Laboratory

tgomez
Newbie
Posts: 2
Joined: Mon Dec 10, 2012 7:27 pm
License Nr.: 5-1439
Location: Santiago-Chile

Re: makefile.include for intel compilers uses multithreaded mkl

#2 Post by tgomez » Mon Mar 27, 2017 12:50 am

Hi, could you upload the sample file?
I am having problems with the compilation, and your input would help me a lot.
Thanks

cacarden
Newbie
Posts: 3
Joined: Wed Oct 04, 2006 2:13 pm
Location: Núcleo Milenio de Mecánica Cuántica Aplicada y Quimica Computacional

Re: makefile.include for intel compilers uses multithreaded mkl

#3 Post by cacarden » Fri Sep 22, 2017 7:27 pm

Dear John:
It seems that you did not attach the makefile.include. Would you please share it with us?
Thanks,
Carlos

fish
Newbie
Posts: 12
Joined: Tue Jun 14, 2005 1:13 pm
License Nr.: 198
Location: Argonne National Lab

Re: makefile.include for intel compilers uses multithreaded mkl

#4 Post by fish » Thu May 10, 2018 3:39 pm

Since I could not add an attachment, I have cut and pasted an example makefile.include below. Sorry for the delay.
# Precompiler options
CPP_OPTIONS= -DHOST=\"Bebop\ BDW\ IFC17.0.4\ IMPI17.0.3\"\
             -DMPI -DMPI_BLOCK=8000 \
             -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=4000 \
             -Davoidalloc \
             -Duse_bse_te \
             -Dtbdyn \
             -Duse_shmem

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = mpiifort
FCL = mpiifort -mkl=sequential -lstdc++

FREE = -free -names lowercase

FFLAGS = -assume byterecl -w
OFLAG = -O2 -fma -xCORE-AVX2
OFLAG_IN = $(OFLAG)
DEBUG = -O0

MKL_PATH = $(MKLROOT)/lib/intel64
BLAS =
LAPACK =
BLACS = -lmkl_blacs_intelmpi_lp64
SCALAPACK = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)

OBJECTS = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o

INCS =-I$(MKLROOT)/include/fftw

LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS)


OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS = icpc

LIBS += parser
LLIBS += -Lparser -lparser -lstdc++

# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin

#================================================
# GPU Stuff

CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK

OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o

CC = icc
CXX = icpc
CFLAGS = -fPIC -DADD_ -Wall -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS

CUDA_ROOT ?= /usr/local/cuda/
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas

GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
                -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                -gencode=arch=compute_60,code=\"sm_60,compute_60\"

MPI_INC = $(I_MPI_ROOT)/include64/

mersad_mostaghimi
Newbie
Posts: 1
Joined: Thu Jan 30, 2020 1:10 pm

Re: makefile.include for intel compilers uses multithreaded mkl

#5 Post by mersad_mostaghimi » Sun Feb 16, 2020 4:19 pm

Dear all,
We have an HPC with several nodes; each node has four CPUs with the specification below:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 288
On-line CPU(s) list: 0-287
Thread(s) per core: 4
Core(s) per socket: 72
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7290 @ 1.50GHz
Stepping: 1
CPU MHz: 1501.000
CPU max MHz: 1501.0000
CPU min MHz: 1000.0000
BogoMIPS: 2999.94
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-287
NUMA node1 CPU(s):
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait epb ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts spec_ctrl

We used the makefile.include below:
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
             -DMPI -DMPI_BLOCK=64000 \
             -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=32000 \
             -Davoidalloc \
             -Duse_bse_te \
             -Dtbdyn \
             -Duse_shmem \
             -march=knl

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = mpiifort -march=knl
FCL = mpiifort -mkl=cluster -lstdc++ -march=knl

FREE = -free -names lowercase

FFLAGS = -FR -names lowercase -assume byterecl -march=knl
OFLAG = -O3 -xhost -march=knl
OFLAG_IN = $(OFLAG)
DEBUG = -O0

MKL_PATH = $(MKLROOT)/lib/intel64
BLAS = -mkl=cluster
LAPACK =
BLACS = -lmkl_blacs_intelmpi_lp64
SCALAPACK = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)

OBJECTS = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o

INCS = -I$(MKLROOT)/include/fftw -I/compilers_and_libraries/linux/mpi/intel64/include

LLIBS = -L/compilers_and_libraries/linux/mpi/intel64/lib $(SCALAPACK) $(LAPACK) $(BLAS)


OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc -march=knl
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS = icpc -march=knl

LIBS += parser
LLIBS += -Lparser -lparser -lstdc++

# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin

#================================================
# GPU Stuff

CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK

OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o

CC = icc
CXX = icpc
CFLAGS = -fPIC -DADD_ -Wall -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS

CUDA_ROOT ?= /usr/local/cuda/
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas

GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
                -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                -gencode=arch=compute_60,code=\"sm_60,compute_60\"

MPI_INC = $(I_MPI_ROOT)/include64/

We ran a speed-up test on a sample calculation. On another HPC with the specification below (2 sockets per node, i.e. 24 cores per node), the same sample runs 4 times faster per iteration.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 2876.250
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.75
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

My question is: is this a -mkl=sequential / threading problem? And what is the best, optimized way to compile for this kind of system?
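One runtime-side check in the spirit of this thread: on KNL the 288 logical CPUs are only 72 physical cores (4 hardware threads each), so oversubscription is easy even with a sequential MKL link. A hypothetical job-script fragment (the rank count and binary name are assumptions for illustration, not a tested recipe for this machine):

```shell
# Keep MKL and OpenMP from oversubscribing the 4 hardware threads per core
export OMP_NUM_THREADS=1   # VASP 5.x built as above is MPI-only
export MKL_NUM_THREADS=1   # force sequential MKL behaviour at runtime
export I_MPI_PIN=1         # let Intel MPI pin ranks to cores
mpirun -np 64 ./vasp_std   # ranks <= 72 physical cores, not 288 hw threads
```

If performance is still far behind the Xeon nodes with these settings, the gap is more likely the KNL core's low per-core clock and throughput than a threading misconfiguration.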
