This document describes building CP2K with several (optional) libraries, which may be beneficial in terms of functionality and performance.
- Intel Math Kernel Library (also available via the Linux distribution's package manager), which acts as:
  - LAPACK/BLAS and ScaLAPACK library
  - FFTW library
- LIBXSMM (replaces LIBSMM)
- LIBINT (version 1.1.5 or 1.1.6)
- LIBXC (version 4.3 or any 4.x)
- ELPA (version 2017.11.001)
The ELPA library may improve performance (currently, it must be enabled in each input file even if CP2K was built with ELPA). There is also the option to auto-tune additional routines in CP2K (integrate/collocate) and to collect the generated code into an archive referred to as LIBGRID.
For high performance, LIBXSMM (see also https://libxsmm.readthedocs.io) has been incorporated since CP2K 3.0. When CP2K is built with LIBXSMM, CP2K's "libsmm" library is not used and hence libsmm does not need to be built and linked with CP2K.
There are no configuration wrapper scripts provided for CP2K since a configure-step is usually not required, and the application can be built right away. CP2K's toolchain (tools/toolchain) is out of scope in this document (it builds the entire tool chain from source including the compiler).
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/configure-get.sh
chmod +x configure-get.sh
./configure-get.sh cp2k

wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/config/cp2k/info.sh
chmod +x info.sh
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/config/cp2k/plan.sh
chmod +x plan.sh
This step-by-step guide aims to build an MPI/OpenMP-hybrid version of the official release of CP2K by using the GNU Compiler Collection, Intel MPI, Intel MKL, LIBXSMM, ELPA, LIBXC, and LIBINT. Internet connectivity is assumed on the build-system. Please note that such limitations can be worked around or avoided with additional steps. However, this simple step-by-step guide aims to make some reasonable assumptions.
As this step-by-step guide uses GNU Fortran (version 8.3 is recommended), only Intel MKL (2019.x recommended) and Intel MPI (2018.x recommended) need to be sourced (sourcing all Intel development tools does no harm, of course).
NOTE: GNU GCC version 7.x or 8.x is highly recommended (CP2K built with GCC 9.1 does not pass regression tests).
source /opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh
source /opt/intel/compilers_and_libraries_2019.3.199/linux/mkl/bin/mklvars.sh intel64
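The compiler recommendation can be checked before starting a build. The following is a minimal sketch, assuming a POSIX-like shell; the helper name `gcc_version_ok` is hypothetical and not part of CP2K or XCONFIGURE. It only encodes the recommendation stated in this guide (GCC 7.x or 8.x).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given GCC version string
# belongs to the 7.x or 8.x series recommended in this guide.
gcc_version_ok() {
  local major=${1%%.*}   # "8.3.0" -> "8"; a bare "8" stays "8"
  case "$major" in
    7|8) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage: warn if the installed GNU Fortran is outside the recommended range.
if command -v gfortran >/dev/null 2>&1; then
  gcc_version_ok "$(gfortran -dumpversion)" \
    || echo "warning: GNU Fortran 7.x or 8.x is recommended" >&2
fi
```

Only the major version is inspected, since `-dumpversion` may print either the major version or the full version depending on how GCC was configured.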
How to install Intel Math Kernel Library and Intel MPI from a public repository depends on the Linux distribution's package manager. For newer distributions, both libraries are likely part of the official repositories. Otherwise, a suitable repository must be added to the package manager (not the subject of this document). For example, installing with yum looks like:
sudo yum install intel-mkl-2019.4-070.x86_64
sudo yum install intel-mpi-2018.3-051.x86_64
Please note, the ARCH file (used later/below to build CP2K) attempts to find Intel MKL even if the MKLROOT environment variable is not present. The MPI library is implicitly known when using compiler wrapper scripts (no need for I_MPI_ROOT). Installing the proper software stack and drivers for an HPC fabric to be used by MPI is out of scope in this document. If the check below fails, the MPI's bin-folder must be added to the path.
$ mpif90 --version
GNU Fortran (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
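The same check can be scripted so that a missing compiler wrapper is reported early. This is a minimal sketch, assuming a POSIX-like shell; the function name `check_tool` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a program can be found via PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $(command -v "$1")"
  else
    echo "missing: $1 (add the MPI bin-folder to PATH)" >&2
    return 1
  fi
}

# Usage: verify the MPI Fortran wrapper before building anything.
check_tool mpif90 || true
```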
cd $HOME
wget https://elpa.mpcdf.mpg.de/html/Releases/2017.11.001/elpa-2017.11.001.tar.gz
tar xvf elpa-2017.11.001.tar.gz
cd elpa-2017.11.001
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/configure-get.sh
chmod +x configure-get.sh
./configure-get.sh elpa
./configure-elpa-skx-gnu-omp.sh
make -j
make install
make clean
The second step builds LIBINT (1.1.6 recommended, newer versions cannot be used). This library does not compile on an architecture with fewer CPU features than the target (e.g., configure-libint-skx-gnu.sh implies building on a "Skylake" or "Cascadelake" server).
cd $HOME
wget --no-check-certificate https://github.com/evaleev/libint/archive/release-1-1-6.tar.gz
tar xvf release-1-1-6.tar.gz
cd libint-release-1-1-6
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/configure-get.sh
chmod +x configure-get.sh
./configure-get.sh libint
./configure-libint-skx-gnu.sh
make -j
make install
make distclean
The third step builds LIBXC (any version of the 4.x series can be used).
cd $HOME
wget --content-disposition https://gitlab.com/libxc/libxc/-/archive/4.3.4/libxc-4.3.4.tar.bz2
tar xvf libxc-4.3.4.tar.bz2
cd libxc-4.3.4
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/configure-get.sh
chmod +x configure-get.sh
./configure-get.sh libxc
./configure-libxc-skx-gnu.sh
make -j
make install
make distclean
The fourth step makes LIBXSMM available, which is compiled as part of the next step.
cd $HOME
wget --no-check-certificate https://github.com/hfp/libxsmm/archive/1.14.tar.gz
tar xvf 1.14.tar.gz
This last step builds the PSMP-variant of CP2K. Please re-download the ARCH-files from GitHub as mentioned below (avoid reusing older/outdated files). If Intel MKL is not found, the key MKLROOT=/path/to/mkl can be added to Make's command line. To select a different MPI implementation one can try e.g., [...] (patch -p0 src/mpiwrap/message_passing.F mpi-wrapper.diff).
cd $HOME
wget https://github.com/cp2k/cp2k/archive/v6.1.0.tar.gz
tar xvf v6.1.0.tar.gz
cd cp2k-6.1.0
wget --no-check-certificate https://github.com/hfp/xconfigure/raw/master/configure-get.sh
chmod +x configure-get.sh
./configure-get.sh cp2k
patch -p0 src/pw/fft/fftw3_lib.F intel-mkl.diff
rm -rf exe lib obj
cd makefiles
make ARCH=Linux-x86-64-intelx VERSION=psmp GNU=1 AVX=3 MIC=0 \
  LIBINTROOT=$HOME/libint/gnu-skx \
  LIBXCROOT=$HOME/libxc/gnu-skx \
  ELPAROOT=$HOME/elpa/gnu-skx-omp -j
If no LIBXSMMROOT was given (auto detection), the initial output of the build looks like:
Discovering programs ...
================================================================================
Automatically enabled LIBXSMM (LIBXSMMROOT=/path/to/libxsmm)
================================================================================
LIBXSMM release-1.14 (Linux)
--------------------------------------------------------------------------------
Once the build has completed, the CP2K executable should be ready:
$ LIBXSMM_VERBOSE=1 exe/Linux-x86-64-intelx/cp2k.psmp
[...]
LIBXSMM_VERSION: release-1.14
LIBXSMM_TARGET: clx
Have a look at Running CP2K to learn more about pinning MPI processes (and OpenMP threads), and to try a first workload.
- Intel Compiler 2017 (u0, u1, u2, u3), and the initial release of MKL 2017 (u0)
  - source /opt/intel/compilers_and_libraries_2017.[u0-u3]/linux/bin/compilervars.sh intel64
    source /opt/intel/compilers_and_libraries_2017.0.098/linux/mkl/bin/mklvars.sh intel64
- Intel Compiler 2017 Update 4, and any later update of the 2017 suite (u4, u5, u6, u7)
  - source /opt/intel/compilers_and_libraries_2017.[u4-u7]/linux/bin/compilervars.sh intel64
- Intel Compiler 2018 (u3, u4, u5): only with CP2K/development (not with CP2K 6.1 or earlier)
  - source /opt/intel/compilers_and_libraries_2018.3.222/linux/bin/compilervars.sh intel64
  - source /opt/intel/compilers_and_libraries_2018.5.274/linux/bin/compilervars.sh intel64
- Intel Compiler 2019 (u1, u2, u3): failure at runtime
- Intel MPI; usually any version is fine: Intel MPI 2018 is recommended
Please note, with respect to component versions it is possible to source from different Intel suites.
Intel ARCH File
CP2K 6.1 includes ARCH files (arch directory) as a starting point for writing an own ARCH-file (note: Linux-x86-64-intelx.*). Remember, performance-critical code is often located in libraries (hence -O2 optimizations for CP2K's source code are enough in almost all cases); more important for performance are target flags such as -mavx2 -mfma. Prior to Intel Compiler 2018, the flags -fp-model source (Fortran) and -fp-model precise (C/C++) were key for passing CP2K's regression tests. If an own ARCH file is used or prepared, all libraries including LIBXSMM need to be built separately and referenced in the link line of the ARCH-file. In addition, CP2K may need to be informed, and certain preprocessor symbols need to be given during compilation (-D compile flag). For further information, please follow the official guide and consider the CP2K Forum in case of trouble.
The purpose of the Intel ARCH files is to avoid writing an own ARCH-file even when the GNU Compiler is used. Taking the Intel ARCH files that are part of the CP2K/Intel fork automatically picks up the correct paths for Intel libraries. These paths are determined by using the environment variables set up when the Intel tools are sourced. Similarly, LIBXSMMROOT (which can be supplied on Make's command line) is discovered automatically if it is in the user's home directory, or when it is in parallel to the CP2K directory. The Intel ARCH files work not only with the CP2K/Intel fork but also when an official release of CP2K is built (which is also encouraged). Of course, one can download the aforementioned Intel ARCH files manually:
cd cp2k-6.1.0/arch
wget https://github.com/hfp/cp2k/raw/master/arch/Linux-x86-64-intelx.arch
wget https://github.com/hfp/cp2k/raw/master/arch/Linux-x86-64-intelx.popt
wget https://github.com/hfp/cp2k/raw/master/arch/Linux-x86-64-intelx.psmp
wget https://github.com/hfp/cp2k/raw/master/arch/Linux-x86-64-intelx.sopt
wget https://github.com/hfp/cp2k/raw/master/arch/Linux-x86-64-intelx.ssmp
Running CP2K may go beyond a single node, and pinning processes and threads becomes even more important. There are several schemes available. As a rule of thumb, a high rank-count for lower node-counts may yield best results unless the workload is very memory intensive. In the latter case, lowering the number of MPI-ranks per node is effective especially if a larger amount of memory is replicated rather than partitioned by the rank-count. In contrast (communication bound), a lower rank count for multi-node computations may be desired.
Most importantly, in most cases CP2K prefers the total rank-count to be a square number, which leads to some complexity when aiming for rank/thread combinations that exhibit good performance properties. Please refer to the documentation of the script for planning MPI/OpenMP-hybrid execution (plan.sh), which illustrates running CP2K's PSMP-binary on an HT-enabled dual-socket system with 24 cores per processor/socket (96 hardware threads). The single-node execution with 16 ranks and 6 threads per rank looks like:
mpirun -np 16 \
  -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch \
  -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD \
  -genv OMP_NUM_THREADS=6 \
  exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
For an MPI command line targeting 8 nodes, plan.sh was used to set up 8 ranks per node with 12 threads per rank:
mpirun -perhost 8 -host node1,node2,node3,node4,node5,node6,node7,node8 \
  -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch \
  -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD \
  -genv OMP_NUM_THREADS=12 -genv I_MPI_DEBUG=4 \
  exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
NOTE: the documentation of plan.sh also motivates and explains the MPI environment variables as shown in the MPI command lines above.
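To illustrate why the square-number preference narrows the choice of rank/thread combinations, the following sketch enumerates all ranks-per-node and threads-per-rank pairs that use every hardware thread of a node and yield a square total rank-count. It is a simplified illustration, assuming a POSIX-like shell; the function names `is_square` and `plan_square` are hypothetical, and this is not the algorithm implemented by plan.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the argument is a perfect square.
is_square() {
  local n=$1 r=0
  while [ $((r * r)) -lt "$n" ]; do r=$((r + 1)); done
  [ $((r * r)) -eq "$n" ]
}

# Hypothetical helper: prints "<ranks-per-node> <threads-per-rank>" for every
# combination that divides the hardware threads of a node evenly and makes
# the total rank-count (nodes * ranks-per-node) a square number.
plan_square() {
  local nodes=$1 hwthreads=$2 ranks=1
  while [ "$ranks" -le "$hwthreads" ]; do
    if [ $((hwthreads % ranks)) -eq 0 ] && is_square $((nodes * ranks)); then
      echo "$ranks $((hwthreads / ranks))"
    fi
    ranks=$((ranks + 1))
  done
}

plan_square 1 96   # single node with 96 hardware threads
plan_square 8 96   # eight such nodes
```

For the system described above, the single-node candidates include 16 ranks with 6 threads each, and the 8-node candidates include 8 ranks per node with 12 threads each, which are the two configurations used in the command lines. Which candidate performs best still depends on the workload.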
The script for planning MPI-execution (plan.sh) is highly recommended along with reading the section about how to run CP2K. For CP2K, the MPI-communication patterns can be tuned in most MPI-implementations. For Intel MPI, the following settings can be beneficial:
export I_MPI_COLL_INTRANODE=pt2pt
export I_MPI_ADJUST_REDUCE=1
export I_MPI_ADJUST_BCAST=1
For large-scale runs, the startup can be tuned, but typically this is not necessary. However, the following may be useful (and does no harm):
export I_MPI_DYNAMIC_CONNECTION=1
export I_MPI_HARD_FINALIZE=1
Intel MPI usually determines the fabric settings for both Omni-Path and InfiniBand well, and no adjustment is needed. However, people often prefer explicit settings even if they do not differ from what is determined automatically. For example, InfiniBand with RDMA can be set explicitly by using mpirun -rdma, which can also be achieved with environment variables:
echo "'mpirun -rdma' and/or environment variables for InfiniBand"
export I_MPI_FABRICS=shm:dapl
export I_MPI_RDMA_TRANSLATION_CACHE=1
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
Once several experiments have finished, it becomes handy to summarize the log output. For this case, an info-script (info.sh) is available that attempts to present a table (summary of all results), which is generated from log files (use tee, or rely on the output of the job scheduler). There are only certain file extensions supported (.log). If no file matches, then all files (independent of the file extension) are attempted to be parsed (which is likely to go wrong at some point). If for some reason the command to launch CP2K is not part of the log and the run-arguments cannot be determined otherwise, the number of nodes is parsed from the filename of the log itself (e.g., the first occurrence of a number along with an optional "n" is treated as the number of nodes used for execution).
./run-cp2k.sh | tee cp2k-h2o64-2x32x2.txt
ls -1 *.txt
cp2k-h2o64-2x32x2.txt
cp2k-h2o64-4x16x2.txt

./info.sh [-best] /path/to/logs-or-cwd
H2O-64             Nodes R/N T/R Cases/d Seconds
cp2k-h2o64-2x32x2  2     32  4   807     107.237
cp2k-h2o64-4x16x2  4     16  8   872     99.962
Please note that the "Cases/d" metric is calculated with integer arithmetic and hence represents fully completed cases per day (based on 86400 seconds per day). The number of seconds (as shown) is end-to-end (wall time), i.e. total time to solution including any (sequential) phase (initialization, etc.). Performance is higher if the workload requires more iterations (some publications present a metric based on iteration time).
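The arithmetic behind a "fully completed cases per day" figure can be sketched as follows. This is only an illustration of the stated formula (86400 seconds divided by the wall time, truncated to an integer), assuming a POSIX-like shell with awk; the function name `cases_per_day` is hypothetical, and info.sh's own implementation is not reproduced here, so its numbers may differ.

```shell
#!/bin/sh
# Hypothetical helper: number of fully completed cases per day
# for a given wall time in seconds (fractional seconds allowed).
cases_per_day() {
  awk -v s="$1" 'BEGIN {
    if (s <= 0) exit 1              # reject non-positive wall times
    printf "%d\n", 86400 / s        # %d truncates towards zero
  }'
}

cases_per_day 120     # a run taking two minutes end-to-end
cases_per_day 100.5   # fractional wall time
```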
There is nothing that can replace the full regression test suite. However, to quickly check whether a build is sane or not, one can run for instance tests/QS/benchmark/H2O-64.inp and check if the SCF iteration prints like the following:
  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 OT DIIS     0.15E+00    0.5     0.01337191     -1059.6804814927 -1.06E+03
     2 OT DIIS     0.15E+00    0.3     0.00866338     -1073.3635678409 -1.37E+01
     3 OT DIIS     0.15E+00    0.3     0.00615351     -1082.2282197787 -8.86E+00
     4 OT DIIS     0.15E+00    0.3     0.00431587     -1088.6720379505 -6.44E+00
     5 OT DIIS     0.15E+00    0.3     0.00329037     -1092.3459788564 -3.67E+00
     6 OT DIIS     0.15E+00    0.3     0.00250764     -1095.1407783214 -2.79E+00
     7 OT DIIS     0.15E+00    0.3     0.00187043     -1097.2047924571 -2.06E+00
     8 OT DIIS     0.15E+00    0.3     0.00144439     -1098.4309205383 -1.23E+00
     9 OT DIIS     0.15E+00    0.3     0.00112474     -1099.2105625375 -7.80E-01
    10 OT DIIS     0.15E+00    0.3     0.00101434     -1099.5709299131 -3.60E-01
  [...]
The column called "Convergence" must monotonically converge towards zero.
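This sanity check can be automated for a log captured with tee. The sketch below assumes a POSIX-like shell with awk and the "OT DIIS" line layout shown above, in which the convergence value is the sixth whitespace-separated field; the function name `check_convergence` is hypothetical. It interprets "monotonically" as strictly decreasing from one step to the next.

```shell
#!/bin/sh
# Hypothetical helper: reads CP2K output on stdin and succeeds only if at
# least one OT step was found and the convergence value decreases each step.
check_convergence() {
  awk '
    $1 ~ /^[0-9]+$/ && $2 == "OT" {
      conv = $6 + 0                       # sixth field: "Convergence"
      if (seen && conv >= prev) bad = 1   # not strictly decreasing
      prev = conv
      seen = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Usage with the first two steps from the listing above.
check_convergence <<'EOF' && echo "convergence decreases monotonically"
     1 OT DIIS     0.15E+00    0.5     0.01337191     -1059.6804814927 -1.06E+03
     2 OT DIIS     0.15E+00    0.3     0.00866338     -1073.3635678409 -1.37E+01
EOF
```

For a real run, the log file would be redirected into the function instead of the inline sample.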
The Intel fork of CP2K was formerly a branch of CP2K's Git-mirror. CP2K is meanwhile natively hosted at GitHub. Ongoing work in the Intel branch was supposed to tightly track the master version of CP2K, which is also true for the fork. In addition, valuable topics may be upstreamed in a timelier fashion. To build CP2K/Intel from source for experimental purpose, one may rely on Intel Compiler 16, 17, or 18 series:
source /opt/intel/compilers_and_libraries_2018.3.222/linux/bin/compilervars.sh intel64
LIBXSMM is automatically built in an out-of-tree fashion when building the CP2K/Intel fork. The only prerequisite is that the LIBXSMMROOT path needs to be detected (or supplied on the make command line). LIBXSMMROOT is discovered automatically if it is in the user's home directory, or when it is in parallel to the CP2K directory. By default (no MIC is given), the build process is carried out by using the -xHost target flag. For example, to explicitly target "Cascadelake" or "Skylake" server ("SKX"):
git clone https://github.com/hfp/libxsmm.git
git clone https://github.com/hfp/cp2k.git
cd cp2k
git submodule update --init --recursive
rm -rf lib obj
make ARCH=Linux-x86-64-intelx VERSION=psmp AVX=3 MIC=0
NOTE: Most if not all hot-spots in CP2K are covered by libraries (e.g., LIBXSMM). It can be beneficial to rely on the GNU Compiler tool-chain. To only use Intel libraries such as Intel MPI and Intel MKL, one can rely on the GNU-key (GNU=1). The GNU tool-chain requires LIBINT, LIBXC, and ELPA to be configured accordingly (e.g., configure-elpa-skx-gnu-omp.sh instead of configure-elpa-skx-omp.sh). To further adjust CP2K at build time, additional key-value pairs (like VERSION=psmp) can be passed on Make's command line when relying on CP2K/Intel's ARCH files.
- SYM: set SYM=1 to include debug symbols into the executable, e.g., helpful with performance profiling.
- DBG: set DBG=1 to include debug symbols, and to generate non-optimized code.
Dynamic allocation of heap memory usually requires global bookkeeping, which can incur overhead in shared-memory parallel regions of an application. For this case, specialized allocation strategies are available. To use such a strategy, memory allocation wrappers can be used to replace the default memory allocation at build-time or at runtime of an application.
To use the malloc-proxy of the Intel Threading Building Blocks (Intel TBB), rely on the TBBMALLOC=1 key-value pair at build-time of CP2K (default: TBBMALLOC=0). Usually, Intel TBB is already available when sourcing the Intel development tools (one can check the TBBROOT environment variable). To use TCMALLOC as an alternative, set TCMALLOCROOT at build-time of CP2K by pointing to TCMALLOC's installation path (configured per ./configure --enable-minimal --prefix=<TCMALLOCROOT>).