EXA2PRO: HiePACS parallel linear solvers


This is a hands-on tutorial introducing some of the Inria HiePACS team's parallel linear solvers.

We will install the software with GNU Guix for the sake of reproducibility.

We will show how to use the solvers on a supercomputer, here PlaFRIM. Note that on this cluster Guix has been installed by the administrators.

Let's log in to PlaFRIM

ssh plafrim

1. Introduction to GNU Guix

1.1. Key features

  • reproducibility: same code on laptop & cluster
  • versatility: “package management”, container provisioning, etc.
  • transactional upgrades and rollback
  • PlaFRIM support!

How to install Guix:

# Guix requires a running GNU/Linux system, GNU tar and Xz.
gpg --keyserver pgp.mit.edu --recv-keys 3CE464558A84FDC69DB40CFB090B11993D9AEBB5
wget https://git.savannah.gnu.org/cgit/guix.git/plain/etc/guix-install.sh
chmod +x guix-install.sh
sudo ./guix-install.sh

1.2. Installation, Removal

# look for available packages
guix search ^openmpi

# list installed packages
guix package -I

# install packages
guix install hwloc gcc-toolchain
which lstopo

# where the installed files are located
ls ~/.guix-profile/bin
cat ~/.guix-profile/etc/profile

# remove packages
guix remove hwloc gcc-toolchain

# list generations
guix package --list-generations

# go back to a previous state of the package installation
guix package --roll-back
guix package --switch-generation=+1

1.3. Update and Upgrade

# update Guix and the list of available packages
guix pull

# upgrade installed packages
guix upgrade

1.4. Reproduce

# current guix commit
guix --version

# rewind package recipes to a past version
guix pull --commit=adfbba4

# save current guix state (commits of guix and of additional channels)
guix describe -f channels > channels-date.scm

# restore guix state (+ external channels) to a past version
guix pull --allow-downgrades -C channels-date.scm

1.5. Environment

# setup a development environment ready to be used for compiling petsc (i.e. compilers + dependencies)
guix environment petsc

# use the following to get an environment isolated from the system one
guix environment --pure petsc -- /bin/bash --norc
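
Note that in recent Guix releases the guix environment command is superseded by guix shell; assuming such a release, an equivalent of the commands above would be along these lines:

# development environment with the dependencies of petsc
guix shell -D petsc

# isolated (pure) variant
guix shell --pure -D petsc -- /bin/bash --norc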

1.6. Docker and Singularity

# create docker image
guix pack -f docker petsc -S /bin=bin --entry-point=/bin/bash

# create singularity image
guix pack -f squashfs petsc -S /bin=bin --entry-point=/bin/bash
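
guix pack prints the path of the generated image in /gnu/store. As a sketch of how to consume it (the store paths below are placeholders, use the one printed on your machine):

# docker: load the generated tarball (placeholder path); docker load prints the image name to use with docker run
docker load < /gnu/store/...-docker-pack.tar.gz

# singularity: use the squashfs image (placeholder path) directly
singularity shell /gnu/store/...-pack.squashfs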

1.7. Get additional Inria packages

See our Gitlab project guix-hpc-non-free where packages for installing Chameleon, PaStiX and MaPHyS can be found.

It is necessary to add a channel to get new packages. Create a ~/.config/guix/channels.scm file with the following snippet:

(cons (channel
    (name 'guix-hpc-non-free)
    (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
  %default-channels)

Update the guix package definitions

guix pull

Make the new guix available in the PATH

PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
hash guix

For further shell sessions, add this to the ~/.bash_profile file

export PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"

New packages are now available

guix search ^starpu
guix search ^chameleon
guix search ^pastix
guix search ^maphys

2. Chameleon dense linear solver

2.1. Installation

2.1.1. Using Guix

guix install openssh openmpi chameleon --with-input=openblas=mkl
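
If the Guix profile is not already sourced in your shell, the usual reminder applies (not specific to Chameleon):

# make the freshly installed packages visible in the current shell
GUIX_PROFILE="$HOME/.guix-profile"
. "$GUIX_PROFILE/etc/profile"
which chameleon_stesting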

2.1.2. Using CMake

Without Guix, one has to install the following components:

  • CMake
  • GNU GCC
  • MPI (e.g. Open MPI),
  • Intel MKL (or OpenBLAS),
  • Nvidia CUDA/cuBLAS (optional),
  • StarPU (e.g. latest release 1.3.7)

Most of them should already be provided through modules; for StarPU:

# on your laptop
wget https://files.inria.fr/starpu/starpu-1.3.7/starpu-1.3.7.tar.gz
scp starpu-1.3.7.tar.gz plafrim:

# install starpu on the cluster
ssh plafrim

# load the modules for compilers (C/C++, Fortran), MPI, CUDA
module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add compiler/cuda/11.2
module add hardware/hwloc/2.1.0
module add mpi/openmpi/4.0.2
module add linalg/mkl/2019_update4

tar xvf starpu-1.3.7.tar.gz
cd starpu-1.3.7
export STARPU_ROOT=$PWD/install
./configure --prefix=$STARPU_ROOT
make -j5 install
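
As an optional sanity check of the StarPU installation, one of the tools shipped with StarPU can be run, for example:

# display the machine topology (workers) as seen by StarPU
$STARPU_ROOT/bin/starpu_machine_display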

Then one can install Chameleon using CMake, e.g.

# on your laptop
wget https://gitlab.inria.fr/solverstack/chameleon/uploads/6115a9dd4dc21da4aed13a14b2de49d3/chameleon-1.0.0.tar.gz
scp chameleon-1.0.0.tar.gz plafrim:

# install chameleon on the cluster
ssh plafrim
tar xvf chameleon-1.0.0.tar.gz
cd chameleon-1.0.0
mkdir build && cd build

module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add compiler/cuda/11.2
module add hardware/hwloc/2.1.0
module add mpi/openmpi/4.0.2
module add linalg/mkl/2019_update4
# starpu should be available in the environment, for example with
export PKG_CONFIG_PATH=$STARPU_ROOT/lib/pkgconfig:$PKG_CONFIG_PATH

# configure, make, install
cmake .. -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_CUDA=ON -DBLA_VENDOR=Intel10_64lp_seq -DCMAKE_INSTALL_PREFIX=$PWD/install
make -j5 install

# test
mpiexec -np 2 ./testing/chameleon_stesting -o gemm -n 2000 -t 2

2.2. Using Chameleon to test dense linear algebra performance

Chameleon provides executables to test dense linear algebra algorithms. These drivers allow one to

  • check numerical results,
  • time the execution, and
  • deduce the performance by computing the gigaflop rate, i.e. the number of operations performed per second.

By default four executables are provided, one for each arithmetic precision: chameleon_stesting for single precision, and the equivalents for the other precisions with d (double), c (single complex), z (double complex).

To get all parameters

chameleon_stesting --help

See also the documentation for the available algorithms that can be checked.

For example, to check the performance of GEMM in single or double precision using 2 threads (with the Guix installation here):

mpiexec -np 1 chameleon_stesting -o gemm -n 3200 -t 2
mpiexec -np 1 chameleon_dtesting -o gemm -n 3200 -t 2

The single precision run takes around 0.25 s and the double precision one around 0.5 s, which gives around 270 Gflop/s and 130 Gflop/s respectively.
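
As a sanity check of these figures: a GEMM on square matrices of order \(n\) performs about \(2n^3\) floating point operations, so with \(n = 3200\) we get \(2 \times 3200^3 \approx 6.6 \times 10^{10}\) flop, i.e. roughly \(6.6 \times 10^{10} / 0.25 \approx 260\) Gflop/s in single precision and half of that in double precision, consistent with the rates reported above.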

To check a Cholesky factorization

mpiexec -np 1 chameleon_stesting -o potrf -n 3200 -t 2

which gives around 245 Gflop/s in this configuration.

Let's reserve some machines with Slurm to check performance on entire nodes

# slurm reservation
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=1 --threads-per-core=1

# check nodes obtained
mpiexec hostname

Chameleon on one node

mpiexec -np 1 chameleon_stesting -o potrf -n 3200,6400,16000,32000

gives

Id;Function;threads;gpus;P;Q;nb;uplo;n;lda;seedA;time;gflops
0;spotrf;35;0;1;1;320;121;3200;3200;846930886;1.301987e-02;8.393163e+02
1;spotrf;35;0;1;1;320;121;6400;6400;1681692777;3.153522e-02;2.771562e+03
2;spotrf;35;0;1;1;320;121;16000;16000;1714636915;3.651932e-01;3.739011e+03
3;spotrf;35;0;1;1;320;121;32000;32000;1957747793;3.038234e+00;3.595240e+03

Chameleon on two nodes

mpiexec -np 2 chameleon_stesting -o potrf -n 3200,6400,16000,32000,64000

gives

Id;Function;threads;gpus;P;Q;nb;uplo;n;lda;seedA;time;gflops
0;spotrf;35;0;1;2;320;121;3200;3200;846930886;1.988709e-02;5.494916e+02
1;spotrf;35;0;1;2;320;121;6400;6400;1681692777;4.519280e-02;1.933976e+03
2;spotrf;35;0;1;2;320;121;16000;16000;1714636915;2.688228e-01;5.079411e+03
3;spotrf;35;0;1;2;320;121;32000;32000;1957747793;2.119557e+00;5.153520e+03
4;spotrf;35;0;1;2;320;121;64000;64000;424238335;1.366869e+01;6.392958e+03

Try with different algorithms to compare

mpiexec -np 1 chameleon_stesting --nowarmup -o gemm -n 3200,6400,16000,32000
mpiexec -np 2 chameleon_stesting --nowarmup -o gemm -n 32000

mpiexec -np 1 chameleon_stesting --nowarmup -o symm -n 3200,6400,16000,32000
mpiexec -np 2 chameleon_stesting --nowarmup -o symm -n 32000

mpiexec -np 1 chameleon_stesting --nowarmup -o getrf -n 3200,6400,16000,32000
mpiexec -np 2 chameleon_stesting --nowarmup -o getrf -n 32000

mpiexec -np 1 chameleon_stesting --nowarmup -o geqrf -n 3200,6400,16000,32000
mpiexec -np 2 chameleon_stesting --nowarmup -o geqrf -n 32000

mpiexec -np 1 chameleon_stesting --nowarmup -o gels -n 3200,6400,16000,32000
mpiexec -np 2 chameleon_stesting --nowarmup -o gels -n 32000

Be aware that Chameleon distributes the data over the MPI processes following a 2D block-cyclic scheme, whose parameters can be set by the user with the -P option. Depending on the matrix size and shape and on the number of MPI processes, this parameter can play a key role in performance.
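
For instance, with the 2 MPI processes reserved above, adding -P 2 requests a 2x1 process grid instead of the default 1x2 (an illustrative run; see chameleon_stesting --help for the exact semantics of -P):

mpiexec -np 2 chameleon_stesting --nowarmup -o gemm -n 32000 -P 2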

These testing drivers do not cover the eigenvalue and singular value problems (gesvd and heevd) for now, because these algorithms are not ready to be used in an MPI distributed context. Using gesvd and heevd in a shared-memory context is nevertheless fine. Getting good performance on several nodes for these problems is future work in Chameleon.

2.3. Analyzing performance with execution traces

Let's now observe some execution traces thanks to StarPU and ViTE.

By default Chameleon is built without tracing enabled (because it slows down execution a little). To enable this feature one has to install StarPU with the FxT component, see the documentation.

We choose to install it with Guix here

# remove the default version of chameleon
guix remove chameleon

# install a version with traces enabled
guix install chameleon --with-input=openblas=mkl --with-input=starpu=starpu-fxt starpu-fxt
# slurm reservation
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=1 --threads-per-core=1

# indicate where to generate profiling files
export STARPU_FXT_PREFIX=$PWD/

# Chameleon potrf execution
mpiexec -np 1 chameleon_stesting --nowarmup -o potrf -n 24000 -b 480 --trace

# generate the paje trace
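# note: StarPU names its profiling files prof_file_<username>_<id>, hence 'pruvost' (the author's login) below; adapt the pattern to your own user name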
starpu_fxt_tool -i prof_file_pruvost_* -o chameleon_spotrf.paje

# Chameleon posv execution
mpiexec -np 1 chameleon_stesting --nowarmup -o posv -n 24000 -b 480 --trace
starpu_fxt_tool -i prof_file_pruvost_* -o chameleon_sposv.paje

# Chameleon gemm execution
mpiexec -np 2 chameleon_stesting --nowarmup -o gemm -n 24000 -b 480 --trace
starpu_fxt_tool -i prof_file_pruvost_* -o chameleon_sgemm.paje

One can now visualize the traces with ViTE

# on your laptop
sudo apt install vite

# get the paje file back to your laptop
scp plafrim:*.paje .

# load the file with ViTE
vite chameleon_spotrf.paje &
vite chameleon_sposv.paje &
vite chameleon_sgemm.paje &

Table 1: Traces comparison POTRF vs. POSV (images: chameleon_spotrf.png, chameleon_sposv.png)

Table 2: Trace GEMM and zoom (images: chameleon_sgemm.png, chameleon_sgemm_2.png)

2.4. Using Chameleon with GPUs

First install Chameleon with CUDA enabled, for example with Guix

guix remove chameleon starpu-fxt

guix install chameleon-cuda --with-input=openblas=mkl --with-input=starpu-cuda=starpu-cuda-fxt starpu-cuda-fxt
# slurm reservation
salloc --nodes=1 --time=01:00:00 --constraint sirocco --exclusive --ntasks-per-node=1 --threads-per-core=1

# indicate where to generate profiling files
export STARPU_FXT_PREFIX=$PWD/

# use LD_PRELOAD so that the Guix-installed binaries find the NVIDIA driver libraries on the node with GPUs
export LD_PRELOAD="/usr/lib64/libcuda.so /usr/lib64/libnvidia-fatbinaryloader.so.440.33.01"

# Chameleon potrf execution
mpiexec -np 1 chameleon_stesting --nowarmup -o potrf -n 76800 -b 1920 -g 2 --trace

delivers something like 22 Tflop/s on 2 V100 GPUs.

Figure 1: Execution trace of Chameleon spotrf N=32000, B=1600 (image: chameleon_spotrf_gpu.png)

2.5. Using Chameleon in your C/C++ project
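
No hands-on is given here. As a minimal sketch, assuming the Guix installation above (which ships a pkg-config file for Chameleon), a C or C++ code calling the Chameleon API could be compiled along these lines, where my_prog.c is a hypothetical source file:

# compile and link against Chameleon through pkg-config
gcc my_prog.c -o my_prog $(pkg-config --cflags --libs chameleon)

See the Chameleon documentation for the API itself and for complete usage examples.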

2.6. Issues

Do not hesitate to submit issues by sending an email.

3. PaStiX sparse direct solver

The solver is currently evolving and two versions are available. The first, historical version is available on the Inria forge. This version will be deprecated in favor of the current version under development, which is publicly available on Gitlab. The main difference is the MPI support, which is not yet fully integrated into the new PaStiX; on the other hand the new version includes all the recent developments on low-rank compression and on runtime support for GPUs.

3.1. Installation

3.1.1. Using Guix

guix install openssh openmpi pastix --with-input=openblas=mkl

3.1.2. Using CMake

PaStiX needs several libraries to be compiled and installed on your cluster:

  • An implementation of the Message Passing Interface 2.0 standard like MPICH, Open MPI, …
  • An implementation of the Basic Linear Algebra Subroutines standard (BLAS) like OpenBLAS, ACML, Intel MKL, …
  • A library to compute an ordering that reduces fill-in and increases parallelism, like Scotch, PT-Scotch or Metis. The user can also provide their own ordering.
  • The Hardware Locality library (hwloc), not required but highly recommended, especially if PaStiX is compiled with thread support. It gives a uniform view of the architecture to the solver (hide hyper-threading, linearize core numbering, …).
  • A runtime system, PaRSEC or StarPU (optional). An installation example for StarPU was given in the previous section about Chameleon.

# on your laptop
wget https://gitlab.inria.fr/solverstack/pastix//uploads/98ce9b47514d0ee2b73089fe7bcb3391/pastix-6.1.0.tar.gz
scp pastix-6.1.0.tar.gz plafrim:

ssh plafrim

# load the modules for dependencies
module purge
module add build/cmake/3.15.3
module add compiler/gcc/8.2.0
module add compiler/cuda/10.2
module add hardware/hwloc/2.1.0
module add mpi/openmpi/4.0.2
module add linalg/mkl/2019_update4
module add partitioning/scotch/int64/6.0.9
module add trace/eztrace/1.1-9
module add runtime/parsec/master/shm-cuda
module add trace/fxt/0.3.11
module add runtime/starpu/1.3.7/mpi-cuda-fxt

tar xvf pastix-6.1.0.tar.gz
cd pastix-6.1.0
# patch to apply to cmake_modules/morse_cmake/modules/find/FindPARSEC.cmake +567
#  list(APPEND REQUIRED_LIBS "${CUDA_CUBLAS_LIBRARIES};${CUDA_LIBRARIES}")
mkdir build && cd build

# configure, make, install
export PASTIX_ROOT=$PWD/install
cmake .. -DPASTIX_ORDERING_SCOTCH=ON -DPASTIX_WITH_PARSEC=ON -DPASTIX_WITH_STARPU=ON -DPASTIX_WITH_CUDA=ON -DPASTIX_WITH_MPI=ON -DPASTIX_WITH_EZTRACE=ON -DBLA_VENDOR=Intel10_64lp_seq -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$PASTIX_ROOT
make -j5 install

# test
./example/simple --lap 1000 --threads 2

3.2. Using PaStiX to test sparse linear algebra performance

# slurm reservation
salloc --nodes=1 --time=01:00:00 --constraint sirocco --exclusive --ntasks-per-node=1 --threads-per-core=1

# check nodes obtained
mpiexec hostname

To run a simple example on a Laplacian problem

# test laplacian
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/simple --lap 100000

3.2.1. Cholesky factorization

Perform a Cholesky factorization on matrices; see the parameters:

  • -f 0 to apply a Cholesky factorization
  • -s to choose the runtime used as scheduler (1 for native PaStiX, 2 for PaRSEC, 3 for StarPU)
  • -g the number of GPUs
  • -i to set integer parameters (iparm), e.g. the minimum block size to avoid too small blocks on the GPUs

# static scheduling without gpus (700 GFlops/s)
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 0 --mm /projets/matrix/GHS_psdef/audikw_1/audikw_1.mtx

# using parsec runtime with 1 or 2 gpus (900 GFlops/s and 1.1 TFlops/s)
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 0 --mm /projets/matrix/GHS_psdef/audikw_1/audikw_1.mtx -s 2 -g 1 -i iparm_min_blocksize 920 -i iparm_max_blocksize 920
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 0 --mm /projets/matrix/GHS_psdef/audikw_1/audikw_1.mtx -s 2 -g 2 -i iparm_min_blocksize 920 -i iparm_max_blocksize 920

See also these hints to optimize pastix with GPUs.

3.2.2. LU factorization

Perform an LU factorization on matrices; see the parameters:

  • -f 2 to apply an LU factorization
  • -s to choose the runtime used as scheduler (1 for native PaStiX, 2 for PaRSEC, 3 for StarPU)
  • -g the number of GPUs
  • -i to set integer parameters (iparm), e.g. the minimum block size to avoid too small blocks on the GPUs
  • -d to set double precision parameters (dparm), e.g. the low-rank compression tolerance

# static scheduling without gpus (440 GFlops/s)
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 2 --mm /projets/matrix/Bourchtein/atmosmodj/atmosmodj.mtx

# using parsec runtime with 2 gpus (1 TFlops/s)
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 2 --mm /projets/matrix/Bourchtein/atmosmodj/atmosmodj.mtx -s 2 -g 2 -i iparm_min_blocksize 920 -i iparm_max_blocksize 920

# using low-rank compression (1 TFlops/s, considering the full-rank number of operations)
mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 2 --mm /projets/matrix/Bourchtein/atmosmodj/atmosmodj.mtx -i iparm_compress_when 2 -d dparm_compress_tolerance 1e-8

3.2.3. Generate traces with EZTrace

When StarPU is used, we can generate FxT traces the same way as we saw in the Chameleon section.

There is an alternative, more generic method that works with any runtime, using EZTrace.

First, EZTrace should be installed and -DPASTIX_WITH_EZTRACE=ON must be used during the project configuration.

Before the execution, we tell EZTrace which kernel family we want to trace, here the PaStiX kernels:

# setup eztrace modules for traces
export EZTRACE_TRACE="kernels"
export EZTRACE_LIBRARY_PATH=$PWD/kernels

Perform the execution

mpiexec --map-by ppr:1:node:pe=40 $PASTIX_ROOT/examples/bench_facto --threads 40 -f 0 --mm /projets/matrix/GHS_psdef/audikw_1/audikw_1.mtx

Generate the trace and see the kernel statistics.
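Note that EZTrace names its log files <username>_eztrace_log_rank_<rank>, hence the 'pruvost' below; adapt it to your own user name.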

eztrace_stats -o output pruvost_eztrace_log_rank_1
eztrace_convert -o output pruvost_eztrace_log_rank_1
mv output.trace audikw_1.paje

This paje trace can be read with ViTE.

3.4. Issues

Do not hesitate to submit issues by sending an email.

4. MaPHyS sparse hybrid solver

4.1. Installation

4.1.1. Using Guix

guix install openssh openmpi maphys --with-input=mumps-openmpi=mumps-mkl-openmpi --with-input=openblas=mkl --with-input=lapack=mkl
export MAPHYS_ROOT=`guix package -I maphys |gawk '{print $4}'`

4.1.2. Using CMake

MaPHyS needs several libraries to be compiled and installed on your cluster:

  • An implementation of the Message Passing Interface 2.0 standard like MPICH, Open MPI, …
  • An implementation of the Basic Linear Algebra Subroutines standards (BLAS, LAPACK and ScaLAPACK) like OpenBLAS, ACML, Intel MKL, …
  • The Scotch library to compute an ordering that reduces fill-in and increases parallelism. The user can also provide their own ordering.
  • The Hardware Locality library (hwloc).
  • A sparse direct solver: MUMPS or PaStiX (6.0.3 for now, due to a compatibility issue).

# on your laptop
wget https://gitlab.inria.fr/solverstack/pastix//uploads/bd29252eb0390fa7bcb4e4d238c5b30a/pastix-6.0.3.tar.gz
wget http://morse.gforge.inria.fr/maphys/maphys-1.0.0.tar.gz
scp pastix-6.0.3.tar.gz plafrim:
scp maphys-1.0.0.tar.gz plafrim:

ssh plafrim

# load the modules for dependencies
module purge
module add build/cmake/3.15.3
module add compiler/gcc/8.2.0
module add hardware/hwloc/2.1.0
module add mpi/openmpi/4.0.2
module add linalg/mkl/2019_update4
module add partitioning/scotch/int64/6.0.9


# install pastix
tar xvf pastix-6.0.3.tar.gz
cd pastix-6.0.3
# patch to apply to cmake_modules/morse_cmake/modules/find/FindPARSEC.cmake +567
#  list(APPEND REQUIRED_LIBS "${CUDA_CUBLAS_LIBRARIES};${CUDA_LIBRARIES}")
mkdir build && cd build
export PASTIX_ROOT=$PWD/install
cmake .. -DPASTIX_ORDERING_SCOTCH=ON -DBLA_VENDOR=Intel10_64lp_seq -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$PASTIX_ROOT
make -j5 install
export PKG_CONFIG_PATH=$PASTIX_ROOT/lib/pkgconfig:$PKG_CONFIG_PATH

# install maphys
cd $HOME
tar xvf maphys-1.0.0.tar.gz
cd maphys-1.0.0/
mkdir build && cd build
export MAPHYS_ROOT=$PWD/install
cmake .. -DMAPHYS_SDS_MUMPS=OFF -DMAPHYS_SDS_PASTIX=ON -DBLA_VENDOR=Intel10_64lp -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$MAPHYS_ROOT
make -j5 install

4.2. Using MaPHyS drivers to test hybrid solver behavior

4.2.1. A first execution

# slurm reservation
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive

# check nodes obtained
mpiexec hostname

We will use the dmph_examplekv driver provided by maphys to do standalone testing (as opposed to maphys being used as a third-party library of an application). maphys supports four precisions and there is a specific driver <arith>mph_examplekv for each of them, where <arith> may be:

  • s for single precision, real arithmetic;
  • d for double precision, real arithmetic;
  • c for single precision, complex arithmetic;
  • z for double precision, complex arithmetic.

The driver takes its input via free-form structured files (we will construct a simple.in input file throughout this section). To launch these tests, you specify in this file the matrix file to process (bcsstk17.mtx in this section) and the maphys solver options you want to set up. These options are described in the input files and in the documentation; we are using maphys version 1.0.0, see the associated refcard and users' guide. The name of the input file (simple.in) is given as the argument to the driver.

We will play with the bcsstk17.mtx matrix from the SuiteSparse Matrix Collection. Read its properties on the website. Note that it is a real, symmetric positive definite (SPD) \(10,974\) by \(10,974\) matrix with \(428,650\) nonzero elements, arising from a structural problem. This one is shipped with maphys, but you may browse and download other matrices from the collection and play with those instead.

Note that, based on the extension of the matrix file, the driver can handle:

  • '.ijv': coordinate format;
  • '.mm' or '.mtx': Matrix Market;
  • '.rsa', '.rua', '.rse', '.rue', '.csa', '.cua', '.cse', '.cue': Harwell-Boeing.

The '.psa', '.pua', '.pse', '.pue' Harwell-Boeing extensions are not supported.

We want to instruct the maphys dmph_examplekv driver to run our sample matrix. Copy these lines into the file simple.in:

MATFILE = bcsstk17.mtx
SYM = 2                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)

Then copy the matrix next to the input file and run the driver:

cp $MAPHYS_ROOT/lib/maphys/examples/bcsstk17.mtx .
mpirun -n 2 $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv simple.in

In this run we can observe that we indeed run in real, double precision (Real (KIND=8): double precision). No right-hand side has been provided (no rhs: solution (1, 2, 3, ...) -> b), so the driver generates a solution vector (1, 2, 3, …) and computes the associated right-hand side through a matrix-vector product.

4.2.2. An actual hybrid method

mpirun -n 4 $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv simple.in

Which iterative solver is being used? You can play with grep "CONVERGENCE HISTORY" on the output to follow the iterations of the iterative solver.

4.2.3. Exploit SPD properties

Update simple.in to declare the matrix as SPD and rerun the driver:

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)

Which iterative solver is used now? Why? What are the potential benefits (not representative for a test case as small as this one)?

4.2.4. Try running on more subdomains

As mentioned above, in the current maphys version, one process is associated with one subdomain. In the previous test case, we thus ran on four subdomains. We can try running on more subdomains.

As mentioned above, we are restricted to a power of 2 in this setup. If you run mpirun -n 5 ./dmph_examplekv simple.in, you'll get an error:

ERROR: MaPHyS partitioner only support *a power of 2* subdomains. nb subdomains=              5

Try a run on 16 domains:

mpirun -n 16 $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv simple.in

You should get the following error:

ERROR: PART_scotch_partgraph while asserting cblknbr == 2*nbdom -1, cblknbr =             30

4.2.5. Use a robust algebraic domain decomposition

We'll use the Parallel Algebraic Domain Decomposition for Linear systEms (paddle) to perform the decomposition instead of the dummy internal default maphys algorithm.

Check out the maphys refcard. You'll see that you can activate paddle with ICNTL(49) = 2.

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2          # algebraic domain decomposition algorithm (1=internal dummy algorithm, 2=robust paddle)

Try and run on 16 domains once again in this configuration.

mpirun -n 16 $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv simple.in

4.2.6. Backward and forward error analysis

Check out the value of the backward error of the previous execution (you may add | grep "Backward error" at the end of your mpirun command). Is the stopping criterion satisfied, knowing that RCNTL(21) controls this threshold and that, if you did not play with that parameter, you should observe the following?

Wanted precision               RCNTL (   21)  1.000E-05

Check out the direct error (you may add | grep "Computed (x) vs analytical (x*) solution" at the end of your mpirun command). How can you explain that? Check out the condition number of bcsstk17: you can view that online.

4.2.7. Sparse direct solver

Set up ICNTL(13) to run either pastix or mumps as the internal direct solver.
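
For example, following the audikw_1 input file shown later in this tutorial, PaStiX is selected with the line below; the value selecting MUMPS is listed in the refcard:

ICNTL(13) = 2          # internal sparse direct solver (2 = pastix)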

4.2.8. Display more info

Set up ICNTL(5) and ICNTL(6) to view more information.

4.2.9. Sparsification

So far we were relying on the default preconditioning mechanism, resulting in locally dense block diagonal matrices. Investigate the sparsified block diagonal factorization: the dense diagonal block is first sparsified by dropping entry \(s_{i,j}\) if it is lower than \(\xi ( \vert s_{i,i} \vert + \vert s_{j,j}\vert )\). A sparse factorization is then performed in this situation instead of a LAPACK dense factorization.

Enable sparsification (ICNTL(21)=2) and set up \(\xi\) = RCNTL(11) = 1.0e-5, 1.0e-4, or 1.0e-3.
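
A possible simple.in for this exercise, keeping the SPD and paddle settings from above (a sketch; rerun with the different RCNTL(11) values to compare):

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2          # robust paddle domain decomposition
ICNTL(21) = 2          # sparsified block diagonal preconditioner
RCNTL(11) = 1.0e-4     # sparsification dropping threshold (the xi above)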

4.2.10. Force GMRES

  1. Normal

    Continue to let the solver know that the matrix is SPD (SYM=1) while forcing it to use GMRES (ICNTL(20)=1). Note that this is certainly not a good thing to do in practice, but it is interesting to study.

  2. Orthogonalisation

    Change the orthogonalisation scheme (ICNTL(22)). Does it affect convergence? How?

4.2.11. Multiple right-hand sides (fabulous)

Turn on (ICNTL(20)=4) the block Krylov fabulous iterative solver (see more here) and possibly play with ICNTL(29): we are curious.

4.2.12. Larger testcase

Let's try with a larger matrix, audikw_1

MATFILE=/projets/matrix/GHS_psdef/audikw_1/audikw_1.mtx
SYM=2                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(13)=2          # to use pastix
ICNTL(49)=2          # to use paddle
RCNTL(21)=1.0e-9     # error tolerance on the solution

Execute MaPHyS using 2 nodes of 36 cores and 64 MPI processes (a power of 2) handling 1 subdomain each (default method):

salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=32 --threads-per-core=1
mpirun $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv audikw_1.in

It does not converge after 100 iterations.

Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  1.636E+02  1.636E+02  1.636E+02  3.983E-05
Timing -> Spent in all steps               RINFO (   20)  1.547E+02  1.631E+02  1.583E+02  1.772E+00
Timing -> Analyze                          RINFO (    4)  3.803E+01  3.808E+01  3.804E+01  1.035E-02
Timing -> Factorise                        RINFO (    5)  1.050E+00  3.283E+00  2.056E+00  4.397E-01
Timing -> Precond                          RINFO (    6)  2.350E+00  9.033E+00  4.951E+00  1.379E+00
Timing -> Solve                            RINFO (    7)  1.132E+02  1.132E+02  1.132E+02  1.376E-04

4.2.13. MPI + Threads

Add the following to the input file to test using MPI+Threads. It uses multithreaded PaStiX for sparse LU factorizations on the local systems and multithreaded LAPACK for dense LU factorizations on the interfaces.

Test with 2 subdomains:

ICNTL(42) = 1        # enables multithreading inside subdomains
ICNTL(37) = 2        # number of physical computing machines (nodes)
ICNTL(38) = 36       # number of CPU cores on each machine
ICNTL(39) = 36       # number of threads per MPI process
ICNTL(40) = 2        # number of MPI processes (also equal to the number of subdomains)

Execute MaPHyS using 2 nodes of 36 cores and 2 MPI processes (1 per node) handling 1 subdomain each. 36 threads are used in each MPI process.

salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=1 --threads-per-core=1
mpirun $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv audikw_1.in

It converges after 15 iterations.

Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  1.120E+02  1.120E+02  1.120E+02  1.410E-05
Timing -> Spent in all steps               RINFO (   20)  1.069E+02  1.118E+02  1.094E+02  2.458E+00
Timing -> Analyze                          RINFO (    4)  6.573E+01  6.573E+01  6.573E+01  4.881E-04
Timing -> Factorise                        RINFO (    5)  2.239E+01  2.737E+01  2.488E+01  2.490E+00
Timing -> Precond                          RINFO (    6)  2.382E+00  2.445E+00  2.414E+00  3.152E-02
Timing -> Solve                            RINFO (    7)  1.635E+01  1.635E+01  1.635E+01  4.778E-04

Test with 4 subdomains:

ICNTL(37) = 2
ICNTL(38) = 36
ICNTL(39) = 18
ICNTL(40) = 4

Execute MaPHyS using 2 nodes of 36 cores and 4 MPI processes (2 per node) handling 1 subdomain each. 18 threads are used in each MPI process.

salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=2 --threads-per-core=1
mpirun $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv audikw_1.in

It converges after 30 iterations.

Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  8.062E+01  8.062E+01  8.062E+01  1.838E-05
Timing -> Spent in all steps               RINFO (   20)  7.212E+01  7.897E+01  7.688E+01  2.802E+00
Timing -> Analyze                          RINFO (    4)  4.911E+01  4.912E+01  4.912E+01  6.281E-03
Timing -> Factorise                        RINFO (    5)  1.072E+01  1.573E+01  1.364E+01  1.819E+00
Timing -> Precond                          RINFO (    6)  1.726E+00  5.138E+00  3.555E+00  1.222E+00
Timing -> Solve                            RINFO (    7)  1.057E+01  1.057E+01  1.057E+01  1.168E-03

Test with 8 subdomains:

ICNTL(37) = 2
ICNTL(38) = 36
ICNTL(39) = 9
ICNTL(40) = 8

Execute MaPHyS using 2 nodes of 36 cores and 8 MPI processes (4 per node) handling 1 subdomain each. 9 threads are used in each MPI process.

salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=4 --threads-per-core=1
mpirun $MAPHYS_ROOT/lib/maphys/examples/dmph_examplekv audikw_1.in

It converges after 52 iterations.

Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  7.836E+01  7.836E+01  7.836E+01  2.204E-05
Timing -> Spent in all steps               RINFO (   20)  6.770E+01  7.832E+01  7.289E+01  3.462E+00
Timing -> Analyze                          RINFO (    4)  4.155E+01  4.158E+01  4.157E+01  1.017E-02
Timing -> Factorise                        RINFO (    5)  5.648E+00  1.013E+01  7.970E+00  1.505E+00
Timing -> Precond                          RINFO (    6)  2.133E+00  8.242E+00  4.982E+00  2.057E+00
Timing -> Solve                            RINFO (    7)  1.837E+01  1.837E+01  1.837E+01  9.937E-04

Same situation but using dense factorization on the local Schur systems (instead of sparse with threshold at 1e-4). Add ICNTL(21) = 1 in the input file. It converges in 45 iterations instead of 52.

Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  7.333E+01  7.333E+01  7.333E+01  1.526E-05
Timing -> Spent in all steps               RINFO (   20)  5.565E+01  7.245E+01  6.262E+01  5.843E+00
Timing -> Analyze                          RINFO (    4)  4.164E+01  4.167E+01  4.165E+01  1.010E-02
Timing -> Factorise                        RINFO (    5)  5.653E+00  1.027E+01  7.855E+00  1.538E+00
Timing -> Precond                          RINFO (    6)  1.000E+00  1.399E+01  5.752E+00  4.457E+00
Timing -> Solve                            RINFO (    7)  7.361E+00  7.364E+00  7.361E+00  8.887E-04

4.2.14. Coarse Grain Correction

You could also try the coarse grid correction (CGC) method, ICNTL(21)=10, which improves convergence when many subdomains are used. Here, because the matrix is rather small, CGC is not useful.

4.3. Using MaPHyS in your C/Fortran code (soon C++)

4.4. Issues

Do not hesitate to submit issues by sending an email.

Author: Florent Pruvost

Created: 2024-06-27 Thu 12:48
