GPU and MPI Slurm Jobs
Here you will find more example job scripts for the Slurm batch system.
All examples run for a maximum of 15 minutes.
Copy the contents of any example to a new file in the HPC file systems.
You can then run the example with sbatch filename.sh.
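The copy-and-submit workflow can be sketched as follows. The file name hostname-job.sh is a hypothetical choice; the script contents are the serial example from below, and the sbatch step only works on a cluster login node:

```shell
# Write the serial example to a new file in an HPC file system.
cat > hostname-job.sh <<'EOF'
#!/usr/bin/zsh
#SBATCH --time=00:15:00
#SBATCH --job-name=SERIAL_JOB
#SBATCH --output=output.%J.txt
hostname
EOF

# On a login node you would now submit it with:
#   sbatch hostname-job.sh
# and later inspect output.<jobid>.txt.
head -1 hostname-job.sh   # shows the interpreter line of the new script
```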
Serial
A sample job that runs the program hostname and does not use MPI:
#!/usr/bin/zsh
### Maximum runtime before programs are killed by Slurm
### Your programs should end before this!
#SBATCH --time=00:15:00
### Name the job
#SBATCH --job-name=SERIAL_JOB
### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Begin of executable commands
hostname
OpenMP
A sample program ./a.out that runs on a single node, uses only OpenMP multithreading, and requires 8 CPUs:
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
### Ask for 8 CPUs on the same node: 1 task with 8 CPUs per task
#SBATCH --ntasks=1          ### 1 task
#SBATCH --cpus-per-task=8   ### 1*8 = 8 CPUs
#SBATCH --nodes=1
### Name the job
#SBATCH --job-name=OPENMP_JOB
### Declare an output STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Beginning of your programs
### Note: The OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK envvar is set automatically
### as a convenience. Expert users can modify it:
### OMP_NUM_THREADS=
./a.out
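The note above says OMP_NUM_THREADS is derived from the allocation. A minimal sketch of that mapping, emulated locally (in a real job, Slurm sets SLURM_CPUS_PER_TASK itself and the site profile exports OMP_NUM_THREADS for you):

```shell
# Emulated locally; on the cluster SLURM_CPUS_PER_TASK comes from Slurm,
# matching --cpus-per-task=8 in the script above.
SLURM_CPUS_PER_TASK=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "OpenMP will use $OMP_NUM_THREADS threads"
```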
MPI
A sample program ./a.out that uses MPI parallelism to run 8 ranks on 8 CPUs across 2 nodes:
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
### Ask for 8 tasks (= 8 MPI ranks on 8 CPUs)
#SBATCH --ntasks=8
### Ask for 2 nodes; the ranks are distributed across them
#SBATCH --nodes=2
### Name the job
#SBATCH --job-name=MPI_JOB
### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Beginning of executable commands
$MPIEXEC $FLAGS_MPI_BATCH ./a.out
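The variables $MPIEXEC and $FLAGS_MPI_BATCH are provided by the cluster environment when an MPI module is loaded. As a rough sketch of what the last line expands to, with values that are assumptions for illustration, not the site's actual settings:

```shell
# Hypothetical values; the real ones are set by the loaded MPI module.
MPIEXEC="mpiexec"
FLAGS_MPI_BATCH="-np 8"   # in this sketch, -np mirrors --ntasks=8
echo "$MPIEXEC $FLAGS_MPI_BATCH ./a.out"
```

Using the variables instead of hard-coding mpiexec keeps the script working when the site changes its MPI launcher.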
Hybrid
This example illustrates a batch script for an MPI + OpenMP hybrid job. It uses 2 CLAIX-2018 nodes with 4 ranks in total, 2 ranks per node (1 MPI rank per socket), and 24 OpenMP threads per rank, which uses all 48 cores of each node.
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
### Ask for four tasks (which are 4 MPI ranks)
#SBATCH --ntasks=4
### Ask for 24 CPUs per task = 24 OpenMP threads per MPI rank (1 thread per core on one CLAIX-2018 socket)
#SBATCH --cpus-per-task=24
### Name the job
#SBATCH --job-name=HYBRID_JOB
### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Beginning of executable commands
### Note: the OMP_NUM_THREADS envvar is set automatically - do not overwrite!
$MPIEXEC $FLAGS_MPI_BATCH ./a.out
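The resource arithmetic described above can be checked with a quick sketch (plain shell arithmetic, nothing cluster-specific):

```shell
# Hybrid layout: 2 nodes, 2 ranks per node, 24 threads per rank.
nodes=2; ranks_per_node=2; threads_per_rank=24
echo "cores used per node: $((ranks_per_node * threads_per_rank))"
echo "total MPI ranks:     $((nodes * ranks_per_node))"
```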
Simple GPU Example
Please note: Loading a CUDA module may require loading additional modules. Check the output of module spider CUDA/<version> for information.
Run deviceQuery (from the NVIDIA SDK) on one device:
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
#SBATCH -J gpu_serial
#SBATCH -o gpu_serial.%J.log
#SBATCH --gres=gpu:1
module load CUDA
# Print some debug information
echo; export; echo; nvidia-smi; echo
$CUDA_ROOT/extras/demo_suite/deviceQuery -noprompt
MPI + GPU
To run an MPI application on the GPU nodes, you need to take special care to set the number of MPI ranks per node correctly. Typical setups are:
One process per node (ppn):
If each process uses both GPUs at the same time, e.g. via cudaSetDevice or by accepting CUDA_VISIBLE_DEVICES (set automatically by the batch system). Let ./cuda-mpi be the path to your MPI-compatible CUDA program:
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
### Setup in this script:
### - 4 nodes (c18g)
### - 1 rank per node
### - 2 GPUs per rank (= both GPUs of the node)
#SBATCH -J 4-1-2
#SBATCH -o 4-1-2.%J.log
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
module load CUDA
# Print some debug information
echo; export; echo; nvidia-smi; echo
$MPIEXEC $FLAGS_MPI_BATCH ./cuda-mpi
Two processes per node (ppn):
If each process communicates with its own single GPU, thus using both GPUs of a node (recommended setup). Let ./cuda-mpi be the path to your MPI-compatible CUDA program:
#!/usr/bin/zsh
### Maximum runtime
#SBATCH --time=00:15:00
### Setup in this script:
### - 2 nodes (c18g, default)
### - 2 ranks per node
### - 1 GPU per rank (= both GPUs of the node in total)
#SBATCH -J 2-2-1
#SBATCH -o 2-2-1.%J.log
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
module load CUDA
# Print some debug information
echo; export; echo; nvidia-smi; echo
$MPIEXEC $FLAGS_MPI_BATCH ./cuda-mpi
More than 2 processes per node:
Use this setup if you also have processes that do computation on the CPU only.
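In the recommended two-ranks-per-node setup, each rank typically selects one GPU based on its node-local rank. Slurm exposes this as SLURM_LOCALID per task; the mapping shown is an assumption about your application or wrapper script, emulated here with a loop:

```shell
# Emulated: with 2 ranks per node, Slurm sets SLURM_LOCALID to 0 or 1
# for each task on a node; a rank (or a wrapper) can use that value to
# restrict itself to "its" GPU before launching kernels.
for SLURM_LOCALID in 0 1; do
  echo "local rank $SLURM_LOCALID -> CUDA_VISIBLE_DEVICES=$SLURM_LOCALID"
done
```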
Hybrid toy example for download
Please find a hybrid Fortran toy code, a Makefile and a Slurm job script for download here. You need to adjust your project account, your working directory, and probably your job log file name in slurm.job.
Download the archive and extract it using the command tar -xf hybrid-slurm-example.tar.
Edit and adjust the file slurm.job.
Compile with make compile and submit with make submit.
Check the job log file after the job has terminated.