Testing applications
We strongly recommend testing your applications and settings (not performance tests) before submitting large production jobs. This way you can avoid wasting resources and get faster feedback on potential errors.
We offer the devel partition for quick job tests. It comprises a few nodes on which multiple jobs can run simultaneously and share the nodes. The devel partition has a shorter time limit to ensure it is only used for job tests. It is not intended for performance tests, since cores may be shared.
There are multiple ways to test your program:
- Batch: You can submit a batch job to the devel partition, configured for 8 or fewer nodes (384 CPUs/Ranks) and a problem size expected to finish in 25 minutes or less.
- Interactive: The salloc command allows you to request a job allocation similar to the sbatch command, but lets you enter the primary job node in an interactive shell. This makes interactive tests very easy and supports nodes with GPUs and other special hardware. You can also use the devel partition with 8 or fewer nodes and a problem size expected to finish in 25 minutes or less.
Batch: How to use the devel partition with a batch job
Running batch jobs on the devel partition can be as simple as adding one Slurm directive to your job script. Consider the following script based on our MPI example batch script:
#!/usr/bin/zsh
### Name the job
#SBATCH --job-name=MPI_JOB
### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Ask for eight tasks (MPI Ranks)
#SBATCH --ntasks=8
### Submit the job to the devel partition for testing
#SBATCH --partition=devel
### Ask for one node; use more nodes only if you need additional resources
#SBATCH --nodes=1
### Reserve less memory on the testing nodes unless you require more
#SBATCH --mem-per-cpu=1000M
### Beginning of executable commands
$MPIEXEC $FLAGS_MPI_BATCH ./a.out
By adding the #SBATCH --partition=devel directive you instruct Slurm to send your job to the devel partition. Once you have finished testing, you can remove the directive or replace devel with another partition (e.g. c18m, c18g).
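A minimal sketch of submitting and monitoring such a test job, assuming the script above is saved as jobscript.sh (the file name is just an example):
sbatch jobscript.sh    # submit the test job to the devel partition
squeue -u $USER        # check whether the job is pending or running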
Important notes:
- You must not set a project (i.e. #SBATCH --account=...) when submitting test jobs.
- The testing nodes are not equipped with GPUs or other accelerators.
- The maximum runtime is currently 25 minutes. Your jobs will be killed once the limit is reached.
Interactive: How to use salloc + devel partition
Overview:
- To test interactively, log in to a login node and request resources using salloc (help.itc, man) for a couple of hours.
- These resources (node, cores, memory, GPU, etc.) remain available for the requested duration, regardless of program crashes or failures.
- Once the resources are available, the console is automatically redirected (via ssh) to the head compute node of the requested resources.
- The console can then be used to load modules (compilers, libraries, programs), define environment variables, change directories, etc.
- If a batch file is already available, the environment can be configured by running all the user commands (excluding #SBATCH) that would be in the batch file. Alternatively, you can make the file executable and run it directly; the #SBATCH directives will be treated as comments and therefore be ignored (see the sketch after this list).
- If a failure occurs, corrections to the environment can be made and the program can be re-run.
- Once the program runs as intended, the salloc allocation can be stopped by typing exit in the console, or by cancelling the Slurm job ID.
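A minimal sketch of reusing an existing job script inside the salloc session, as mentioned in the list above, assuming the script is named jobscript.sh (a placeholder name):
chmod +x jobscript.sh    # make the existing job script executable
./jobscript.sh           # run it; the #SBATCH lines are treated as comments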
To use salloc, the resources (nodes, cores/CPUs, memory, etc.) have to be passed directly as arguments. The command works similarly to sbatch, in that it accepts all the same arguments. The HPC program itself is not given to salloc, but is rather run as a console command once salloc returns with the resources.
Note: If you pass salloc a command to execute, you have to prepend srun to the command. For instance:
salloc --nodes 2 --ntasks-per-node 2 srun -n 4 ./mycommand
One example of requesting resources:
salloc -p <partition> -A <proj_account> --time=hh:mm:ss -N <num_nodes> -n <num_cpus> --ntasks-per-node=<num_cpus_per_node> --mem-per-cpu=<mem_per_cpu> --gres=gpu:<type>:<num_gpus>
You may omit or add certain options according to your own needs.
For testing on the devel partition, you are limited to 25 minutes (--time=00:25:00) and must not specify --account.
Please note: The following is just an example for MPI. If your application does not use MPI, the $MPIEXEC part is not needed. Please adapt this to your own application based on its documentation.
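For a non-MPI application, a minimal sketch of the run step inside the interactive session could be as simple as starting the program directly on the head compute node (./my_app and its argument are placeholders):
./my_app 42    # serial or multithreaded program, no $MPIEXEC needed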
Log in to a login node. From the login node:
salloc -p devel --time=00:25:00 -N 2 -n 8 --ntasks-per-node=4 --mem-per-cpu=3900M
which should give output similar to:
salloc: Pending job allocation 1774242
salloc: job 1774242 queued and waiting for resources
salloc: job 1774242 has been allocated resources
salloc: Granted job allocation 1774242
After some waiting time, you will automatically be redirected to the head compute node. From the head compute node:
# Load modules you need for your environment
module load ...
# Set needed variables
MY_ARG=79
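# Run the program with MPI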
$MPIEXEC ./a.out $MY_ARG
If it causes an error:
a.out line 3: Segmentation Fault, 79 is out of range.
You are still connected and can try again:
MY_ARG=42
$MPIEXEC ./a.out $MY_ARG
Which may eventually be successful:
Simulation successful!
To then exit the allocation of resources:
exit
This will release the requested resources and automatically redirect you back to the login node:
salloc: Relinquishing job allocation 1774242
salloc: Job allocation 1774242 has been revoked.
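If the interactive shell is no longer usable, you can instead cancel the allocation from another login-node shell, as mentioned in the overview above (1774242 is the job ID from this example):
squeue -u $USER    # look up the job ID of the salloc allocation
scancel 1774242    # cancel the allocation and release its resources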
Resources and waiting times
Waiting times for interactive salloc sessions still apply, just like with sbatch.
To test applications interactively with salloc, the requested resources need to be small enough that the waiting time is low, but big enough that correctness can be verified.
Please use a small problem size / input for your program, so that the runtime is not too long, e.g. a smaller matrix, fewer iterations, fewer timesteps, fewer files, etc.
We recommend the following maximum resource requests (the only drawback of requesting more is the waiting time):
| Type | Nodes | Cores | Memory per core | Memory Total | Runtime | Notes (use more only if needed) |
|---|---|---|---|---|---|---|
| MPI | 2 | 8 (4+4) | 2GB | 16GB | devel: 25m / other: 1h | 4 cores per node |
| Multithreaded (MT) | 1 | 8 | 2GB | 16GB | devel: 25m / other: 1h | no more than 1 node |
| Hybrid MPI + MT | 2 | 16 (8+8) | 1GB | 16GB | devel: 25m / other: 1h | 8 cores per node |
| GPU | 1 | 1 | 16GB | 16GB | | might need more memory |
| Multi-Process (not MPI) | 1 | 4 | 4GB | 16GB | devel: 25m / other: 1h | use same node if possible |
| Single-Threaded / Other | 1 | 1 / 2 | 8GB | 16GB | devel: 25m / other: 1h | |
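As an illustration, the MPI row of the table translates roughly into the following request (a sketch; adapt the partition, sizes, and program to your own case):
salloc -p devel --time=00:25:00 -N 2 -n 8 --ntasks-per-node=4 --mem-per-cpu=2G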