Testing applications
We strongly recommend you test your applications and settings using salloc before submitting large production jobs with sbatch. This way you can avoid wasting resources and get faster feedback on potential errors.
We offer the devel partition for quick job tests. It consists of a few nodes that allow multiple jobs to run simultaneously and share the nodes. The devel partition has a shorter time limit to ensure it is only used for job tests. The devel partition is not intended for performance tests, since cores may be shared.
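If you want to check the current limits of the devel partition yourself, sinfo can print them, for example (the exact output columns may differ slightly depending on the Slurm version):
# show partition name, time limit, number of nodes and node list
sinfo -p devel -o "%P %l %D %N"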
There are multiple ways to test your program:
Interactive (recommended)
The salloc command allows you to request a job allocation (CPU + GPU resources) similar to the sbatch command, but it automatically redirects your console to the primary job node in an interactive shell. This makes interactive tests very easy. You can also use the devel partition with 2 or fewer nodes and a problem size expected to finish in 60 minutes or less.
Batch
You can submit a batch job to the devel partition that is configured for 2 or fewer nodes with a problem size expected to finish in 60 minutes or less.
Interactive: How to use salloc
Overview:
- To test a job interactively, log into a login node and request resources using salloc (help.itc, man) for a couple of hours on normal partitions or 60 minutes at most on devel.
- These resources (node, cores, memory, GPU, etc.) will be available for the duration requested, regardless of program crashes or failures.
- Once the resources are available, the console is automatically redirected (ssh) to the head compute node of the requested resources.
- The console can then be used to load modules (compilers, libraries, programs), define environment variables, change directories, etc.
- If a batch file is already available, the environment can be configured by running all the user commands (excluding the #SBATCH directives) that would be within the batch file. Alternatively, you can make the file executable and run it directly; the #SBATCH directives will be treated as comments and therefore be ignored (see the sketch after this list).
- If a failure occurs, corrections to the environment can be made and the program can be re-run.
- Once the program runs as intended, the salloc allocation can be stopped by typing exit in the console or by cancelling the Slurm job ID.
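For example, assuming you already have a batch script named testjob.sh (testjob.sh and my_program are placeholder names for this sketch), you can reuse it inside the interactive shell in one of two ways:
# Option 1: run the commands from the script by hand (skip the #SBATCH lines)
module load ...
srun ./my_program
# Option 2: make the script executable and run it directly;
# the #SBATCH directives are plain comments here and are ignored
chmod +x testjob.sh
./testjob.sh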
To use salloc, the resources (nodes, cores/CPUs, memory, etc.) have to be passed directly as arguments. The command works similarly to sbatch, in that it accepts all the same arguments. The HPC program itself is not given to salloc, but is rather run as a console command once salloc returns with the resources.
One example of requesting resources:
salloc -p <partition> -A <proj_account> --time=hh:mm:ss -N <num_nodes> -n <num_cpus> --ntasks-per-node=<num_cpus_per_node> --mem-per-cpu=<mem_per_cpu> --gres=gpu:<type>:<num_gpus>
You may omit or add certain options according to your own needs.
For testing on the devel partition, you are limited to 60 minutes and must not specify --account / -A.
The following example shows what the process looks like. Please adapt this to your own application based on its own documentation.
Log in to a login node. From the login node:
salloc -p devel --time=01:00:00 -N 2 -n 8 --ntasks-per-node=4 --mem-per-cpu=3900M
which should give output similar to:
salloc: Pending job allocation 1774242
salloc: job 1774242 queued and waiting for resources
salloc: job 1774242 has been allocated resources
salloc: Granted job allocation 1774242
After waiting some time, the console is automatically redirected to the head compute node. From the head compute node:
# Load modules you need for your environment
module load ...
# Set needed variables
MY_ARG="--wrong-argument"
srun hostname $MY_ARG
If it causes an error:
hostname: unrecognized option '--wrong-argument'
You are still connected and can try again:
MY_ARG="--all-ip-addresses"
srun hostname $MY_ARG
which will now succeed:
123.12.34.56.78
To then exit the allocation of resources:
exit
Will release the requested resources and automatically redirect back to the login node:
salloc: Relinquishing job allocation 1774242
salloc: Job allocation 1774242 has been revoked.
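Alternatively, if you prefer to cancel the Slurm job ID instead of typing exit, you can release the allocation from any shell (using the job ID from the example above):
scancel 1774242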
Note
If you pass salloc a command to execute, you have to prepend srun to the command. For instance:
salloc --nodes 2 --ntasks-per-node 2 srun -n 4 hostname
Custom and modified shell environments (like automatically loaded conda envs) that change PATHs and other default settings will probably not work with salloc and won't be usable within sbatch batch jobs!
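For example, if conda automatically activates its base environment in your ~/.zshrc, you can switch that off before testing (a minimal sketch, assuming a default conda installation):
# disable automatic activation of the base environment in new shells
conda config --set auto_activate_base false
# or deactivate it only in the current shell
conda deactivate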
Batch: How to use the devel partition with a batch job
Running batch jobs on the devel partition can be as simple as adding one Slurm directive to your job script:
#!/usr/bin/zsh
### Name the job
#SBATCH --job-name=MY_JOB
### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt
### Ask for eight tasks (MPI Ranks)
#SBATCH --ntasks=8
### Submit the job to the devel partition for testing
#SBATCH --partition=devel
### Ask for one node, use several nodes in case you need additional resources
#SBATCH --nodes=1
### You should reserve less memory on the testing nodes unless you require more
#SBATCH --mem-per-cpu=1000M
### Beginning of executable commands
srun hostname
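Assuming the script above is saved as testjob.sh (a placeholder name), you submit it and check its state as usual:
sbatch testjob.sh
squeue -u $USER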
By adding the #SBATCH --partition directive you instruct Slurm to send your job to the devel partition. Once you have finished your testing, you can remove it or replace the devel partition with another one (e.g. c23ms, c23g).
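For example, after testing you could switch the script to a regular partition like this (c23ms is just one of the examples above, and the project account is a placeholder you have to adapt):
### Submit the production run to a regular partition instead of devel
#SBATCH --partition=c23ms
### On regular partitions you usually also set your project (not allowed on devel)
#SBATCH --account=<proj_account>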
Note
- You must not set a project (i.e. #SBATCH --account=...) when submitting test jobs.
- The testing nodes are not equipped with GPUs or other accelerators.
- The maximum runtime as of now is 60 minutes. Your jobs will be killed once the limit is reached.
Custom and modified shell environments (like automatically loaded conda envs) that change PATHs and other default settings will probably not work with salloc and won't be usable within sbatch batch jobs!
Resources and waiting times
Waiting times for interactive salloc sessions still apply, just like with sbatch.
To test applications interactively with salloc, the requested resources need to be small enough so that the waiting time is low, but big enough so that correctness can still be verified.
Please use a small problem size / input for your program, so that the runtime is not too long! E.g. a smaller matrix, fewer iterations, fewer timesteps, fewer files, etc.
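As an illustration only, for a hypothetical solver that takes its problem size and iteration count on the command line, a test run inside the allocation could look like this (program name and flags are made up, adapt them to your application's documentation):
srun ./my_solver --grid-size 128 --iterations 10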
We recommend the following maximum resource requests (the only drawback of requesting more is a longer waiting time):
| Program type | Nodes | Cores | Memory per core | Memory total | Runtime | Notes (use more only if needed) |
|---|---|---|---|---|---|---|
| MPI | 2 | 8 (4+4) | 2GB | 16GB | 1h | 4 cores per node |
| Multithreaded (MT) | 1 | 8 | 2GB | 16GB | 1h | no more than 1 node |
| Hybrid MPI + MT | 2 | 16 (8+8) | 1GB | 16GB | 1h | 8 cores per node |
| GPU | 1 | 24 | 16GB | 16GB | 1h | might need more memory |
| Multi-process (not MPI) | 1 | 4 | 4GB | 16GB | 1h | use same node if possible |
| Single-threaded / other | 1 | 1 / 2 | 8GB | 16GB | 1h | |
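As an example, an interactive test matching the hybrid MPI + MT row above could be requested like this (partition and project are placeholders, and splitting the 8 cores per node into 2 ranks with 4 threads each is just one possible mapping):
salloc -p <partition> -A <proj_account> --time=01:00:00 -N 2 --ntasks-per-node=2 --cpus-per-task=4 --mem-per-cpu=1G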