We strongly recommend testing your applications and settings before submitting large production jobs (this does not apply to performance tests). This way you can avoid wasting resources and get faster feedback on potential errors.
We offer the devel partition for quick job tests. It comprises a few nodes that allow multiple jobs to run simultaneously and share the nodes. The devel partition has a shorter time limit to ensure it is only used for job tests. It is not intended for performance tests, since cores may be shared.
There are multiple ways to test your program:
- You can submit a batch job to the devel partition, configured for 8 or fewer nodes (384 CPUs/ranks) with a problem size expected to finish in 25 minutes or less.
- The `salloc` command allows you to request a job allocation similar to the `sbatch` command, but places you in an interactive shell on the primary job node. This makes interactive tests very easy and supports nodes with GPUs and other special hardware. You can also use the devel partition with 8 or fewer nodes and a problem size expected to finish in 25 minutes or less.
Batch: How to use the devel partition with a batch job
Running batch jobs on the devel partition can be as simple as adding one Slurm directive to your job script. Consider the following script, based on our MPI example batch script:
```shell
#!/usr/bin/zsh

### Name the job
#SBATCH --job-name=MPI_JOB

### Declare the merged STDOUT/STDERR file
#SBATCH --output=output.%J.txt

### Ask for eight tasks (MPI ranks)
#SBATCH --ntasks=8

### Submit the job to the devel partition for testing
#SBATCH --partition=devel

### Ask for one node; use several nodes in case you need additional resources
#SBATCH --nodes=1

### You should reserve less memory on the testing nodes unless you require more
#SBATCH --mem-per-cpu=1000M

### Beginning of executable commands
$MPIEXEC $FLAGS_MPI_BATCH ./a.out
```
By adding the `#SBATCH --partition` directive you instruct Slurm to send your job to the devel partition. Once you have finished your testing, you can remove the directive or replace devel with another partition (e.g. c18m, c18g).
- You must not set a project (i.e. `#SBATCH --account=...`) when submitting test jobs.
- The testing nodes are not equipped with GPUs or other accelerators.
- The maximum runtime as of now is 25 minutes. Your jobs will be killed once the limit is reached.
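Submitting the script above to the devel partition can then be sketched as follows; `jobscript.sh` is a hypothetical file name for the example script:

```shell
# Submit the test job; the script already contains #SBATCH --partition=devel
sbatch jobscript.sh

# Watch the job's state in the queue
squeue -u $USER

# Inspect the merged STDOUT/STDERR file once the job has finished
# (%J in the script expands to the job ID)
cat output.*.txt
```

These commands only work on the cluster itself; `sbatch` prints the assigned job ID on submission.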
Interactive: How to use salloc + devel partition
- To test interactively, log into a login node and request resources using `salloc` (help.itc, man) for a couple of hours.
- These resources (node, cores, memory, GPU, etc.) will be available for the requested duration, regardless of program crashes or failures.
- Once the resources are available, the console is automatically redirected (via ssh) to the head compute node of the requested resources.
- The console can then be used to load modules (compilers, libraries, programs), define environment variables, change directories, etc.
- If a batch file is already available, the environment can be configured by running all of the user commands (excluding the #SBATCH directives) from that batch file. Alternatively, you can make the file executable and run it directly; the #SBATCH directives will be treated as comments and therefore ignored.
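The point above can be demonstrated with a toy stand-in for a real job script (the file name `testjob.sh` and its contents are made up for illustration); outside of Slurm, the #SBATCH lines are plain shell comments:

```shell
# Create a minimal job-script-like file (hypothetical example)
cat > testjob.sh <<'EOF'
#!/usr/bin/env bash
#SBATCH --ntasks=8
#SBATCH --partition=devel
echo "environment configured"
EOF

# Make it executable and run it directly, as in an interactive session
chmod +x testjob.sh
./testjob.sh
# prints: environment configured
```

The script runs normally and only the executable commands take effect; the #SBATCH directives matter only when the file is submitted via sbatch.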
- If a failure occurs, corrections to the environment can be made and the program can be re-run.
- Once the program runs as intended, the `salloc` allocation can be stopped by typing `exit` in the console, or by canceling the Slurm job ID.
With `salloc`, the resources (nodes, cores/CPUs, memory, etc.) have to be passed directly as arguments. The command works similarly to `sbatch`, in that it accepts all the same arguments.
The HPC program itself is not given to `salloc`, but is rather run as a console command once `salloc` returns with the resources.
One example to request resources:
salloc -p <partition> -A <proj_account> --time=hh:mm:ss -N <num_nodes> -n <num_cpus> --ntasks-per-node=<num_cpus_per_node> --mem-per-cpu=<mem_per_cpu> --gres=gpu:<type>:<num_gpus>
You may omit or add certain options according to your own needs.
For testing on the devel partition, you are limited to a maximum runtime of 00:25:00 and must not specify a project account (`-A` / `#SBATCH --account`).
Please note: the following is just an example for MPI! If your application does not use MPI, the `$MPIEXEC` part is not needed. Please adapt this to your own application based on its documentation.
Log in to a login node. From the login node:
```shell
salloc -p devel --time=00:25:00 -N 2 -n 8 --ntasks-per-node=4 --mem-per-cpu=3900M
```
which should give output similar to:
```
salloc: Pending job allocation 1774242
salloc: job 1774242 queued and waiting for resources
salloc: job 1774242 has been allocated resources
salloc: Granted job allocation 1774242
```
It will automatically redirect to the head compute node after waiting some time. From the head compute node:
```shell
# Load modules you need for your environment
module load ...

# Set needed variables
MY_ARG=79

$MPIEXEC ./a.out $MY_ARG
```
If it causes an error:
```
a.out line 3: Segmentation Fault, 79 is out of range.
```
You are still connected and can try again:
```shell
MY_ARG=42 $MPIEXEC ./a.out $MY_ARG
```
Which may eventually be successful:
To then exit the allocation of resources, type `exit` in the console. This will release the requested resources and automatically redirect back to the login node:
```
salloc: Relinquishing job allocation 1774242
salloc: Job allocation 1774242 has been revoked.
```
Resources and waiting times
Waiting times for interactive `salloc` sessions still apply, just like with `sbatch`.

To test applications interactively with `salloc`, the requested resources need to be small enough that the waiting time is low, but big enough that correctness can be asserted.
Please use a small problem size / input for your program, so that the runtime is not too long, e.g. a smaller matrix, fewer iterations, fewer timesteps, fewer files, etc.
We recommend the following resource requests (the only drawback of requesting more is the waiting time):
| Type | Nodes | Cores | Memory per core | Memory total | Runtime | Use more only IF needed |
| --- | --- | --- | --- | --- | --- | --- |
| MPI | 2 | 8 (4+4) | 2GB | 16GB | devel: 25m / other: 1h | 4 cores per node |
| Multithreaded (MT) | 1 | 8 | 2GB | 16GB | devel: 25m / other: 1h | no more than 1 node |
| Hybrid MPI + MT | 2 | 16 (8+8) | 1GB | 16GB | devel: 25m / other: 1h | 8 cores per node |
| GPU | 1 | 1 | 16GB | 16GB | | might need more memory |
| Multi-process (not MPI) | 1 | 4 | 4GB | 16GB | devel: 25m / other: 1h | use same node if possible |
| Single-threaded / other | 1 | 1 / 2 | 8GB | 16GB | devel: 25m / other: 1h | |
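As a sketch, the MPI row of the table corresponds to a `salloc` request like the following (to be run on a login node of the cluster; adapt the values to your own case):

```shell
# 2 nodes, 8 MPI ranks (4 per node), 2GB per core, devel time limit
salloc -p devel --time=00:25:00 -N 2 -n 8 --ntasks-per-node=4 --mem-per-cpu=2G
```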