Some SLURM Commands
- Job Submission
- Job script skeleton
- Job Cancellation
- Job Monitoring
- Job Accounting
- Partition State
- Basic Job Parameters
- Submitting with a project
- Submitting a GPU job
- Submitting to specific partition
- BeeOND (BeeGFS On-Demand)
With SLURM you need a batch job script. We strongly recommend that you do not pass job options on the command line but write them into a batch script instead. Support is much more difficult if we do not have a batch script, so it may take considerably longer to resolve an issue.
A job using a script called "jobscript.sh" can be submitted like this:
$ sbatch < jobscript.sh
You can use this simple skeleton as a start for your jobscript.
### #SBATCH directives need to be in the first part of the jobscript
### your code goes here, the second part of the jobscript
### !!! DON'T MIX THESE PARTS !!!
Please note that the aforementioned shebang (the very first line of the jobscript) is the only supported shebang. If you have a problem with your script that cannot be reproduced with zsh, do not expect a thorough analysis of the problem.
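As a minimal sketch of how the two parts of the skeleton might be filled in: the job name, log file name and time limit below are illustrative assumptions, and the supported shebang line is deliberately not reproduced because it is site-specific.

```shell
### Line 1 of a real jobscript must be the supported zsh shebang (omitted in this sketch).

### First part: #SBATCH directives only
### (hypothetical job name and log file name, five minutes of wall clock time)
#SBATCH --job-name=skeleton_test
#SBATCH --output=skeleton_test.%J.log
#SBATCH --time=00:05:00

### Second part: your code
### SLURM_JOB_ID is set by SLURM inside a job; the fallback is used anywhere else.
job_label="job ${SLURM_JOB_ID:-outside-SLURM} on $(uname -n)"
echo "Running ${job_label}"
```

All directives sit in one block at the top, and the first real command appears only after the last #SBATCH line.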
$ scancel <jobid>
squeue - display running and waiting jobs. You can see a list of your own queued and running jobs like this:
$ squeue -u $USER
To report the expected start time and the resources to be allocated for pending jobs, add --start:
$ squeue -u $USER --start
Please note that this start time is not guaranteed and may change due to high-priority jobs or job backfilling.
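For a quick overview it can help to count your jobs per state. The following is a minimal sketch, assuming the standard squeue options -h (no header) and -o "%T" (print only the job state) behave as documented on your installation; count_states is a hypothetical helper name, not a SLURM command.

```shell
# Read one job state per line on stdin and print "<STATE> <count>" for each state.
count_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Only query SLURM if squeue is actually available on this machine.
if command -v squeue >/dev/null 2>&1; then
  squeue -h -u "$USER" -o "%T" | count_states
fi
```

The helper only post-processes text, so it can also be tried on saved squeue output.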
spart - shows user specific partition information with core count of available nodes and pending jobs. Please find detailed information here.
QUEUE      STA    FREE   TOTAL  RESORC   OTHER    FREE   TOTAL || DEF MEM   CORES    NODE
PARTITION  TUS   CORES   CORES  PENDNG  PENDNG   NODES   NODES ||  GB/CPU   /NODE  MEM-GB
c18m              5359   59520     242     873       1    1240 ||       3      48     187
c18m_low          5359   59520    2432       0       1    1240 ||       3      48     187
c18g               738    2592    1416    7592       2      54 ||       3      48     187
c18g_low           738    2592       0       0       2      54 ||       3      48     187
c16m              9240   14592       0    1200     385     608 ||       5      24     124
c16m_low          9240   14592       0       0     385     608 ||       5      24     124
c16s              1152    1152       0       0       8       8 ||       7     144    1020
c16s_low          1152    1152       0       0       8       8 ||       7     144    1020
c16g               160     216       0       0       6       9 ||       5      24     124
c16g_low           160     216       0       0       6       9 ||       5      24     124
ind               1952    1952       0       0      44      44 ||       1      40      61
ih               10414   14604    5313    1237     207     414 ||       1       1       0

                YOUR  PEND  PEND  YOUR   DEFAULT   MAXIMUM
                 RUN   RES  OTHR  TOTL  JOB-TIME  JOB-TIME
COMMON VALUES:     0     0     0     0   15 mins   30 days
r_wlm_usage - the SLURM version of r_batch_usage. To see old LSF usage, please use r_batch_usage.
The following is an example output for the pseudo project rwth1234, produced with "r_wlm_usage -p rwth1234 -q":
Start of Accounting Period: 01.05.2018
End of Accounting Period: 01.05.2019
State of project: active
Quota monthly (core-h): 200000
Total quota (core-h): 2.400 Mio
Remaining core-h of prev. month: 0
Consumed core-h current month: 0
Consumable core-h (%): 200
Consumable core-h: 200000
Default partition: c18m
Allowed partitions: c18m,c18g
Max. allowed wallclocktime: 1.0 days
Please also see SLURM Accounting on how jobs are accounted with SLURM.
$ sinfo -s
This shows you the available partitions, e.g.
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
c18m up 5-00:00:00 289/398/345/1032 ncm[0001-1032]
c18g up 5-00:00:00 3/45/0/48 ncg[01-48]
You see the partition name, the state of the partition, the maximum wallclock time for a job in the partition, the state of the nodes and the node list. The node states are (in contrast to the manpage) (A)llocated/(I)dle/(O)ther/(T)otal. So look at the second field of the node states (idle) if you are looking for free hosts.
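Picking out the idle count by eye gets tedious with many partitions. The following is a minimal sketch that extracts the second field of the A/I/O/T column; idle_nodes is a hypothetical helper name, and the sketch assumes the column layout shown in the example output above.

```shell
# Read "sinfo -s" output on stdin and print "<partition> <idle nodes>" per partition.
# Lines whose fourth column is not of the form A/I/O/T (e.g. the header) are skipped.
idle_nodes() {
  awk '$4 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
         split($4, n, "/")   # n[1]=allocated, n[2]=idle, n[3]=other, n[4]=total
         print $1, n[2]
       }'
}

# Only query SLURM if sinfo is actually available on this machine.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -s | idle_nodes
fi
```

For the two example lines above this yields 398 idle nodes for c18m and 45 for c18g.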
short and long parameters
Each option can be specified using either a short or a long notation.
Please note that the short form expects a blank between the option and its value, while the long form expects an '='.
The SLURM parser stops processing the #SBATCH magic cookies as soon as it hits a line with a 'normal' command, i.e. the first line that is neither blank nor a comment. #SBATCH lines after that point are ignored.
Should you experience batch arguments being ignored by SLURM, please also double-check for spelling mistakes in both your parameters and arguments.
Expressions like "--accnt=rwth0000" (a misspelled option) or "--output=namewith blanks.%J" (a file name containing blanks) cause the SLURM parser to terminate and ignore all subsequent #SBATCH statements.
Be advised that indentation of #SBATCH parameters is not supported (see above).
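Two of the pitfalls above can be checked mechanically before submitting. The following is a minimal sketch, not an official tool: check_sbatch_lines is a hypothetical helper that flags indented #SBATCH lines and #SBATCH lines that appear after the first command. It assumes that blank lines and comment lines do not end the directive part, and it does not detect misspelled options or blanks in arguments.

```shell
# Print a warning for every misplaced #SBATCH line in the jobscript given as $1.
# Returns 0 if nothing suspicious was found, 1 otherwise.
check_sbatch_lines() {
  awk '
    /^[ \t]+#SBATCH/ {
      printf "line %d: indented #SBATCH is not supported\n", NR; bad = 1; next
    }
    /^#SBATCH/ {
      if (seen_command) {
        printf "line %d: #SBATCH after the first command is ignored\n", NR; bad = 1
      }
      next
    }
    /^[ \t]*(#|$)/ { next }   # comments and blank lines
    { seen_command = 1 }      # anything else counts as a normal command
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage (jobscript.sh is the script you are about to submit):
#   check_sbatch_lines jobscript.sh && sbatch < jobscript.sh
```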
-c, --cpus-per-task=<numcpus>    # for OpenMP/hybrid jobs
-n, --ntasks=<numtasks>          # for processes/MPI
--ntasks-per-node=<numtasks>     # LSF equivalent: span[ptile=<numtasks>]
-N, --nodes=<numnodes>           # LSF equivalent: span[hosts=<numnodes>]
--mem-per-cpu=<size>             # memory needed per allocated CPU; more CPUs can be allocated than tasks were requested (hybrid jobs!)
# PLEASE DO NOT USE THE FOLLOWING OPTION:
--mem=<size>                     # memory needed per NODE
-o, --output=<filename>
# -e, --error=<filename>         # we do not recommend using this; analyzing problems is easier if STDERR is merged into STDOUT
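To show how these options combine, here is a minimal sketch of the directive part of a hybrid MPI/OpenMP job. All numbers and the log file name are illustrative assumptions; SLURM_CPUS_PER_TASK is the environment variable SLURM sets inside the job when --cpus-per-task is given.

```shell
### 4 MPI ranks, 2 per node, 12 OpenMP threads per rank (illustrative numbers)
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
### 3 GB per allocated CPU, i.e. per thread in this hybrid layout
#SBATCH --mem-per-cpu=3G
### STDOUT and STDERR both end up in this file because --error is not set
#SBATCH --output=hybrid.%J.log

### Short-form equivalent of the first directive (blank instead of '='):
###   #SBATCH -n 4

# Use as many OpenMP threads as CPUs were requested per task (1 outside SLURM).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```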
Wall Clock Limit
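The wall clock limit is requested with the standard sbatch option -t / --time. One accepted notation is d-hh:mm:ss, the same notation sinfo uses for the partition time limit. The following is a minimal sketch, assuming a 90-minute job; slurm_time is a hypothetical helper that converts seconds into this notation.

```shell
### request 1 hour 30 minutes of wall clock time
#SBATCH --time=0-01:30:00

# Convert a number of seconds into the d-hh:mm:ss notation.
slurm_time() {
  printf '%d-%02d:%02d:%02d\n' \
    $(( $1 / 86400 )) $(( $1 % 86400 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

slurm_time 5400   # 90 minutes expressed in seconds
```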
Advanced job parameters
Submit your job for project <projectname>. This additionally chooses the default partition for you.
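A minimal sketch of the corresponding directive, assuming the standard sbatch option -A / --account is what selects the project on this system and using the pseudo project rwth1234 from the accounting example as a stand-in:

```shell
### charge the job to the (pseudo) project rwth1234
#SBATCH --account=rwth1234

### short form:
###   #SBATCH -A rwth1234
```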
# request one gpu per node
The right partition will be chosen for you; you do not need to request one.
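A minimal sketch of a GPU request, assuming the GPUs are exposed through the standard sbatch option --gres under the generic resource name "gpu" (the resource name is an assumption about the site configuration):

```shell
### request one GPU on each node of the job
#SBATCH --gres=gpu:1
```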
In general it is neither required nor recommended to submit a job to a specific partition, since the selection is driven by the project and/or the specified job requirements. However, in some cases (e.g., performance analysis of specific hardware) it might be relevant.
# select a partition
A list of partitions can be found here, or you can use the sinfo or r_wlm_usage -p <proj> -q commands as documented above.
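A minimal sketch of the corresponding directive, using the standard sbatch option -p / --partition and the partition c18m from the examples above as a stand-in:

```shell
### run the job in the c18m partition
#SBATCH --partition=c18m

### short form:
###   #SBATCH -p c18m
```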
All our compute nodes have local SSDs. In contrast to network file systems like $HOME, $WORK or the parallel file system $HPCWORK, these SSDs are local devices, so a job that uses them might benefit in terms of performance. In particular, jobs that use many small files might perform much better than on $HPCWORK.
In order to make use of these local SSDs in multi-node jobs, you can use BeeGFS On-Demand (BeeOND) by adding the following line to your submission script:
#SBATCH --beeond
This will set up a shared, temporary (!) BeeGFS file system across all nodes allocated for your job, which means the data can be accessed by each process (e.g., MPI rank) from any node involved. You can access the file system using the path stored in the environment variable $BEEOND. Please note: It is a temporary file system which only lives as long as your batch job is running. All data will be deleted in the epilog of the batch job. Thus, you have to copy back all relevant data to $HOME, $WORK or $HPCWORK in your batch job.
Since BeeGFS is a parallel file system you can influence the striping of the file, where a stripe of 1 means to keep local to the current node and a stripe of n means a distribution of the file to n nodes (i.e., n SSDs). The distribution is done with a specified chunk size. For instance, you can change the striping to 16 and the chunk size to 1 MB by using the following command:
$ beegfs-ctl --setpattern --numtargets=16 --chunksize=1m
The numtargets value must not be higher than the number of compute nodes involved.
A typical workflow for your job might look as follows:
1. Request a BeeOND file system by adding "#SBATCH --beeond" to your batch script.
2. Copy the required raw data to $BEEOND.
3. Change the directory to the corresponding directory (e.g., $BEEOND/yourdata).
4. Start the preprocessing of your job (e.g., the domain decomposition).
5. Start your application.
6. Copy back all relevant (and only the relevant!) data/results. This is very important, because you cannot access the data after the job has finished or from any other node (e.g. a dialog system).
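The six steps above can be sketched as a jobscript fragment. This is a minimal sketch under stated assumptions: beeond_workflow is a hypothetical helper, the directories $WORK/raw and $WORK/results and the file raw.txt are made-up names, and sort and uniq merely stand in for your real preprocessing and application.

```shell
### Step 1: request a BeeOND file system
#SBATCH --beeond

# Arguments: <raw data dir> <result dir> <scratch dir, normally $BEEOND>
# The body runs in a subshell so that the cd does not affect the rest of the script.
beeond_workflow() (
  raw_dir=$1; result_dir=$2; scratch=$3
  mkdir -p "$scratch/yourdata" "$result_dir" || exit 1
  cp -r "$raw_dir"/. "$scratch/yourdata/" || exit 1   # step 2: copy raw data
  cd "$scratch/yourdata" || exit 1                    # step 3: change directory
  sort raw.txt > sorted.txt || exit 1                 # step 4: preprocessing (stand-in)
  uniq -c sorted.txt > result.txt || exit 1           # step 5: application (stand-in)
  cp result.txt "$result_dir/"                        # step 6: copy back only the results
)

# $BEEOND is only set inside a job that requested BeeOND.
if [ -n "${BEEOND:-}" ]; then
  beeond_workflow "$WORK/raw" "$WORK/results" "$BEEOND"
fi
```

Only result.txt is copied back; the intermediate sorted.txt stays on the temporary file system and is deleted with it.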