- Job Submission
- Job script skeleton
- Job Cancellation
- Job Monitoring
- Job Efficiency
- Job Accounting
- Partition State
- Basic Job Parameters
- Submitting with a project
- Submitting a GPU job
- Submitting to specific partition
- BeeOND (BeeGFS On-Demand)
We strongly recommend using a batch script and submitting it with
sbatch BATCH_SCRIPT. Without a batch script to debug, support is much more difficult and time-consuming, so issues take longer to solve.
A batch job using a script called "jobscript.sh" that runs your program can be submitted like this:
sbatch jobscript.sh
#!/usr/bin/zsh

### SBATCH Section
# directives need to be in the beginning of the jobscript

### Your Program Section
# goes here, the second part of the jobscript
!!! DON'T MIX THOSE TWO SECTIONS!!!
#SBATCH directives must not be mixed with your own code or programs! The shebang shown above (first line of the jobscript) is the only supported shebang. If you have a problem with your script that cannot be reproduced with zsh, do not expect a thorough analysis of the problem.
After submitting your job script, you will obtain a job id from Slurm. This job id is important to identify your batch job, cancel it, and for support tickets.
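Since the job id is needed later, it can be convenient to capture it at submission time. The following is a minimal sketch, assuming sbatch prints its default confirmation message of the form "Submitted batch job <id>"; the helper name extract_job_id and the file name submitted_jobs.txt are hypothetical.

```shell
# Print the job id contained in sbatch's confirmation message.
# Assumption: the message has the default form "Submitted batch job <id>".
extract_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Usage on a cluster (not executed here):
#   job_id=$(sbatch jobscript.sh | extract_job_id)
#   echo "$job_id" >> submitted_jobs.txt
```

Slurm's sbatch also has a --parsable option that prints only the job id (followed by the cluster name, if any), which avoids parsing the message altogether.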
To cancel your batch job at any time, you can use the command
scancel <jobid>
Please Note: The commands
spart must not be used for high-frequency monitoring of the batch system state. If you want to run these commands periodically, please execute them at most once every 30 seconds.
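If you do need periodic monitoring, one way to respect the 30-second limit is to enforce it in the polling loop itself. This is only a sketch: safe_interval and watch_my_jobs are hypothetical helper names, and the loop calls squeue exactly once per iteration.

```shell
# Smallest polling interval permitted by the note above, in seconds.
MIN_INTERVAL=30

# Print the interval that will actually be used: the requested value,
# raised to MIN_INTERVAL if it is smaller.
safe_interval() {
    if [ "$1" -lt "$MIN_INTERVAL" ]; then
        echo "$MIN_INTERVAL"
    else
        echo "$1"
    fi
}

# Print the user's jobs every <interval> seconds until none are left.
watch_my_jobs() {
    interval=$(safe_interval "${1:-60}")
    while :; do
        jobs=$(squeue -h -u "$USER")   # -h suppresses the header line
        [ -z "$jobs" ] && break
        printf '%s\n' "$jobs"
        sleep "$interval"
    done
}
```

Calling watch_my_jobs 5 would still poll only every 30 seconds, because the requested interval is raised to the minimum.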
squeue displays your running and pending jobs. You can see a list of your own queued and running jobs like this:
squeue -u $USER
Use
squeue -u $USER --start
to show the expected start time and the resources likely to be allocated to your pending jobs. This start time is only an estimate; it is not guaranteed and may change due to higher-priority jobs or job backfilling.
spart - shows user specific partition information with core count of available nodes and pending jobs. Please find detailed information here.
spart
QUEU         STA   FREE  TOTAL RESORC  OTHER   FREE  TOTAL || DEFMEM  CORES   NODE
PARTITIO     TUS  CORES  CORES PENDNG PENDNG  NODES  NODES || GB/CPU  /NODE MEM-GB
c18m               2737  59520   1104   2574     20   1240 ||      3     48    187
c18m_low           2737  59520    336     96     20   1240 ||      3     48    187
c18m_verylow       2737  59520      0      0     20   1240 ||      3     48    187
c18g               1330   2592     40   8641      8     54 ||      3     48    187
c18g_low           1330   2592      0      0      8     54 ||      3     48    187
c18g_verylow       1330   2592      0      0      8     54 ||      3     48    187
c16s                288    288      0      0      2      2 ||      7    144   1020
c16s_low            288    288      0      0      2      2 ||      7    144   1020
c16s_verylow        288    288      0      0      2      2 ||      7    144   1020
dgx2                 54     96      0      0      0      2 ||     24     48   1508
dgx2_low             54     96      0      0      0      2 ||     24     48   1508
dgx2_verylow         54     96      0      0      0      2 ||     24     48   1508
ih                 2898   3360      0      0     66     83 ||      1     24     61

                   YOUR   PEND   PEND   YOUR   DEFAULT   MAXIMUM
                    RUN    RES   OTHR   TOTL  JOB-TIME  JOB-TIME
COMMON VALUES:        0      0      0      0   15 mins   30 days
After a job has finished, you can get a job efficiency report. This provides information about CPU Efficiency and Memory Efficiency. The command is:
seff <jobid>
which shows the output
Job ID: 12345678
Cluster: rcc
User/Group: ab123456/ab123456
State: CANCELLED (exit code 0)
Nodes: 1
Cores per node: 48
CPU Utilized: 10:04:49
CPU Efficiency: 34.81% of 1-04:57:36 core-walltime
Job Wall-clock time: 00:36:12
Memory Utilized: 164.03 GB
Memory Efficiency: 89.73% of 182.81 GB
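To see where the CPU Efficiency figure comes from, the numbers in the report can be recomputed by hand: the core-walltime is the number of cores times the wall-clock time (48 x 00:36:12 = 1-04:57:36), and the efficiency is the CPU time used divided by that value. The sketch below reproduces the calculation; the function names to_seconds and cpu_efficiency are hypothetical, and durations are assumed to use the [D-]HH:MM:SS format seen in the report.

```shell
# Convert a duration in [D-]HH:MM:SS format to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print (($1 * 24 + $2) * 60 + $3) * 60 + $4
        else         print ($1 * 60 + $2) * 60 + $3
    }'
}

# CPU efficiency in percent: CPU time used / (cores * wall-clock time).
# Arguments: cpu_time cores wall_time
cpu_efficiency() {
    awk -v cpu="$(to_seconds "$1")" -v cores="$2" -v wall="$(to_seconds "$3")" \
        'BEGIN { printf "%.2f\n", 100 * cpu / (cores * wall) }'
}

# Values from the report above:
#   cpu_efficiency 10:04:49 48 00:36:12   # prints 34.81
```

A low value, as in this report, suggests that the job requested more cores than it kept busy.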
Batch jobs consume core-hours from the resources defined in your batch script. The used core-hours are taken from your quota.
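As a rough orientation, the core-hours of a job can be estimated before submitting it. The sketch below uses a deliberately simple model, allocated cores times wall-clock hours, and assumes that no memory-based or GPU-based weighting is applied; the actual accounting rules may differ. The function name core_hours is hypothetical.

```shell
# Estimate core-hours under a simple model: allocated cores * wall-clock hours.
# Assumption: no memory- or GPU-based weighting is applied.
# Arguments: nodes cores_per_node wall_seconds
core_hours() {
    awk -v n="$1" -v c="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", n * c * s / 3600 }'
}

# Examples:
#   core_hours 1 48 2172   # one 48-core node for 00:36:12 -> 28.96
#   core_hours 2 48 3600   # two 48-core nodes for one hour -> 96.00
```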
Use
r_wlm_usage to show your consumed quota, which is updated once a day.
The following is an example output for a fake project rwth1234 with
r_wlm_usage -p rwth1234 -q
Account: rwth1234
Type: rwth-m
Start of Accounting Period: 01.05.2018
End of Accounting Period: 01.05.2019
State of project: active
Quota monthly (core-h): 200000
Total quota (core-h): 2.400 Mio
Remaining core-h of prev. month: 0
Consumed core-h current month: 0
Consumable core-h (%): 200
Consumable core-h: 200000
Default partition: c18m
Allowed partitions: c18m,c18g
Max. allowed wallclocktime: 1.0 days
Please also see Slurm Accounting on how jobs are accounted with Slurm.
r_wlm_usage is the Slurm version of
r_batch_usage; to see old LSF usage, please use
sinfo can give you information about the partitions:
This shows you the available partitions, e.g.
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
c18m         up 30-00:00:0  851/20/369/1240 ncm[0001-1032],nrm[001-208]
You see the partition name, the state of the partition, the maximum wallclock time for a job in the partition, the state of the nodes and the nodelist. The node states are (in contrast to the manpage) (A)llocated/(I)dle/(O)ther/(T)otal. So, look at the second number of the node states if you are looking for free hosts.
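The idle count can also be extracted automatically. The following sketch assumes the column layout of the sample above, which matches what Slurm's summary view (sinfo -s) prints, with the A/I/O/T field in the fourth column; the function names idle_nodes and idle_per_partition are hypothetical.

```shell
# Print the number of idle nodes from an A/I/O/T field such as 851/20/369/1240.
idle_nodes() {
    echo "$1" | cut -d/ -f2
}

# Read summary output from stdin and print "<partition> <idle nodes>" per row.
# Assumption: one header line, A/I/O/T field in the 4th column.
idle_per_partition() {
    awk 'NR > 1 { split($4, n, "/"); print $1, n[2] }'
}

# Usage on a cluster (not executed here):
#   sinfo -s | idle_per_partition
```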
Short and long parameters:
Please consider the following example for illustration:
-c <numcpus> is the short form
--cpus-per-task=<numcpus> is the long form
Please note that the short form expects a blank after the parameter, while the long form expects a '='.
The Slurm parser stops parsing the
#SBATCH directives as soon as it hits a line with a 'normal' command, i.e. a line that is neither empty nor starts with
#. All subsequent directives will be completely ignored.
#!/usr/bin/zsh
ls
#SBATCH -J "my jobname"
The job name will not be set in this case. Should you experience batch arguments being ignored by Slurm, please also double-check for spelling mistakes in both your parameters and arguments. Expressions like
--output=name with blanks.%J cause the Slurm parser to terminate and ignore subsequent
#SBATCH directives. Be advised that indentation of
#SBATCH directives is not supported.
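A quick check before submitting can catch directives that would be ignored. The sketch below applies the rule described above: it skips empty lines and lines starting with #, and reports every #SBATCH line that appears after the first normal command. The function name find_ignored_directives is hypothetical, and treating whitespace-only lines as empty is an assumption.

```shell
# List #SBATCH lines that appear after the first normal command in a jobscript.
# Assumption: lines containing only blanks or tabs count as empty.
find_ignored_directives() {
    awk '
        /^[ \t]*$/ { next }          # empty line: parsing continues
        /^#/ {                       # comment, shebang or directive
            if (stopped && /^#SBATCH/) print NR ": " $0
            next
        }
        { stopped = 1 }              # first normal command: parsing stops
    ' "$1"
}

# Usage:
#   find_ignored_directives jobscript.sh
```

For the three-line example above, this prints the third line together with its line number; an empty result means that no directive follows a command.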
-c, --cpus-per-task=<numcpus> for OpenMP/Hybrid
-n, --ntasks=<numtasks> for Processes/MPI
--mem-per-cpu=<size> Memory needed per allocated CPU. The number of allocated CPUs can be larger than the number of requested tasks (hybrid jobs!)
# PLEASE DO NOT USE THE FOLLOWING OPTION:
--mem=<size> Memory needed per NODE
-e, --error=<filename> We do not recommend using this option; analyzing problems is easier if STDERR is merged into STDOUT
-t, --time=<time> Wall Clock Limit
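Put together, a hybrid MPI/OpenMP jobscript using these parameters might look like the sketch below. All values are illustrative assumptions, not recommendations: the job name, the task and CPU counts, the memory size, the output file name and the program ./my_program are hypothetical, --time is Slurm's standard wall clock limit option, and launching with srun is an assumption that may differ from the launcher recommended on your cluster.

```shell
#!/usr/bin/zsh

### SBATCH Section
#SBATCH --job-name=hybrid_example
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=3G
#SBATCH --time=0-01:00:00
#SBATCH --output=output.%J.txt

### Your Program Section
# One OpenMP thread per CPU allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ./my_program is a placeholder for your own executable
srun ./my_program
```

With these values the job is allocated 4 x 12 = 48 CPUs, so the memory request adds up to 48 x 3 GB = 144 GB in total.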
Submit your job for project
<projectname>. This additionally chooses the default partition for you.
# request one gpu per node
--gres=gpu:<type>:1
# request two gpus per node
--gres=gpu:<type>:2
# request two volta gpus (CLAIX18)
--gres=gpu:volta:2
# request two pascal gpus (CLAIX16)
--gres=gpu:pascal:2
# please note that a batch job requesting one GPU must be non-exclusive in order not to block the remaining GPU of the node
The right partition will be chosen for you; you do not need to request a partition.
In general, it is neither required nor recommended to submit a job to a specific partition, since the selection is driven by the project and/or the specified job requirements. However, in some cases (e.g., performance analysis of specific hardware) it might be relevant.
# select a partition
-p <partition>
A list of partitions can be found here, or you can use
r_wlm_usage -p <project> -q.
All our compute nodes have local SSDs integrated. In contrast to network file systems like
$WORK or the parallel file system
$HPCWORK, these SSDs are local devices. Thus, a job making use of them might benefit from these local devices in terms of performance. Furthermore, the performance might be much better for jobs using many small files compared to the performance on
In order to make use of these local SSDs in multi-node jobs, you can use BeeGFS On-Demand (BeeOND) by adding the following line to your batch script:
#SBATCH --beeond
Please Note: The job will become exclusive. This means that all cores of every allocated node will be allocated to this job, which increases your consumption of core-hours.
This will set up a shared, temporary (!) BeeGFS file system across all nodes allocated for your job, which means the data can be accessed by each process (e.g., MPI rank) from any node involved. You can access the file system using the path stored in the environment variable
Please Note: It is a temporary file system which only lives as long as your batch job is running. All data will be deleted in the epilog of the batch job. Thus, you have to copy back all relevant data to
$HPCWORK in your batch job.
beegfs-ctl --setpattern --numtargets=16 --chunksize=1m
The numtargets value must not be higher than the number of compute nodes involved.
A typical workflow for your job might look as follows:
- Request a BeeOND file system by adding
#SBATCH --beeond to your batch script.
- Copy the required raw data to
- Change the directory to the corresponding directory (e.g.,
- Start the preprocessing of your job (e.g., the domain decomposition).
- Start your application.
- Copy back all relevant (and only the relevant!) data/results. This is very important, because you cannot access the data after the job has finished or from any other node (e.g. a login node).
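The staging steps of this workflow can be sketched as a small helper function. This is only an illustration under stated assumptions: stage_and_run and all directory and file names are hypothetical, all paths are expected to be absolute, and in a real job the first argument would be the BeeOND path provided by the batch system rather than an arbitrary directory.

```shell
# Copy input data to a scratch file system, run a command there and copy
# back a single result file.
# Arguments: SCRATCH INPUT_DIR RESULT_DIR RESULT_FILE COMMAND [ARGS...]
# The body runs in a subshell, so the caller's working directory is unchanged.
stage_and_run() (
    scratch=$1 input_dir=$2 result_dir=$3 result_file=$4
    shift 4
    workdir="$scratch/work"
    mkdir -p "$workdir" "$result_dir" || exit 1
    cp -r "$input_dir"/. "$workdir"/ || exit 1   # copy the raw data
    cd "$workdir" || exit 1                      # change to the work directory
    "$@" || exit 1                               # preprocessing and application
    cp "$result_file" "$result_dir"/             # copy back only the result
)

# Usage in a jobscript (hypothetical paths and program):
#   stage_and_run "$SCRATCH_PATH" "$HPCWORK/input" "$HPCWORK/results" out.dat ./my_program
```

Copying back one named result file rather than the whole work directory keeps the amount of data transferred at the end of the job small.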