IT Center Help

Job Management

This guide provides a brief reference to commonly used Slurm commands for job management. For a complete list of commands and their use, please refer to the Slurm reference and beginner's guide. You can also use the --help option or check the man pages (man <COMMAND>) for more information.

Please Note: The commands squeue, sinfo and spart should not be used for high-frequency job monitoring. If you need to use these commands periodically, execute them at most once every 2 minutes.
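If a script has to wait for a job, the rate limit above can be respected with a simple polling loop. The following is a minimal sketch, not an official tool: the function name `wait_for_job` and the variable `POLL_INTERVAL` are hypothetical, and the sketch assumes that an empty `squeue` result means the job is no longer pending or running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until a job has left the Slurm queue,
# querying squeue no more often than every POLL_INTERVAL seconds.
# Keep POLL_INTERVAL at 120 or more on the cluster.
POLL_INTERVAL="${POLL_INTERVAL:-120}"

wait_for_job() {
    local job_id="$1"
    # -h suppresses the header line, -j restricts the output to one job.
    # Errors for job IDs that are already unknown to squeue are discarded.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$POLL_INTERVAL"
    done
    echo "Job $job_id has left the queue"
}

# Usage on the cluster (job ID is a placeholder):
#   wait_for_job 12345678
```

Once the job has left the queue, its final state can be looked up with sacct.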

Submit Jobs

sbatch <BATCH_SCRIPT> [ADDITIONAL_ARGUMENTS]

Submits a batch script to the Slurm queue and creates a job with an associated job ID. The batch script should specify the job parameters, such as the resource requirements, and the commands to be executed on the compute nodes. The job parameters can also be supplied as command-line arguments. For guidance on writing batch scripts, see this short tutorial.
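As an illustration of how job parameters and commands fit together in one file, here is a minimal batch script sketch. All values (job name, core count, time limit, log file name) are made-up examples rather than recommended settings, and the script itself is not taken from the tutorial.

```shell
#!/usr/bin/env bash
# Example batch script with illustrative values only.

# Hypothetical job name shown in the queue
#SBATCH --job-name=example
# One task with four cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# Wall-clock limit in hh:mm:ss
#SBATCH --time=00:10:00
# Log file; %j is replaced by the job ID
#SBATCH --output=example.%j.log

# Slurm sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${SLURM_JOB_ID:-unknown} uses ${OMP_NUM_THREADS} thread(s) on $(uname -n)"
```

Saved as `example.sh` (a hypothetical file name), the script could be submitted with `sbatch example.sh`. Options given on the command line, for example `sbatch --time=00:30:00 example.sh`, take precedence over the `#SBATCH` lines in the script.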

Display Job Queue

squeue [OPTIONS]

Displays pending and running jobs in the Slurm queue. Completed or canceled jobs are removed from this list.

squeue --me
Shows all of your currently pending and running jobs.
squeue --me --start
Shows the estimated start time of pending jobs. The estimate is not guaranteed.
squeue --Format=state -p c23g -h | sort | uniq -c
This shows the number of running and pending jobs on the Claix-2023 GPU partition.
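The counting step of the pipeline above can be wrapped in a small function so that it works for any partition. This is only a sketch: the name `count_job_states` is hypothetical, and the function simply counts identical lines from standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one job state per line on stdin and print
# "<count> <state>" lines, most frequent state first.
count_job_states() {
    # sort groups identical states, uniq -c counts them,
    # sort -rn orders by count, awk strips the column padding.
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage on the cluster:
#   squeue --Format=state -p c23g -h | count_job_states
```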

Cancel Jobs

scancel [OPTIONS] <JOB_ID>

Cancels one or more specified jobs. You can provide a comma-separated or space-separated list of job IDs. Note that by default this command does not confirm the cancellation.

scancel --me
Cancels all of your jobs.
scancel -v <JOB_ID>
Cancels the job with the corresponding ID and provides details of the process.

Request an Interactive Job

salloc [OPTIONS]

Requests a computing node for interactive use. Job parameters are specified similarly to sbatch. Once the job starts running, you will be redirected to the head node with an interactive shell. Exiting the shell terminates the job.

salloc -p c23g --gres=gpu:1 -n 24 -t 1:00:00
This will reserve a GPU and 24 cores for one hour on a node in Claix-2023.

Display Job Accounting and More

sacct [OPTIONS]

Prints details about your pending, running and past jobs.

sacct -S $(date -I --date="yesterday")
Shows all jobs that have been submitted since yesterday.
sacct -o JobName%15,JobID,AllocTres%70
Shows billing and allocated resources.
sacct -o JobName%15,JobID,WorkDir%70
Shows the working directories.
sacct -o JobName%15,JobID,Start,End,Elapsed
Lists the start time, end time and elapsed time.
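The options above can be combined with machine-readable output to filter jobs in a script. The sketch below lists jobs that ended in a state other than COMPLETED. The function name `list_unsuccessful_jobs` is hypothetical; the sketch assumes GNU `date` for the default start date and relies on the sacct options `-X` (one line per job, no job steps), `--parsable2` (fields separated by `|`) and `--noheader`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print jobs since a given date that are neither
# completed successfully nor still pending or running.
list_unsuccessful_jobs() {
    # Default start date: yesterday (requires GNU date).
    local since="${1:-$(date -I --date="yesterday")}"
    sacct -X -S "$since" -o JobID,JobName,State,Elapsed --parsable2 --noheader |
        awk -F'|' '$3 != "COMPLETED" && $3 != "RUNNING" && $3 != "PENDING" {
            print $1, $2, $3, $4
        }'
}

# Usage on the cluster:
#   list_unsuccessful_jobs              # since yesterday
#   list_unsuccessful_jobs 2024-06-01   # since a fixed date
```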

Display Information about the Cluster

sinfo [OPTIONS]

By default, lists the partitions, node states and corresponding host names of nodes.

sinfo -O Partition,NodeAIOT,CPUsState:30
Shows the current load of the cluster. Note that free nodes or cores may not be immediately available, as they could be reserved for larger jobs. The output shows the number of (partially) allocated (A) and idle (I) nodes, the number of nodes in other states (O), such as down, and the total number of nodes (T).
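The A/I/O/T column is printed as four numbers separated by slashes, which makes it easy to split in a script. The following sketch extracts the idle and total node counts per partition. The function name `idle_nodes_per_partition` is hypothetical, and the sketch assumes that partition names are short enough to be separated from the next column by whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print idle and total node counts per partition.
idle_nodes_per_partition() {
    # -h suppresses the header; NodeAIOT prints allocated/idle/other/total.
    sinfo -h -O Partition,NodeAIOT |
        awk '{
            split($2, n, "/")   # n[1]=A, n[2]=I, n[3]=O, n[4]=T
            printf "%s idle=%d total=%d\n", $1, n[2], n[4]
        }'
}

# Usage on the cluster (at most once every 2 minutes):
#   idle_nodes_per_partition
```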

spart [OPTIONS]

Provides information about the load on the cluster and the length of the Slurm queue.

Check Job Performance

seff [OPTIONS] <JOBID>

Provides an efficiency report for completed jobs, including data on core and memory usage. For more detailed monitoring, consider using additional job monitoring tools.

Last modified on 18.06.2024

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Germany license.