What is Slurm?
Slurm as a Workload Manager and Scheduler
Slurm is a resource allocator that assigns compute nodes to users for a limited period of time so they can run computations. Programs from multiple users can be queued and run at the same time; Slurm schedules them and runs each one on its own allocated resources.
- Users can interact with Slurm from within the cluster, usually through the login nodes.
- Users request cores, main memory, and time, then submit their programs to Slurm to be queued.
- Slurm reserves the requested resources (CPUs for traditional HPC with MPI, GPUs for ML) while the job waits its turn in the queue.
- After a waiting period, the resources are allocated and the program is run on compute nodes.
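As a rough illustration of the request step described above, the following Python sketch renders the text of a batch script that asks for cores, memory, and time. The helper name `render_batch_script`, the job name, and the program `./my_program` are hypothetical and used only for illustration; the `#SBATCH` options shown (`--job-name`, `--ntasks`, `--cpus-per-task`, `--mem`, `--time`) are standard `sbatch` options, but the values a given cluster accepts may differ.

```python
def render_batch_script(job_name, cores, mem_gb, time_limit, command):
    """Return the text of a Slurm batch script requesting the given resources.

    All argument values are illustrative assumptions, not site defaults.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",    # name shown in the queue
        "#SBATCH --ntasks=1",                # a single task (process)
        f"#SBATCH --cpus-per-task={cores}",  # cores requested for that task
        f"#SBATCH --mem={mem_gb}G",          # main memory per node, in GB
        f"#SBATCH --time={time_limit}",      # wall-clock limit, HH:MM:SS
        "",
        command,                             # the program to run once allocated
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request: 4 cores, 8 GB of memory, 30 minutes.
    print(render_batch_script("example", 4, 8, "00:30:00", "srun ./my_program"))
```

Saving the generated text to a file (for example `job.sh`) and running `sbatch job.sh` on a login node would place the job in the queue; it starts once the requested resources are allocated.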
The following diagram shows a simplified view of how Slurm manages jobs on the CLAIX Compute Cluster:
In this diagram, a user accesses a login node, creates a Slurm job, and submits it to the Slurm queue. The job waits until the resources are allocated, and then the computation runs.