Help with Slurm Errors

Summary

This page gives an overview of common Slurm errors and how to fix them, as well as a list of pending job reasons and their meaning.

Many errors can be solved by carefully reading and applying the syntax defined in the sbatch documentation.


Details: Submission Errors

sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified

Please follow the correct syntax for accounts:
#SBATCH --account=<project>
#SBATCH -A <project>

Also make sure that you request a partition your project is allowed to use, as determined by your project proposal:
#SBATCH --partition=<partition>
#SBATCH -p <partition>

---------------------------------

sbatch: error: Unable to allocate resources: Invalid partition name specified

Please specify a partition that exists, using the correct syntax:
#SBATCH --partition=<partition>
#SBATCH -p <partition>

----------------------------------

sbatch: error: Batch job submission failed: Requested node configuration is not available

Please make sure that the requested nodes actually provide the total number of CPUs you request.
Please make sure that the requested amount of memory per node is available on those nodes.

Check the proper syntax:
#SBATCH --ntasks=<CPUs>
#SBATCH -n <CPUs>

If used:
#SBATCH --cpus-per-task=<CPUs>
#SBATCH -c <CPUs>

And for memory:
#SBATCH --mem=<AMOUNT>
#SBATCH --mem-per-cpu=<AMOUNT>

--mem is the memory per node. --mem-per-cpu is multiplied by the number of CPUs allocated (ntasks * cpus-per-task).
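To see how quickly --mem-per-cpu adds up, the multiplication can be sketched in shell. The function name total_mem is hypothetical and the numbers are made-up examples. Compare the result with the memory actually available on the nodes of your partition, keeping in mind that the total is spread over however many nodes the job uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total memory implied by --mem-per-cpu.
# Arguments: ntasks, cpus-per-task, mem-per-cpu (plain integer, e.g. in MB).
# The result is in the same unit as the third argument.
total_mem() {
    local ntasks=$1 cpus_per_task=$2 mem_per_cpu=$3
    echo $(( ntasks * cpus_per_task * mem_per_cpu ))
}

# Example values (assumptions): 8 tasks, 4 CPUs per task, 2000 MB per CPU.
total_mem 8 4 2000    # prints 64000
```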


Details: Job Failures

Job fails and generates no output files

If the job cannot write to the output files specified in the batch script, it dies and no file is created.
The batch script has to specify a valid path to an existing directory, and the output file itself must be writable.

Please check typos and the correct syntax of:

#SBATCH --output=/path/to/file/file_name
#SBATCH --error=/path/to/file/file_name
#SBATCH -o /path/to/file/file_name
#SBATCH -e /path/to/file/file_name

Do not use $HOME, $WORK, ~ or similar variables in these paths, because they are not expanded in #SBATCH lines. Use only explicit full paths and Slurm's % filename patterns (for example %j for the job ID).
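The checks above can be automated before submission. The following is a minimal sketch, not an official tool: the function name check_output_paths is hypothetical, and it only looks at the --output, --error, -o and -e forms of the directive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan a batch script for #SBATCH output/error paths that
# contain shell variables or start with '~', or whose directory does not exist.
# Returns 0 if all paths look fine, non-zero otherwise.
check_output_paths() {
    grep -E '^#SBATCH[[:space:]]+(--(output|error)[=[:space:]]|-[oe][[:space:]])' "$1" | {
        status=0
        while IFS= read -r line; do
            # Strip the directive and option name, keeping only the path.
            path=$(printf '%s\n' "$line" |
                sed -E 's/^#SBATCH[[:space:]]+(--(output|error)[=[:space:]]+|-[oe][[:space:]]+)//')
            case $path in
                *'$'*|'~'*)
                    printf 'not expanded by Slurm: %s\n' "$path"
                    status=1 ;;
                *)
                    dir=$(dirname -- "$path")
                    if [ ! -d "$dir" ]; then
                        printf 'missing directory: %s\n' "$dir"
                        status=1
                    fi ;;
            esac
        done
        [ "$status" -eq 0 ]
    }
}

# Usage (job.sh is a placeholder for your own batch script):
#   check_output_paths job.sh && sbatch job.sh
```

The sketch does not check write permissions or trailing comments on the directive line, so a clean result is a hint rather than a guarantee.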

----------------------------------

Job fails due to time limit

slurmstepd: error: JOB xxxxxxx ON XXXX CANCELLED AT XXXXXXXXXXXX DUE TO TIME LIMIT 

srun: error: XXXXX : task XXXX: Killed

Slurm allocates ONLY the requested resources defined in the submission batch script.
If your program runs longer than the requested time limit, it will be killed without warning, and the results of the computation may be incomplete or lost.


Please ensure you request enough time (with the correct syntax) for your program to finish computations across all nodes:
#SBATCH --time=HH:MM:SS
#SBATCH -t HH:MM:SS
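If you know roughly how long a run takes in seconds, the HH:MM:SS value can be derived mechanically. The sketch below uses a hypothetical function slurm_time; the optional safety margin in percent is an assumption for illustration, not a site recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a runtime in seconds to HH:MM:SS for --time.
# Arguments: seconds, optional safety margin in percent (default 0).
slurm_time() {
    local seconds=$1 margin_percent=${2:-0}
    local total=$(( seconds + seconds * margin_percent / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

slurm_time 5400       # prints 01:30:00
slurm_time 3600 20    # prints 01:12:00 (one hour plus 20 percent)
```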

----------------------------------

Job fails due to memory limit (oom-kill)

slurmstepd: error: Detected X oom-kill event(s) in XXXXXXXXXXXXX

Some of your processes may have been killed by the cgroup out-of-memory handler.

srun: error: XXXXX : task XXXX: Killed

Slurm allocates ONLY the requested resources defined in the submission batch script.
If your program uses more memory than the requested memory limit per node, it will be killed without warning, and the results of the computation may be incomplete or lost.


Please ensure you request enough memory (with the correct syntax) for your program on each node:
#SBATCH --mem=<size>[units]

Note that --mem has no short form: -m is the short form of --distribution, not of --mem.

Or if used:
#SBATCH --mem-per-cpu=<size>[units]

Different units can be specified using the suffix [K|M|G|T]. The default is M for megabytes, and Slurm converts between units in steps of 1024 (1G = 1024M).

----------------------------------

Job runs but RLIMIT_NPROC is shown as an error

slurmstepd: error: Can't propagate RLIMIT_NPROC of XXXXXXX from submit host: Invalid argument

This is a known problem with our current installation of Slurm and can be ignored. It should also not affect your program or results in any meaningful way.

 

Details: Wait Status

The squeue command displays a pending reason for each waiting job. The most common reasons are explained below:

  • None
    • Explanation: The job is freshly queued and has not yet been considered by Slurm.
    • Solution: Just wait until Slurm reports a real reason.
  • Resources
    • Explanation: The job is waiting for the requested resources to become available and will be next in line once they are free.
    • Solution: Just wait.
  • Priority
    • Explanation: There is at least one job with a higher priority.
    • Solution: Just wait.
  • Dependency
    • Explanation: The job is waiting for a dependency to be fulfilled.
    • Solution: Just wait. The job should run as soon as the dependency is fulfilled. You can also check the job's dependencies for errors.
  • JobArrayTaskLimit
    • Explanation: The maximum number of simultaneously running array tasks has been reached. For example, if you used "--array=0-15%4" and four tasks are running, the remaining eligible tasks will have this pending reason.
    • Solution: Wait for the running array tasks to finish.
  • AssocMaxWallDurationPerJobLimit
    • Explanation: The job requests more wallclock time than the maximum allowed for the job / partition / account.
    • Solution: This job will never start. Please delete the job, check the "Max. allowed wallclocktime:" entry in the output of r_wlm_usage -q, correct your script accordingly, and resubmit the modified job.
  • AssocMaxCpuPerJobLimit
    • Explanation: The job requests more cores than the maximum allowed for the job / partition / account.
    • Solution: This job will never start. Please delete the job, check the "Max. allowed cores per job:" entry in the output of r_wlm_usage -q, correct your script accordingly, and resubmit the modified job.
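The list above can be turned into a small lookup for your own pending jobs. This is a sketch only: the function name explain_reason is hypothetical, and it covers just the reasons described on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pending reason reported by squeue to the advice
# given in the list above.
explain_reason() {
    case $1 in
        None)
            echo "not yet considered by Slurm, wait for a real reason" ;;
        Resources|Priority)
            echo "just wait" ;;
        Dependency)
            echo "wait for the dependency, or check it for errors" ;;
        JobArrayTaskLimit)
            echo "wait for running array tasks to finish" ;;
        AssocMaxWallDurationPerJobLimit|AssocMaxCpuPerJobLimit)
            echo "will never start: delete the job, fix the script, resubmit" ;;
        *)
            echo "not covered on this page" ;;
    esac
}

# On a system with Slurm, annotate your own pending jobs
# (%i = job ID, %r = pending reason):
#   squeue -u "$USER" -t PENDING -h -o '%i %r' |
#       while read -r id reason; do
#           echo "$id $reason: $(explain_reason "$reason")"
#       done

explain_reason JobArrayTaskLimit
```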

 

Last modified on 23.02.2024

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Germany License.