
SLURM

Slurm is an open-source job scheduler and resource management system used in high-performance computing (HPC) environments. It allows users to submit and manage jobs on a cluster of computers, allocating resources such as CPU time, memory, and GPUs. Slurm is commonly used in scientific computing, data analysis, and other compute-intensive tasks, and it is designed to be scalable and efficient for large-scale computing clusters.

Definitions

A combination of raw technical detail, Slurm's loose usage of the terms core and CPU, and the multiple models of parallel computing makes it necessary to establish some background before explaining how to make efficient use of multiple cores on ORFEO.

  • node: a single computational machine.
  • socket: in Slurm's jargon, a single processor together with all the cores that belong to it.
  • core: one of the hardware units that fill a socket.
  • thread: a hardware thread provided by the SMT (simultaneous multithreading) capability of the processor.
  • cpu: depending on the system configuration, this can be either a core or a thread. With SMT disabled (one thread per core), a cpu is a core.
  • task: an instance of the executed command; useful in the single-program, multiple-data (SPMD) model.

[Figure: relationship between cpu, task, and core]

The terms cpu and core may occasionally overlap in meaning, and in some contexts they might even be conflated with the concept of a socket. The interpretation of these terms is often context-dependent; in this documentation we adhere to the meanings defined above.
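To see how these terms map onto real hardware, a node's layout can be inspected with scontrol. Here epyc001 is just an example node name; the relevant output fields are Sockets, CoresPerSocket, ThreadsPerCore, and CPUTot:

$ scontrol show node epyc001 | grep -Eo "(Sockets|CoresPerSocket|ThreadsPerCore|CPUTot)=[0-9]+"

On an EPYC node this should report 2 sockets, 64 cores per socket, 1 thread per core, and 128 CPUs in total, matching the S:C:T column of sinfo shown below.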

Information gathering

Slurm uses the concept of associations to group users and groups/projects, which helps in managing access to resources on the cluster. When logging into a cluster that uses the Slurm workload manager, understanding the associations of your account is the first step to understanding which resources you can use.

$ sacctmgr list associations Users=$(whoami) format=Account,User,Partition
   Account       User  Partition
---------- ---------- ----------
      dssc     user00        gpu
      dssc     user00       epyc
      dssc     user00       thin

This lists all the account/partition pairs our user is allowed to use when asking for resources. The sinfo command then prints information about nodes and partitions.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
EPYC         up 6-06:00:00      1    mix epyc001
EPYC         up 6-06:00:00      1  alloc epyc002
EPYC         up 6-06:00:00      5   idle epyc[004-008]
THIN         up 6-06:00:00      3   idle thin[007-008,010]
GPU          up 6-06:00:00      1  alloc gpu003

Common options for this command are:

  • -l, --long print more detailed information
  • -N, --Node print information in a node-oriented format

When specified, these options produce more detailed output:

$ sinfo -lN
Thu Mar 30 10:46:02 2023
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
epyc001        1      EPYC       mixed 128    2:64:1 512000        0      1   (null) none
epyc002        1      EPYC   allocated 128    2:64:1 512000        0      1   (null) none
epyc004        1      EPYC        idle 128    2:64:1 512000        0      1   (null) none
epyc005        1      EPYC        idle 128    2:64:1 512000        0      1   (null) none
epyc006        1      EPYC        idle 128    2:64:1 512000        0      1   (null) none
epyc007        1      EPYC        idle 128    2:64:1 512000        0      1   (null) none
epyc008        1      EPYC        idle 128    2:64:1 512000        0      1   (null) none
gpu003         1       GPU   allocated 48     2:12:2 240000        0      1   (null) none
thin007        1      THIN        idle 24     2:12:1 768000        0      1   (null) none
thin008        1      THIN        idle 24     2:12:1 768000        0      1   (null) none
thin010        1      THIN        idle 24     2:12:1 768000        0      1   (null) none

Here, S:C:T stands for Socket:Core:Thread. The state can be idle if the node is not used, allocated if the node is fully booked, and mixed if there is still room for more jobs on that node. For a complete list of possible node states, see the official documentation.
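sinfo can also filter by partition and node state, which helps when looking for free nodes. For example, listing only the idle nodes of the EPYC partition (output consistent with the listing above):

$ sinfo -p EPYC -t idle
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
EPYC         up 6-06:00:00      5   idle epyc[004-008]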

The same command with a custom output format gives very detailed information about the cluster status, which can help in deciding how many resources to request in order to fill the empty slots:

$ sinfo -N --format="%.15N %.6D %.10P %.11T %.4c %.10z %.8m %.10e %.9O %.15C"
       NODELIST  NODES  PARTITION       STATE CPUS      S:C:T   MEMORY   FREE_MEM  CPU_LOAD   CPUS(A/I/O/T)
         dgx001      1        DGX        idle  256     2:64:2  1000000    1017901      0.77     0/256/0/256
        epyc001      1       EPYC       mixed  128     2:64:1   512000     495721      0.02     4/124/0/128
        epyc002      1       EPYC   allocated  128     2:64:1   512000     473239      1.84     128/0/0/128
        epyc004      1       EPYC        idle  128     2:64:1   512000     509546      0.00     0/128/0/128
        epyc005      1       EPYC        idle  128     2:64:1   512000     508064      0.00     0/128/0/128

Where CPUS(A/I/O/T) indicates Allocated/Idle/Other/Total cores.

  • squeue lists jobs in the queue.
$ squeue --long
Thu Mar 30 10:52:35 2023
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              3551      EPYC    xfran User1  RUNNING      10:24 2-00:00:00      1 epyc002
              3548      EPYC interact User2  RUNNING      42:57   2:00:00      1 epyc001
              3546       GPU     bash User3  RUNNING      51:55   1:00:00      1 gpu003

The same command with a custom output format to monitor the queue:

$ squeue --format="%.8i %.9P %.8j %.8u %.9M %.6D %.5C %.7m %.20R %.8T"
   JOBID PARTITION     NAME     USER      TIME  NODES  CPUS MIN_MEM     NODELIST(REASON)    STATE
    3551      EPYC    xfran User1     11:33      1   128    499G              epyc002  RUNNING
    3548      EPYC interact User2     44:06      1     4      1G              epyc001  RUNNING
    3546       GPU     bash User3     53:04      1    48      1G               gpu003  RUNNING

The possible job states (STATE) are fully described in the official documentation.
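To monitor only your own jobs, optionally filtered by state, squeue accepts user and state filters; for example:

$ squeue --user=$(whoami) --states=RUNNING,PENDING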

Interactive usage

Allocate resources

salloc obtains the requested resource allocation, executes a command (or program, or script), and finally releases the allocated resources when the command finishes. If a script is launched, it can run several srun instances (these are called job steps).

A new bash session will start on the login node when the allocation starts. This approach is useful for running a GUI on the login node while your processes run on the compute nodes.[1]

Example: salloc -p EPYC --account dssc --nodes 1 --ntasks 4 <my_command>

List of relevant options:

  • --nodes -N The number of nodes for the job (computers).
  • --mem The amount of memory per node your job needs; if not specified, there is a default of 1GB for each allocated core.
  • --ntasks -n The total number of tasks your job requires.
  • --ntasks-per-node Similar to the option above, but the requested amount is for each allocated node.
  • --gpus=# The total number of GPUs you need in your job.
  • --mem-per-cpu The amount of memory per cpu your job requires.
  • --exclusive This will get you exclusive node usage.
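As an illustration only, several of these options can be combined in a single request; the values below are arbitrary examples to be adapted to your needs:

salloc -p GPU -A dssc --nodes 1 --ntasks 2 --cpus-per-task=4 --mem-per-cpu=2G --gpus=1 --time=01:00:00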

Use allocated resources

If no command is specified, the default user shell is launched. Once resources are allocated, it is possible to run MPI programs interactively across nodes, using mpirun or srun.

Example of salloc interactive use:

[user@loginNode]$ salloc -p EPYC -A dssc --nodes 2 --ntasks-per-node 1 --time=00:30:00
salloc: Granted job allocation 174
[user@loginNode]$ squeue
  JOBID PARTITION   NAME   USER    TIME  NODES  CPUS MIN_MEM   NODELIST(REASON)   STATE
   174    EPYC  interact   user    0:04    2     2      1G      epyc[006-007]  RUNNING
[user@loginNode]$ mpirun -np 2 <my_program> ...
[user@loginNode]$ mpirun -np 2 <my_program> ...

In the following example, we allocate some resources and consume them with srun. Please note that the shell is still on the loginNode.

# Resource allocation
[user@loginNode]$ salloc -n4 -N4 --cpus-per-task=2  -p EPYC -A dssc --time=0:10:0 --mem=50GB
salloc: Granted job allocation 3556
salloc: Waiting for resource configuration
salloc: Nodes epyc[004-007] are ready for job
# Command 1
[user@loginNode]$ srun hostname
epyc007
epyc004
epyc005
epyc006
# Command 2
[user@loginNode]$ srun -n2 -N2 hostname
epyc004
epyc005
# Command 3
[user@loginNode]$ srun -n8 -N8 hostname
srun: error: Only allocated 4 nodes asked for 8

In this example we allocate 4 nodes with one task each; each task can use up to 2 cores. Without any arguments, as in command 1, srun uses all the allocated resources: it runs the command hostname on each node (and each hostname invocation could in principle use 2 cores). With the second srun command we use only 2 nodes, one task each, by specifying the corresponding flags. The last request, srun -n8 -N8, cannot be satisfied and therefore results in an error.
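Inside an allocation, Slurm exports environment variables that describe the granted resources. As a quick sanity check (the values shown match the allocation above):

[user@loginNode]$ echo $SLURM_JOB_ID $SLURM_JOB_NODELIST $SLURM_NTASKS $SLURM_CPUS_PER_TASK
3556 epyc[004-007] 4 2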

Release resources

At the end of your session, remember to release the resources:

[user@loginNode ~]$ exit
exit
salloc: Relinquishing job allocation 3556
[user@loginNode ~]$

Run without prior allocation

The srun command can also be used to submit an interactive job, as follows:

[user@loginNode ~]$ srun --partition EPYC -A dssc --nodes 1 --ntasks 1 --cpus-per-task=24 --time=1:00:00 --pty bash
[user@computeNode ~]$ squeue -l
Tue Oct 11 00:40:53 2022
       JOBID  PARTITION   NAME      USER    STATE    TIME  TIME_LIMI  NODES NODELIST(REASON)
          80       EPYC   bash   User1  RUNNING    0:04    1:00:00        1      epyc007

With this command we requested:

  • One single node with the argument -N 1 or --nodes 1
  • One single task with the argument -n 1 or --ntasks 1
  • 24 cores dedicated to the single task requested, with -c 24

Note that the shell prompt changed from loginNode to computeNode: you now effectively have a shell on the compute node.

Shorter command version: srun -p EPYC -A dssc -N 1 -n 1 -c 24 -t 1:00:00 --pty bash

Warning

Using --pty bash spawns an interactive shell on the compute node, which differs from salloc. However, this method of consuming resources is discouraged and deprecated. It is also important to note that launching an MPI job across nodes from a --pty bash interactive session won't work; instead, use salloc to launch your MPI job.

Release the resources

Remember to release the allocated resources with the command exit:

[user@computeNode ~]$ exit
exit
[user@loginNode ~]$

Note again that the shell changed from computeNode back to loginNode.

It is possible to use the srun command also for allocating a GPU. The only detail to add is --gpus=1.

[user@loginNode ~]$ srun --partition GPU -A dssc --nodes 1 --ntasks 1 --cpus-per-task=24 --gpus=1 --time=1:00:00 --pty bash
[user@computeNode ~]$ squeue -l
Tue Oct 11 00:40:53 2022
       JOBID  PARTITION   NAME      USER    STATE    TIME  TIME_LIMI  NODES NODELIST(REASON)
          90       GPU   bash   User1  RUNNING    0:04    1:00:00        1      GPU002

Non-interactive usage

This approach is the standard usage expected in HPC workflows (best practice). The sbatch command can be used to submit a script that requests resources and executes the job:

$ cat job.sh
#!/bin/bash
#SBATCH --partition=EPYC
#SBATCH --account=dssc
#SBATCH --job-name=my_super_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=200gb
#SBATCH --time=00:05:00
#SBATCH --output=my_job_%j.out
pwd; hostname; date
echo "Hello, world !"

Let's explore the directives sent to sbatch:

  • We specified the desired partition: --partition=EPYC
  • We specified the desired account where the resources are going to be billed: --account=dssc
  • We named our job; this name will be displayed in the squeue output: --job-name=my_super_job
  • The resources requested are: one node with 24 cores and 200GB of RAM
  • The job can run 5 minutes: --time=00:05:00
  • The output of the job will be placed in a file named my_job_%j.out, where %j will be replaced by the job ID.

This file can be submitted as a script with the sbatch command as follows:

$ sbatch job.sh
Submitted batch job 99

Alternatively, the resources requested in the above example (and other options) can be specified directly on the command line:

sbatch --partition EPYC --nodes 1 --ntasks 1 --cpus-per-task=24 --time=0:5:00 <my_script>

Other common options:

  • --ntasks-per-node=<#tasks> specify the number of tasks for each node
  • -D, --chdir=<directory> set the working directory of the batch script to <directory> before it is executed; the directory the job was submitted from is available in the environment variable SLURM_SUBMIT_DIR
  • -o, --output=<file_path> redirect stdout and stderr to the specified file; the default file name is slurm-%j.out, where %j is replaced by the job ID.

It is generally good practice to look at the manuals with man sbatch and man srun to explore the many other options.
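For MPI jobs the same mechanism applies. The following is a minimal sketch, where my_mpi_app is a hypothetical MPI executable (adapt names, paths, and resources to your case):

#!/bin/bash
#SBATCH --partition=EPYC
#SBATCH --account=dssc
#SBATCH --job-name=mpi_job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --output=mpi_job_%j.out

cd "$SLURM_SUBMIT_DIR"     # directory the job was submitted from
srun ./my_mpi_app          # one MPI rank per task: 2 nodes x 4 tasks = 8 ranks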

Manipulate a job

To know everything about a specific job:

$ scontrol show jobid 178
JobId=178 JobName=interactive
  UserId=user(1000001) GroupId=user00(1000) MCS_label=N/A
  Priority=4294901582 Nice=0 Account=(null) QOS=(null)
  JobState=RUNNING Reason=None Dependency=(null)

  RunTime=00:00:14 TimeLimit=00:30:00 TimeMin=N/A
  SubmitTime=2022-10-11T17:02:58 EligibleTime=2022-10-11T17:02:58
  StartTime=2022-10-11T17:02:58 EndTime=2022-10-11T17:32:58 Deadline=N/A


  Partition=EPYC AllocNode:Sid=login02:168052
  NodeList=epyc[006-007]
  NumNodes=2 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=4,mem=4G,node=2,billing=4

  MinCPUsNode=2 MinMemoryCPU=1G MinTmpDiskNode=0

  ... etc ...

To cancel PENDING or RUNNING jobs, issue the command scancel <jobid>.

The jobid is the one reported by squeue. Example:

$ squeue
   JOBID PARTITION     NAME     USER      TIME  NODES  CPUS MIN_MEM     NODELIST(REASON)    STATE
     196      EPYC interact     user      0:02      1     1      1G              epyc006  RUNNING
$ scancel 196
salloc: Job allocation 196 has been revoked.

Another useful example: scancel --state=PENDING --user=<my_user>
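scancel also accepts other filters; for instance, to cancel a job by its name (using the job name from the sbatch example above): scancel --name=my_super_job --user=$(whoami)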

Request tasks or cpus?

To clarify the distinction between tasks and cpus, let's examine the following scenarios:

  • Serial: --ntasks-per-node=1 --cpus-per-task=1
  • Multithreaded: increase --cpus-per-task while keeping --ntasks-per-node=1
  • Multinode/parallel MPI: increase --ntasks-per-node while keeping --cpus-per-task=1
  • Both paradigms (hybrid): adjust both parameters, as in the sketch below
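For the hybrid case, here is a minimal sketch of the relevant directives, where my_hybrid_app is a hypothetical MPI+OpenMP executable:

#!/bin/bash
#SBATCH --partition=EPYC
#SBATCH --account=dssc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # 4 MPI ranks per node
#SBATCH --cpus-per-task=8      # 8 OpenMP threads per rank
#SBATCH --time=00:10:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match thread count to the cores allocated per task
srun ./my_hybrid_app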

Slurm Commander

scom is a light TUI that helps to manage jobs on the cluster, gather information about them, and keep track of your work.

From the official repo:

SlurmCommander is a simple, lightweight, no-dependencies text-based user interface (TUI) to your cluster. It ties together multiple slurm commands to provide you with a simple and efficient interaction point with slurm.

Type scom in your shell to launch it.


  1. More on the salloc vs srun interactive shell: nesi.org