SLURM¶
Slurm is an open-source job scheduler and resource management system used in high-performance computing (HPC) environments. It allows users to submit and manage jobs on a cluster of computers, allocating resources such as CPU time, memory, and GPUs. Slurm is commonly used in scientific computing, data analysis, and other compute-intensive tasks, and it is designed to be scalable and efficient for large-scale computing clusters.
Definitions¶
Slurm's loose usage of the terms core and CPU, combined with the multiple models of parallel computing, requires a bit of background to fully explain how to make efficient use of multiple cores on ORFEO.
- node: a single computational machine.
- socket: in Slurm's jargon, a single physical processor together with all the cores that belong to it.
- core: a hardware execution unit within a socket.
- thread: a hardware thread provided by the SMT capability of the processor.
- cpu: depending on system configuration, either a core or a thread. When simultaneous multithreading is set to one (one thread per core), a cpu is a core.
- task: an instance of the executed command; useful in the single-program, multiple-data (SPMD) model.
(figure: relationship between cpu, task, and core)
The terms cpu and core may occasionally overlap in meaning, and in some contexts they might even be conflated with the concept of a socket. The interpretation of these terms is often context-dependent; this documentation adheres to the meanings defined above.
Information gathering¶
Slurm uses the concept of associations to group users and groups/projects, which helps in managing access to resources on the cluster. When logging into a cluster that uses the Slurm workload manager, understanding the associations of your account is the first step to understanding which resources you can use.
$ sacctmgr list associations Users=$(whoami) format=Account,User,Partition
Account User Partition
---------- ---------- ----------
dssc user00 gpu
dssc user00 epyc
dssc user00 thin
This lists all the account/partition pairs our user is allowed to use when asking
for resources. The sinfo command prints information about nodes and
partitions.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
EPYC up 6-06:00:00 1 mix epyc001
EPYC up 6-06:00:00 1 alloc epyc002
EPYC up 6-06:00:00 5 idle epyc[004-008]
THIN up 6-06:00:00 3 idle thin[007-008,010]
GPU up 6-06:00:00 1 alloc gpu003
Common options for this command are:
- -l, --long: print more detailed information
- -N, --Node: print information in a node-oriented format
When specified, these options produce more detailed output:
$ sinfo -lN
Thu Mar 30 10:46:02 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
epyc001 1 EPYC mixed 128 2:64:1 512000 0 1 (null) none
epyc002 1 EPYC allocated 128 2:64:1 512000 0 1 (null) none
epyc004 1 EPYC idle 128 2:64:1 512000 0 1 (null) none
epyc005 1 EPYC idle 128 2:64:1 512000 0 1 (null) none
epyc006 1 EPYC idle 128 2:64:1 512000 0 1 (null) none
epyc007 1 EPYC idle 128 2:64:1 512000 0 1 (null) none
epyc008 1 EPYC idle 128 2:64:1 512000 0 1 (null) none
gpu003 1 GPU allocated 48 2:12:2 240000 0 1 (null) none
thin007 1 THIN idle 24 2:12:1 768000 0 1 (null) none
thin008 1 THIN idle 24 2:12:1 768000 0 1 (null) none
thin010 1 THIN idle 24 2:12:1 768000 0 1 (null) none
In this case, S:C:T stands for Socket:Core:Thread. The state can be idle if
the node is not used, allocated if the node is fully booked, and mixed if there
is still space for more jobs on that node. For a complete list of possible node
states, visit the official documentation.
The same command with a custom output format gives very detailed information about the cluster status, which can help in deciding how many resources to request to fill the empty slots:
$ sinfo -N --format="%.15N %.6D %.10P %.11T %.4c %.10z %.8m %.10e %.9O %.15C"
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY FREE_MEM CPU_LOAD CPUS(A/I/O/T)
dgx001 1 DGX idle 256 2:64:2 1000000 1017901 0.77 0/256/0/256
epyc001 1 EPYC mixed 128 2:64:1 512000 495721 0.02 4/124/0/128
epyc002 1 EPYC allocated 128 2:64:1 512000 473239 1.84 128/0/0/128
epyc004 1 EPYC idle 128 2:64:1 512000 509546 0.00 0/128/0/128
epyc005 1 EPYC idle 128 2:64:1 512000 508064 0.00 0/128/0/128
Where CPUS(A/I/O/T) indicates Allocated/Idle/Other/Total cores.
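The last column packs four numbers into a single field. As a small sketch (assuming the line format shown above), the idle-core count can be extracted from a captured sinfo line like this:

```shell
# Sample node line captured from the sinfo invocation above.
line="epyc001 1 EPYC mixed 128 2:64:1 512000 495721 0.02 4/124/0/128"

# The last field is Allocated/Idle/Other/Total; print the idle count.
idle=$(echo "$line" | awk '{split($NF, f, "/"); print f[2]}')
echo "idle cores: $idle"
```

Running this prints `idle cores: 124`, i.e. 124 of the 128 cores on epyc001 are still free.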
squeue lists jobs in the queue.
$ squeue --long
Thu Mar 30 10:52:35 2023
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
3551 EPYC xfran User1 RUNNING 10:24 2-00:00:00 1 epyc002
3548 EPYC interact User2 RUNNING 42:57 2:00:00 1 epyc001
3546 GPU bash User3 RUNNING 51:55 1:00:00 1 gpu003
And the same command with a more formatted output to monitor the queue:
$ squeue --format="%.8i %.9P %.8j %.8u %.9M %.6D %.5C %.7m %.20R %.8T"
JOBID PARTITION NAME USER TIME NODES CPUS MIN_MEM NODELIST(REASON) STATE
3551 EPYC xfran User1 11:33 1 128 499G epyc002 RUNNING
3548 EPYC interact User2 44:06 1 4 1G epyc001 RUNNING
3546 GPU bash User3 53:04 1 48 1G gpu003 RUNNING
The possible job STATE values are fully described in the official documentation.
Interactive usage¶
Allocate resources¶
salloc obtains requested resource allocation, executes a command (or program,
script), and finally releases the allocated resource when the command (or
program, script) is finished. If a script is launched, it can run several
srun instances (these are called job steps).
A new bash session will start on the login node when the allocation starts. This approach is useful for running a GUI on the login node while your processes run on the compute nodes.
Example: salloc -p EPYC --account dssc --nodes 1 --tasks 4 <my_command>
List of relevant options:
- --nodes, -N: the number of nodes (computers) for the job.
- --mem: the amount of memory per node your job needs; if not specified, the default is 1GB for each allocated core.
- --ntasks, -n: the total number of tasks your job requires.
- --ntasks-per-node: similar to the option above, but the requested amount applies to each allocated node.
- --gpus=#: the number of GPUs per node you need in your job.
- --mem-per-cpu: the amount of memory per cpu your job requires.
- --exclusive: request exclusive usage of the node.
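As a quick check of the --mem default mentioned above (1GB per allocated core; the request sizes below are hypothetical), the memory a job gets when --mem is omitted can be worked out as:

```shell
# Hypothetical request: 8 cores on one node, no --mem option given.
cores_per_node=8
default_gb_per_core=1   # the per-core default stated above
echo "default memory per node: $((cores_per_node * default_gb_per_core))GB"
```

So a request for 8 cores without --mem is granted 8GB on that node.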
Use allocated resources¶
If no command is specified, the default user shell is launched. Once resources
are allocated, it is possible to run MPI programs interactively across
nodes, using mpirun or srun.
Example of salloc interactive use:
[user@loginNode]$ salloc -p EPYC -A dssc --nodes 2 --tasks-per-node 1 --time=00:30:00
salloc: Granted job allocation 174
[user@loginNode]$ squeue
JOBID PARTITION NAME USER TIME NODES CPUS MIN_MEM NODELIST(REASON) STATE
174 EPYC interact user 0:04 2 2 1G epyc[006-007] RUNNING
[user@loginNode]$ mpirun -np 2 <my_program> ...
In the following example, we allocate some resources, and consume them with
srun. Please note that the shell is still on the loginNode.
# Resource allocation
[user@loginNode]$ salloc -n4 -N4 --cpus-per-task=2 -p EPYC -A dssc --time=0:10:0 --mem=50GB
salloc: Granted job allocation 3556
salloc: Waiting for resource configuration
salloc: Nodes epyc[004-007] are ready for job
# Command 1
[user@loginNode]$ srun hostname
epyc007
epyc004
epyc005
epyc006
# Command 2
[user@loginNode]$ srun -n2 -N2 hostname
epyc004
epyc005
# Command 3
[user@loginNode]$ srun -n8 -N8 hostname
srun: error: Only allocated 4 nodes asked for 8
In the example we allocate 4 nodes with one task each; each task can use up to
2 cores. Without any arguments, as done in command 1, srun uses all the
allocated resources: it runs the command hostname on each node (virtually,
each hostname invocation can use 2 cores). With the second srun command we
use only 2 nodes, with one task each, by specifying the corresponding flags. The last
request, srun -n8 -N8, cannot be satisfied, hence it results in an error.
Release resources
At the end of your session, remember to release the resources by exiting the shell spawned by salloc:
[user@loginNode]$ exit
salloc: Relinquishing job allocation 3556
Run without prior allocation¶
The srun command can also be used to submit an interactive job as follows:
[user@loginNode ~]$ srun --partition EPYC -A dssc --nodes 1 --tasks 1 --cpus-per-task=24 --time=1:00:00 --pty bash
[user@computeNode ~]$ squeue -l
Tue Oct 11 00:40:53 2022
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
80 EPYC bash User1 RUNNING 0:04 1:00:00 1 epyc007
With this command we requested:
- one single node, with the argument -N 1 or --nodes 1
- one single task, with the argument -n 1 or --tasks 1
- 24 cores dedicated to the single task requested, with -c 24
Note that the shell prompt changed from loginNode to computeNode. Now you
effectively have a shell on the compute node.
Shorter command version: srun -p EPYC -A dssc -N 1 -n 1 -c 24 -t 1:00:00 --pty bash
Warning
Using --pty bash will spawn an interactive shell on the compute
node, which differs from salloc. However, this method of consuming
resources is discouraged and deprecated. It is also important to note that
using --pty bash in an interactive session to launch an MPI job across
nodes won't work. Instead, use salloc to launch your MPI job.
Release the resource
Remember to release the allocated resources, with the command exit:
Note again that the shell canged form computeNode back to loginNode
It's possible to use the srun command also for allocating a GPU; the only detail to add is --gpus=1.
[user@loginNode ~]$ srun --partition GPU -A dssc --nodes 1 --tasks 1 --cpus-per-task=24 --gpus=1 --time=1:00:00 --pty bash
[user@computeNode ~]$ squeue -l
Tue Oct 11 00:40:53 2022
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
90 GPU bash User1 RUNNING 0:04 1:00:00 1 GPU002
Non-interactive usage¶
This approach is the standard usage expected in HPC workflows (best
practice). The sbatch command can be used to submit a script that requests
resources and executes the job:
$ cat job.sh
#!/bin/bash
#SBATCH --partition=EPYC
#SBATCH --account=dssc
#SBATCH --job-name=my_super_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=200gb
#SBATCH --time=00:05:00
#SBATCH --output=my_job_%j.out
pwd; hostname; date
echo "Hello, world !"
Let's explore the directives sent to sbatch:
- We specified the desired partition: --partition=EPYC
- We specified the account where the resources are going to be billed: --account=dssc
- We named our job; this name will be displayed in the squeue output: --job-name=my_super_job
- The resources requested are one node with 24 cores and 200GB of RAM
- The job can run for 5 minutes: --time=00:05:00
- The output of the job will be placed in a file named my_job_%j.out, where %j will be the jobid.
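The %j placeholder is expanded by Slurm itself when the job starts. As an illustration of the pattern (mimicking the substitution with sed and a made-up job id), the resulting file name looks like:

```shell
pattern="my_job_%j.out"
jobid=3557            # hypothetical job id assigned by Slurm
# Slurm performs this substitution internally; sed mimics it here.
echo "$pattern" | sed "s/%j/$jobid/"
```

This prints `my_job_3557.out`, so each submission gets its own output file.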
This file can be submitted as a script with the sbatch command as follows:
$ sbatch job.sh
Alternatively, the requested resources (and options) of the above example can be specified directly on the command line, like:
$ sbatch --partition=EPYC --account=dssc --time=00:05:00 job.sh
Other common options:
- --ntasks-per-node=<#tasks>: specify the number of tasks for each node
- -D, --chdir=<directory>: set the working directory of the batch script to directory before it is executed; a commonly desired path is the submission directory, available as the environment variable SLURM_SUBMIT_DIR
- -o, --output=<file_path>: redirect stdout and stderr to the specified file; the default file name is slurm-%j.out, where %j is replaced by the job ID
It is generally good practice to look at manuals with man sbatch and
man srun to explore the many other options.
Manipulate a job¶
To know everything about a specific job:
$ scontrol show jobid 178
JobId=178 JobName=interactive
UserId=user(1000001) GroupId=user00(1000) MCS_label=N/A
Priority=4294901582 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:14 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2022-10-11T17:02:58 EligibleTime=2022-10-11T17:02:58
StartTime=2022-10-11T17:02:58 EndTime=2022-10-11T17:32:58 Deadline=N/A
Partition=EPYC AllocNode:Sid=login02:168052
NodeList=epyc[006-007]
NumNodes=2 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=4G,node=2,billing=4
MinCPUsNode=2 MinMemoryCPU=1G MinTmpDiskNode=0
... etc ...
In order to cancel PENDING or RUNNING jobs, issue the
command scancel <jobid>.
The jobid is the one reported by squeue. Example:
$ squeue
JOBID PARTITION NAME USER TIME NODES CPUS MIN_MEM NODELIST(REASON) STATE
196 EPYC interact user 0:02 1 1 1G epyc006 RUNNING
$ scancel 196
salloc: Job allocation 196 has been revoked.
Another useful example, which cancels all of your PENDING jobs: scancel --state=PENDING --user=<my_user>
Request tasks or cpus?¶
To clarify the distinction between tasks and cpus, let's examine the following scenarios:
| Scenario | Options |
|---|---|
| Serial | --ntasks-per-node=1 --cpus-per-task=1 |
| Multithreaded | increase --cpus-per-task while keeping --ntasks-per-node=1 |
| Multinode / parallel MPI | increase --ntasks-per-node while keeping --cpus-per-task=1 |
| Hybrid (both paradigms) | adjust both parameters |
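For the hybrid case, a minimal job-script sketch (partition, account, and sizes are illustrative, not a recommendation) requests a few tasks per node and several cpus per task, then sizes the OpenMP thread pool from the variable Slurm exports:

```shell
#!/bin/bash
#SBATCH --partition=EPYC
#SBATCH --account=dssc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
# Match the OpenMP thread count to the cores Slurm assigned to each
# task; fall back to 1 when the script runs outside a Slurm job.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "threads per task: $OMP_NUM_THREADS"
```

Inside the job, mpirun (or srun) then launches the two tasks while each task spawns four threads, consuming all 8 allocated cores.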
Slurm Commander¶
scom is a light TUI that helps to manage jobs in the cluster, gather information about them, and keep track of your work.
From the official repo:
SlurmCommander is a simple, lightweight, no-dependencies text-based user interface (TUI) to your cluster. It ties together multiple slurm commands to provide you with a simple and efficient interaction point with slurm.
Type scom in your shell to launch it.