What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is a cluster management and job scheduling system (a scheduler). It is one of the most widely used job schedulers in the world.
- The purpose of a scheduler like SLURM is to:
- Allocate access to resources (compute nodes) to users
- Provide a framework for starting, executing, and monitoring work (jobs)
- Arbitrate contention for resources by managing a queue of pending work
- There are many types of schedulers. For example, Sidra Medicine's HPC system uses LSF, while our QU Azure HPC uses SLURM.
- The important SLURM commands you need to be aware of are listed below (a combined usage sketch follows the list):
- sbatch
- to submit job(s) to the HPC nodes
- example: sbatch file.sh
- squeue
- to get the status of submitted jobs
- example: squeue -j jobID
- scancel
- to cancel/terminate currently running/submitted jobs
- example: scancel jobID
- sinfo
- Get information about the available HPC nodes
- sacct
- Display accounting data for all jobs and job steps
- example: sacct --jobs=jobID
- example: sacct --user=username
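- To see how these commands fit together, here is a minimal sketch of a typical session; the script name (file.sh) and the job ID 12345 are placeholders, not real output:

```bash
# submit a batch script to the compute nodes; SLURM replies with a job ID
sbatch file.sh
# Submitted batch job 12345

# check the status of that job (PD = pending, R = running)
squeue -j 12345

# see which partitions and nodes are available
sinfo

# after the job finishes, review its accounting record
sacct --jobs=12345

# if something went wrong, cancel the job
scancel 12345
```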
- sbatch
- Before you can submit a job to the HPC nodes, there are a few terms you need to be aware of:
- terminal node
- this is the server that you have been using since this morning
- we navigate directories, view files, write scripts, etc. in the terminal node
- this terminal node is shared by everyone, similar to a "landing page"
- we should NEVER run jobs or bioinformatics programs in the terminal node, because a heavy job run there will affect everyone else using it
- compute node
- this is a specialized high-performance server. These servers are connected together in what we call a "cluster".
- we submit our jobs to these servers; their sole purpose is to run heavy-duty jobs/programs.
- you cannot log in to these servers
- Before you can submit a job to the compute nodes, you need to write a SLURM batch script. A sample of a script is given below:
- /obes/workshop/rozaimi/scripts/template.sh
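- If you cannot open that file right away, below is a minimal sketch of what such a template might contain, based on the directives explained in the next section; the output/error file names and the command at the end are placeholders, not the actual contents of template.sh:

```bash
#!/bin/bash
#SBATCH --nodes=1                                # allocate 1 node
#SBATCH --ntasks=1                               # launch 1 task
#SBATCH --cpus-per-task=1                        # 1 CPU per task (increase for multithreaded tools)
#SBATCH --partition=hpc                          # submit to the 'hpc' partition
#SBATCH --nodelist=biomedicalsciences-hpc-pg0-1  # run on this specific node
#SBATCH --output=myjob.out                       # standard output file (placeholder name)
#SBATCH --error=myjob.err                        # standard error file (placeholder name)

# replace this line with the bioinformatics command you want to run
echo "Hello from $(hostname)"
```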
- Explanation:
- --nodes=1
- This line tells SLURM to allocate 1 node for this job.
- --ntasks=1
- This line tells SLURM that the job will launch 1 task
- --cpus-per-task=1
- This line tells SLURM that each task will use 1 CPU. Increase this value for jobs that are multithreaded and can take advantage of multiple CPUs
- --partition=hpc
- This line tells SLURM to submit the job to the ‘hpc’ partition (a partition is a division of the nodes of the cluster)
- --nodelist=biomedicalsciences-hpc-pg0-1
- This line tells SLURM to run the job on a specific node, in this case, ‘biomedicalsciences-hpc-pg0-1’
- --output= and --error=
- bioinformatics tools often print useful information, which goes either to standard output (--output) or standard error (--error)
- if a job fails or a bioinformatics tool encounters a problem, you can usually find useful information by looking inside the standard error file
- Most of the time, you only need to change the options below (a short usage sketch follows the list):
- --cpus-per-task
- --partition
- --nodelist
- --output
- --error
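- As a sketch of that workflow (the copy destination, script name, and editor are placeholders; use whatever file name you set with --error= when checking for errors):

```bash
# copy the template into your own working directory (destination name is just an example)
cp /obes/workshop/rozaimi/scripts/template.sh my_job.sh

# edit the options listed above (--cpus-per-task, --partition, --nodelist, --output, --error)
nano my_job.sh

# submit the edited script and note the job ID that SLURM prints
sbatch my_job.sh

# if the job fails, the standard error file is usually the first place to look
cat my_job.err   # i.e. whatever file you set with --error=
```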