
What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is a cluster management and job scheduling system (scheduler). It is one of the most widely used job schedulers on HPC systems worldwide.

  • The purpose of a scheduler like SLURM is to:
    • Allocate access to resources (compute nodes) to users
    • Provide a framework for starting, executing, and monitoring work (jobs)
    • Arbitrate contention for resources by managing a queue of pending work
  • There are many types of schedulers. For example, the HPC system at Sidra Medicine uses LSF, while our QU Azure HPC uses SLURM.
  • The important SLURM commands you need to be aware of are:
    • sbatch
      • to submit job(s) to the HPC nodes
      • example: sbatch file.sh
    • squeue
      • to get the status of submitted jobs
      • example: squeue -j jobID
    • scancel
      • to cancel/terminate currently running/submitted jobs
      • example: scancel jobID
    • sinfo
      • Get information about the available HPC nodes
    • sacct
      • Display accounting data for all jobs and job steps
      • example: sacct --jobs=jobID
      • example: sacct --user=username
  • Before you can submit a job to the HPC nodes, there are a few terms you need to be aware of
    • terminal node
      • this is the server that you have been using since this morning
      • we navigate, view files, write scripts, etc. on the terminal node
      • this terminal node is shared by everyone – similar to a “landing page”
      • we should NEVER run jobs or bioinformatics programs on the terminal node. if a person runs a heavy job on the terminal node, it slows the system down for everyone else.
    • compute node
      • this is a specialized high-performance server. All of them are connected together in what we call a “cluster”.
      • we submit our jobs to these servers. their sole purpose is to run heavy-duty jobs/programs.
      • you cannot log in to these servers directly
  • Before you can submit a job to the compute nodes, you need to write a SLURM batch script. A sample of a script is given below:
    • /obes/workshop/rozaimi/scripts/template.sh
  • Explanation:
    • --nodes=1
      • This line tells SLURM to allocate 1 node for this job.
    • --ntasks=1
      • This line tells SLURM that the job will launch 1 task.
    • --cpus-per-task=1
      • This line tells SLURM that each task will use 1 CPU. Increase this value for jobs that are multithreaded and can take advantage of multiple CPUs.
    • --partition=hpc
      • This line tells SLURM to submit the job to the ‘hpc’ partition (a partition is a division of the nodes of the cluster)
    • --nodelist=biomedicalsciences-hpc-pg0-1
      • This line tells SLURM to run the job on a specific node, in this case, ‘biomedicalsciences-hpc-pg0-1’
    • --output= and --error=
      • sometimes bioinformatics tools will print useful information. this goes either to the standard output (--output) or the standard error (--error).
      • usually if a job fails or a bioinformatics tool encounters a problem, you can find useful information inside the standard error file
  • Most of the time, you only need to change:
    • --cpus-per-task
    • --partition
    • --nodelist
    • --output
    • --error