
What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is a cluster management and job scheduling system (scheduler). It is one of the most widely used job schedulers on HPC systems worldwide.

  • The purpose of a scheduler like SLURM is to:
    • Allocate access to resources (compute nodes) to users
    • Provide a framework for starting, executing, and monitoring work (jobs)
    • Arbitrate contention for resources by managing a queue of pending work
  • There are many types of schedulers. For example, the HPC system at Sidra Medicine uses LSF, while our QU Azure HPC uses SLURM.
  • The important SLURM commands you need to be aware of are:
    • sbatch
      • to submit job(s) to the HPC nodes
      • example: sbatch file.sh
    • squeue
      • to get the status of submitted jobs
      • example: squeue -j jobID
    • scancel
      • to cancel/terminate currently running/submitted jobs
      • example: scancel jobID
    • sinfo
      • Get information about the available HPC nodes
    • sacct
      • Display accounting data for all jobs and job steps
      • example: sacct --jobs=jobID
      • example: sacct --user=username
  • Before you can submit a job to the HPC nodes, there are a few terms you need to be aware of
    • terminal node
      • this is the server that you have been using since this morning
      • we navigate, view files, write scripts, etc. on the terminal node
      • this terminal node is shared by everyone – similar to a “landing page”
      • we should NEVER run jobs or bioinformatics programs on the terminal node. if a person runs a heavy job on the terminal node, it slows the system down for everyone else.
    • compute node
      • this is a specialized high-performance server. All of them are connected together in what we call a “cluster”.
      • we submit our jobs to these servers. their sole purpose is to run heavy-duty jobs/programs.
      • you cannot log in to these servers directly
  • Before you can submit a job to the compute nodes, you need to write a SLURM batch script. A sample of a script is given below:
    • /obes/workshop/rozaimi/scripts/template.sh
  • Explanation:
    • --nodes=1
      • This line tells SLURM to allocate 1 node for this job.
    • --ntasks=1
      • This line tells SLURM that the job will launch 1 task.
    • --cpus-per-task=1
      • This line tells SLURM that each task will use 1 CPU. Increase this value for jobs that are multithreaded and can take advantage of multiple CPUs.
    • --partition=hpc
      • This line tells SLURM to submit the job to the ‘hpc’ partition (a partition is a division of the nodes of the cluster)
    • --nodelist=biomedicalsciences-hpc-pg0-1
      • This line tells SLURM to run the job on a specific node, in this case, ‘biomedicalsciences-hpc-pg0-1’
    • --output= and --error=
      • sometimes bioinformatics tools will print useful information. this goes either to the standard output (--output) or the standard error (--error).
      • usually if a job fails or a bioinformatics tool encounters a problem, you can find useful information inside the standard error file
  • Most of the time, you only need to change:
    • --cpus-per-task
    • --partition
    • --nodelist
    • --output
    • --error