Using the SLURM scheduling system on Pi

SLURM (Simple Linux Utility for Resource Management) is a scalable workload manager widely adopted by national supercomputer centers worldwide. It is free and open-source software, released under the GNU General Public License.

We are conducting a trial of SLURM on a small portion of the resources of the SJTU Pi supercomputer. We hope SLURM can replace LSF as the default job scheduler in March 2016. This document will assist you in managing jobs via SLURM.

The maximum walltime of a job on SLURM is limited to 24 hours. SLURM commands are available on all three login nodes: mu05, mu06 and mu07. More job samples are available here, and slides from the SJTU SLURM seminar can be found here. To use software modules optimized for SLURM, please load the proper environment module path:

$ unset MODULEPATH
$ module use /lustre/usr/modulefiles/pi/

Feel free to contact support@lists.hpc.sjtu.edu if we can be of any assistance.

SLURM Overview

LSF             SLURM             Function
                sinfo             Cluster state
bjobs           squeue            Queued job state
bsub            sbatch            Job submission
bjobs, bqueues  scontrol          Monitor and modify jobs
bacct           sacct             Reports for completed jobs
                sreport           Reports for job usage and cluster utilization
bkill [JOB_ID]  scancel [JOB_ID]  Job deletion
                sview, smap       SLURM UI
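
For example, a typical submit, check, cancel cycle with the commands above might look like the following sketch (the script name job.slurm and job ID 345 are only illustrative):

$ sbatch job.slurm
Submitted batch job 345
$ squeue -j 345
$ scancel 345
$ sacct -j 345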

sinfo: check cluster status

LSF            SLURM                          Function
bhosts         sinfo -N                       Show node-level info
               sinfo -N --states=idle,alloc   Show nodes in the given state(s)
bqueues QUEUE  sinfo --partition=QUEUE        Show info for partition QUEUE
               sinfo --help                   Show all options

Node states include drain (something is wrong with the node), alloc (in use), idle (available), and down (unavailable).

To check the overall resource state:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up 1-00:00:00      1  drain node001
cpu*         up 1-00:00:00     31  alloc node[002-032]
gpu          up 1-00:00:00      4  alloc gpu[47-50]
fat          up 1-00:00:00      2  alloc fat[19-20]
k40          up 1-00:00:00      2  alloc mic[01-02]
k40          up 1-00:00:00      2   idle mic[03-04]
fail         up 2-00:00:00      1  down* node222

To check resource state on the cpu partition:

$ sinfo -p cpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite     32  down* node[001-032]

To check resource state at the host level:

$ sinfo -N
NODELIST        NODES  PARTITION  STATE
fat[15-18]          4  upc        alloc
fat[19-20]          2  fat        alloc
gpu[45-46]          2  gpu        drain
gpu[47-50]          4  gpu        alloc
mic[01-04]          4  k40        alloc
mic05               1  mic        idle
node001             1  cpu*       drain
node[002-032]      31  cpu*       alloc
node222             1  fail       down*

squeue: check the state of queued jobs

LSF                 SLURM                        Function
bjobs JOB_ID        squeue -j JOB_ID             Show info about JOB_ID
bjobs -l            squeue -l                    Show detailed info
bjobs -m HOST       squeue -w HOST               Show jobs allocated to the specified HOST
                    squeue -A ACCOUNT_LIST       Show jobs of the accounts in ACCOUNT_LIST
bjobs -u USER_LIST  squeue -u USER_LIST          Show USER_LIST's jobs
bjobs -r|-p|-s      squeue --states=R|PD|CG|CD   Show jobs in specific states
                    squeue --start               List estimated start times for pending jobs
                    squeue --format="LAYOUT"     Customize squeue output with the given LAYOUT
bjobs -h            squeue --help                Show all options

Job states include R(running), PD(pending), CG(completing) and CD(completed).

By default, squeue shows queued or running jobs of all users.

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  331       cpu     bash hpcinter  R   17:50:58      1 node143
  339       k80   lammps  hpceric  R    1:10:33      1 gpu50
  340       k80   lammps  hpceric  R      30:15      1 gpu49

To list only your own jobs:

$ squeue -u `whoami`
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  339       k80   lammps  hpceric  R    1:10:33      1 gpu50
  340       k80   lammps  hpceric  R      30:15      1 gpu49

The -l option adds more details to the squeue output.

$ squeue -u `whoami` -l
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  339       k80   lammps  hpceric  R    1:10:33      1 gpu50
  340       k80   lammps  hpceric  R      30:15      1 gpu49
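
The --start and --format options listed in the table are not shown above. As a rough sketch, the following checks the estimated start time of your pending jobs and customizes the output columns (the format string is just an example layout):

$ squeue -u `whoami` --states=PD --start
$ squeue -u `whoami` --format="%.10i %.9P %.20j %.8T %.10M %R"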

sbatch: job submission

Preparing a job script and then submitting it with sbatch is the most common way to use SLURM. To feed a job script to the scheduler, use

$ sbatch jobscript.slurm

whereas LSF uses

$ bsub < jobscript.lsf

SLURM has a rich set of parameters. Here are the most commonly used ones.

LSF                                SLURM                        Meaning
-n [count]                         -n [count]                   Total processes
-R "span[ptile=count]"             --ntasks-per-node=[count]    Processes per host
-q [queue]                         -p [partition]               Job queue/partition
-J [name]                          --job-name=[name]            Job name
-o [file_name]                     --output=[file_name]         Standard output file
-e [file_name]                     --error=[file_name]          Standard error file
-W [hh:mm]                         --time=[dd-hh:mm:ss]         Max walltime
-x                                 --exclusive                  Use the hosts exclusively
                                   --mail-type=[type]           Notification type
-u [mail_address]                  --mail-user=[mail_address]   Email for notification
-m [nodes]                         --nodelist=[nodes]           Job host preference
-R "hname!=hosta && hname!=hostb"  --exclude=[nodes]            Job hosts to avoid
-w 'state(job_id)'                 --dependency=[state:job_id]  Job dependency
-J "name[array_spec]"              --array=[array_spec]         Job array

Here is a job script named cpu.slurm. It requests one core on the cpu partition, sets the walltime limit to 10 seconds, and sends an email notification at job completion. The command executed by this job is /bin/hostname.

#!/bin/bash
#SBATCH --job-name=hostname
#SBATCH --partition=cpu
#SBATCH -n 1
#SBATCH --mail-type=end
#SBATCH --mail-user=YOU@EMAIL.COM
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --time=00:00:10

/bin/hostname

This job can be submitted to SLURM via

$ sbatch cpu.slurm

squeue can be used to check the job status. You can log in to the compute nodes via SSH while the job is running. Output is written in real time to the files [jobid].out and [jobid].err.

A more complex resource requirement is illustrated below: 64 processes are started in total, with 16 processes per host.

#!/bin/bash
#SBATCH --job-name=LINPACK
#SBATCH --partition=cpu
#SBATCH -n 64
#SBATCH --ntasks-per-node=16
#SBATCH --mail-type=end
#SBATCH --mail-user=YOU@EMAIL.COM
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --time=00:20:00
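
# The application command itself goes here; "./xhpl" below is only a
# hypothetical binary name, substitute your own MPI executable.
srun ./xhpl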

The following job requests 8 GPU cards, with one CPU process managing one GPU card. Since each GPU node has 2 GPU cards, 2 CPU processes are started per node.

#!/bin/bash
#SBATCH --job-name=GPU_HPL
#SBATCH --partition=k40
#SBATCH -n 8
#SBATCH --ntasks-per-node=2
#SBATCH --exclusive
#SBATCH --mail-type=end
#SBATCH --mail-user=YOU@MAIL.COM
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --time=00:30:00
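
# The application command goes here; "./gpu_hpl" below is only a
# hypothetical binary name, substitute your own GPU-enabled executable.
srun ./gpu_hpl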

The following job starts a 3-task array (indices 0 to 2), each task requiring 1 CPU core. For Python on Pi, please consult our Python document.

#!/bin/bash

#SBATCH --job-name=python_array
#SBATCH --mail-user=YOU@MAIL.COM
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --array=0-2
#SBATCH --output=python_array_%A_%a.out
#SBATCH --error=python_array_%A_%a.err

source /usr/share/Modules/init/bash
unset MODULEPATH
module use /lustre/usr/modulefiles/pi
module purge
module load gcc openblas python/2.7

VIRTUAL_ENV_DISABLE_PROMPT=1
source ~/python27-hpc-gcc/bin/activate

echo "SLURM_JOBID: " $SLURM_JOBID
echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
echo "SLURM_ARRAY_JOB_ID: " $SLURM_ARRAY_JOB_ID

python < vec_${SLURM_ARRAY_TASK_ID}.py
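
The --dependency parameter from the table above lets one job wait for another. A minimal sketch, assuming two job scripts pre.slurm and post.slurm and an illustrative job ID:

$ sbatch pre.slurm
Submitted batch job 1001
$ sbatch --dependency=afterok:1001 post.slurm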

srun: launch interactive jobs

srun can launch interactive jobs. The command blocks until the job completes or is terminated. For example, run hostname on a compute node:

$ srun -N1 -n1 hostname
node216

Launch the bash shell on a remote host.

$ srun -N1 -n1 /bin/bash
hostname
node216
free
             total       used       free     shared    buffers     cached
Mem:      65903880    1885592   64018288          0     154420     184700
-/+ buffers/cache:    1546472   64357408
Swap:     65535992      16848   65519144
CTRL-D
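
For a fully interactive shell with a proper pseudo-terminal (prompt, editors, and so on), srun also accepts the --pty option:

$ srun -N1 -n1 --pty /bin/bash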

scontrol: monitor and modify queued jobs

LSF             SLURM                                              Function
bjobs JOB_ID    scontrol show job JOB_ID                           Show info for a queued or running job
                scontrol -dd show job JOB_ID                       Show the batch job script
bstop JOB_ID    scontrol hold JOB_ID                               Hold (pause) JOB_ID
bresume JOB_ID  scontrol release JOB_ID                            Release (resume) JOB_ID
                scontrol update JobId=JOB_ID TimeLimit=1-12:00:00  Change walltime to 1 day 12 hours (pending jobs only)
                scontrol update JobId=JOB_ID Dependency=afterany:DEP_JOB_ID  Start JOB_ID only after DEP_JOB_ID completes
                scontrol --help                                    Show all options
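
For example, to inspect, hold and then release a queued job (the job ID 340 is illustrative):

$ scontrol show job 340
$ scontrol hold 340
$ scontrol release 340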

sacct: view job accounting info

LSF                   SLURM                     Function
bacct -l              sacct -l                  Show detailed accounting info
                      sacct -A ACCOUNT_LIST     Show accounting info for the accounts in ACCOUNT_LIST
bacct -u USER_NAME    sacct -u USER_NAME        Show USER_NAME's accounting info
bacct -u all          sacct --allusers          Show all users' job accounting info
                      sacct --state=R|PD|CG|CD  Show accounting info for jobs in specific states
bacct -S time0,time1  sacct -S YYYY-MM-DD       Select jobs in any state after the specified time
                      sacct --format="LAYOUT"   Customize sacct output with the given LAYOUT
bacct -h              sacct --help              Show all options

By default, sacct displays accounting info for the past 24 hours.

$ sacct

Display more information:

$ sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 3224

sview and smap: SLURM's graphical interfaces

SLURM has a built-in GUI named sview. It can be invoked from within an X Window System session.

$ ssh -Y mu05
mu05$ sview -i 2 &

[Screenshot: the sview GUI]

smap displays similar information in the terminal. smap -i 2 refreshes every two seconds; press the q key to exit.

$ smap -i 2

sreport: generate reports for jobs or clusters

SLURM                                                       Function
sreport cluster utilization                                 Show cluster utilization report
sreport user top                                            Show top 10 cluster users based on total CPU time in the past 24 hours
sreport cluster AccountUtilizationByUser start=2014-12-01   Show account usage per user since Dec 1, 2014
sreport job sizesbyaccount PrintJobCount                    Show the number of jobs run on a per-account basis
sreport --help                                              Show all options

By default, sreport uses statistics in the past 24 hours.
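
To report on a different period, pass explicit start and end times (the dates below are illustrative):

$ sreport cluster utilization start=2016-03-01 end=2016-03-08
$ sreport user top start=2016-03-01 end=2016-03-08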

SLURM Environment Variables

LSF                 SLURM                    Function
$LSB_JOBID          $SLURM_JOB_ID            Job ID
$LSB_JOBNAME        $SLURM_JOB_NAME          Job name
$LSB_QUEUE          $SLURM_JOB_PARTITION     Name of the queue/partition
$LSB_DJOB_NUMPROC   $SLURM_NTASKS            Total number of processes
                    $SLURM_NTASKS_PER_NODE   Number of tasks requested per node
                    $SLURM_JOB_NUM_NODES     Number of nodes
$LSB_HOSTS          $SLURM_JOB_NODELIST      List of allocated nodes
                    $SLURM_LOCALID           Node-local task ID of the process within the job
$LSB_JOBINDEX       $SLURM_ARRAY_TASK_ID     Task ID within a job array
$LSB_SUBCWD         $SLURM_SUBMIT_DIR        Working directory at submission
$LSB_SUB_HOST       $SLURM_SUBMIT_HOST       Hostname from which the job was submitted
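
As a small sketch, the following job script prints several of these variables (the partition and resource requests are only illustrative):

#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --partition=cpu
#SBATCH -n 2
#SBATCH --time=00:01:00
#SBATCH --output=%j.out

echo "Job ID:      $SLURM_JOB_ID"
echo "Job name:    $SLURM_JOB_NAME"
echo "Partition:   $SLURM_JOB_PARTITION"
echo "Total tasks: $SLURM_NTASKS"
echo "Node list:   $SLURM_JOB_NODELIST"
echo "Submit dir:  $SLURM_SUBMIT_DIR"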
