FNAL - LQCD Documentation

SLURM Batch System

SLURM (Simple Linux Utility for Resource Management) is a powerful open-source, fault-tolerant, and highly scalable resource manager and job scheduling system, currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, SLURM is now used extensively on many of the Top 500 supercomputers around the globe.

  1. Commands
  2. User Accounts
  3. Resource Types
  4. Using SLURM: examples
  5. SLURM Reporting
  6. Comparison between PBS/Torque and SLURM
  7. SLURM Environment Variables
  8. Binding and Distribution of Tasks
  9. More Information

The following graphic depicts the batch and filesystem layout of the Fermilab LQCD clusters.

SLURM Commands

  • Job control and monitoring are performed with scontrol and squeue.
  • Batch jobs are submitted with sbatch.
  • Interactive job sessions are requested with salloc.
  • Jobs and job steps are launched with srun.
  • Node information and cluster status may be requested with sinfo.
  • Job and job-step accounting data can be accessed with sacct.
  • Useful environment variables include $SLURM_NODELIST and $SLURM_JOBID.
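
For example, the following commands show cluster status, your queued jobs, and accounting data for a completed job (the job ID 46 below is illustrative; it matches the batch example later on this page):

[@lattice ~]$ sinfo
[@lattice ~]$ squeue -u $USER
[@lattice ~]$ scontrol show job 46
[@lattice ~]$ sacct -j 46 --format=JobID,JobName,Partition,Elapsed,State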

SLURM User Accounts

To check your "default" SLURM account, use the following command:

[@lattice ~]$ sacctmgr list user name=johndoe
      User   Def Acct     Admin
---------- ---------- ---------
   johndoe   projectx      None

To list "all" the SLURM accounts you are associated with, use the following command:

[@lattice ~]$ sacctmgr list user name=johndoe withassoc
      User   Def Acct     Admin    Cluster    Account  Partition     
---------- ---------- --------- ---------- ---------- ---------- 
   johndoe  projectx      None     wilson      alpha
   johndoe  projectx      None     wilson       beta
   johndoe  projectx      None     wilson      gamma

NOTE: If you do not specify an account name at job submission (using --account), the default account will be used to track usage.
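
For example, to charge a job to one of your other associations rather than the default (the account name "beta" is taken from the listing above):

[@lattice ~]$ sbatch --account=beta myscript.sh

The same can be done inside the batch script itself with the directive "#SBATCH --account=beta".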

SLURM Resource Types

There are currently four types of CPU resources and one type of GPU resource that can be requested.

Partition       Resource   Nodes       Tasks per node        GPUs per    Description
(--partition)   type       (--nodes)   (--ntasks-per-node)   node
                                                             (--gres)
-------------   --------   ---------   -------------------   ---------   ------------------------------------------------------------
pi              CPU        314         16                    -           2.6GHz Intel E5-2650v2 "Ivy Bridge", 128GB memory per node
                                                                         (8GB/core), QDR Infiniband
pigpu           GPU        32          16                    4           2.6GHz Intel E5-2650v2 "Ivy Bridge", 128GB memory per node
                                                                         (8GB/core), QDR Infiniband, 4 NVidia Tesla K40m GPUs per node
bc              CPU        224         32                    -           2.8GHz AMD 6320 Opteron, 64GB memory per node (2GB/core),
                                                                         QDR Infiniband
ds              CPU        196         32                    -           2GHz AMD 6128 Opteron, 64GB memory per node (2GB/core),
                                                                         QDR Infiniband
exdsg           CPU        20          8                     -           2.53GHz Intel Xeon E5630, 48GB memory per node (6GB/core),
                                                                         QDR Infiniband

LIVE cluster status is available at http://www.usqcd.org/fnal/clusterstatus

Using SLURM: examples

  • Submit an interactive job requesting 12 "pi" nodes
    [@lattice:~]$ srun --pty --nodes=12 --ntasks-per-node=16 --partition pi bash
    [user@pi111:~]$ env | grep NTASKS
    SLURM_NTASKS_PER_NODE=16
    SLURM_NTASKS=192
    [user@pi111:~]$ exit
    

  • Submit an interactive job requesting two "pigpu" nodes (4 GPUs per node)
    [@lattice:~]$ srun --pty --nodes=2 --partition pigpu --gres=gpu:4 bash
    [@pig607:~]$ PBS_NODEFILE=`generate_pbs_nodefile`
    [@pig607:~]$ rgang --rsh=/usr/bin/rsh $PBS_NODEFILE nvidia-smi -L
    pig607= 
    GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
    GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
    GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
    GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
    pig608= 
    GPU 0: Tesla K40m (UUID: GPU-b20a4059-56c2-b36a-ba31-1403fa6de2dc)
    GPU 1: Tesla K40m (UUID: GPU-af290605-caeb-50e8-a4ca-fd533098c302)
    GPU 2: Tesla K40m (UUID: GPU-16ab19e4-9835-5eb2-9b8b-1e479753d20b)
    GPU 3: Tesla K40m (UUID: GPU-2b3d082e-3113-617a-dcc6-26eee33e3b2d)
    [@pig607:~]$ exit
    

  • Submit a batch job requesting 4 GPUs, i.e., one "pigpu" node
    [@lattice ~]$ cat myscript.sh
    #!/bin/sh
    #SBATCH --job-name=test
    #SBATCH --partition=pigpu
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:4
    
    nvidia-smi -L
    sleep 5
    exit
    
    [@lattice ~]$ sbatch myscript.sh
    Submitted batch job 46
    
    [@lattice ~]$ squeue
                  JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
                     46     pigpu     test   amitoj  R       0:03     1 pig607
    

    Once the batch job completes, the output is available as follows:

    [@lattice ~]$ cat slurm-46.out
    GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
    GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
    GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
    GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
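
  • A more complete CPU batch script, shown here as a sketch: the account, walltime, e-mail address, and executable are placeholders to adapt, and whether you launch with srun or your MPI library's own launcher depends on how the application was built.
    [@lattice ~]$ cat mpi_job.sh
    #!/bin/sh
    #SBATCH --job-name=mpi_test
    #SBATCH --account=projectx
    #SBATCH --partition=pi
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=08:00:00
    #SBATCH --output=mpi_test.o%j
    #SBATCH --mail-type=end
    #SBATCH --mail-user=johndoe@example.com

    # 4 nodes x 16 tasks per node = 64 MPI ranks
    srun ./my_mpi_executable

    [@lattice ~]$ sbatch mpi_job.sh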
    

SLURM Reporting

sreport is used to generate reports of job usage and cluster utilization for SLURM jobs saved to the SLURM Database. sreport should be run on the host lattice.fnal.gov. A few worthwhile examples follow:

[@lattice ~]$ sreport cluster AccountUtilizationByUser user=johndoe start=2018-03-01 -t percent
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-04-25T23:59:59 (4834800 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy
--------- --------------- --------- --------------- -------- --------
     lqcd           alpha   johndoe        John Doe   56.27%    0.00%

[@lattice ~]$ sreport user Top Start=04/20/18 End=04/26/18 -t percent
--------------------------------------------------------------------------------
Top 10 Users 2018-04-20T00:00:00 - 2018-04-25T23:59:59 (518400 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account     Used   Energy
--------- --------- --------------- --------------- -------- --------
     lqcd   johndoe        John Doe           alpha   56.27%    0.00%
     lqcd   janedoe        Jane Doe          matter   30.16%    0.00%
.....
<--snipped--->
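
Per-job accounting for your own jobs is also available through sacct; for example (the start date below is illustrative):

[@lattice ~]$ sacct --starttime=2018-04-20 --format=JobID,JobName,Partition,Account,AllocNodes,Elapsed,State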

Comparison between PBS/Torque and SLURM

  submit command
      PBS/Torque:  qsub
      SLURM:       sbatch, srun, salloc

  walltime request
      PBS/Torque:  #PBS -l walltime=hh:mm:ss
      SLURM:       #SBATCH --time=hh:mm:ss   (or -t hh:mm:ss)

  specific node request
      PBS/Torque:  #PBS -l nodes=X:ppn=Y:gpus=2
      SLURM:       #SBATCH --nodes=X             (or -N X)
                   #SBATCH --cpus-per-task=Y     (or -c Y)   [for OpenMP or hybrid code]
                   #SBATCH --ntasks-per-node=Y               [to set an equal number of tasks per node, for MPI]
                   #SBATCH --ntasks=Z            (or -n Z)   [to set the total number of tasks]
                   #SBATCH --gres=gpu:2

  define memory
      PBS/Torque:  #PBS -l mem=Zgb
      SLURM:       #SBATCH --mem=Zgb

  define number of procs/node
      SLURM:       #SBATCH -c <# of cpus/task>                [for OpenMP/hybrid jobs]
                   #SBATCH -n <# of total tasks/processors>   [for MPI jobs]

  queue request
      PBS/Torque:  #PBS -q gpu
      SLURM:       #SBATCH -p gpu

  group account
      PBS/Torque:  #PBS -A <Account>
      SLURM:       #SBATCH -A <Account>

  job name
      PBS/Torque:  #PBS -N <name>
      SLURM:       #SBATCH -J <name>

  output file name
      PBS/Torque:  #PBS -o <filename>
      SLURM:       #SBATCH -o <name>.o%j   (where %j is the job ID)

  email option
      PBS/Torque:  #PBS -m e
      SLURM:       #SBATCH --mail-type=end   (options: begin, end, fail, all)

  email address
      PBS/Torque:  #PBS -M <email address>
      SLURM:       #SBATCH --mail-user=<email>

  count processors
      PBS/Torque:  NPROCS=`wc -l < $PBS_NODEFILE`
      SLURM:       NPROCS=$(( $SLURM_NNODES * $SLURM_CPUS_PER_TASK ))   # for OpenMP and hybrid (MPI + OpenMP) jobs
                   NPROCS=$SLURM_NPROCS or $SLURM_NTASKS                # for MPI jobs
                   Note: -N 4 -n 16 gives 16 processors, NOT 4*16=64 processors.
                   If $PBS_NODEFILE is needed, include the following lines:
                       PBS_NODEFILE=`generate_pbs_nodefile`
                       NPROCS=`wc -l < $PBS_NODEFILE`

  starting directory on the compute node
      PBS/Torque:  user home directory
      SLURM:       the working (submit) directory

  node names
      PBS/Torque:  piXXX or pigXXX
      SLURM:       piXXX or pigXXX

  interactive job request
      PBS/Torque:  qsub -I -X
      SLURM:       srun --pty /bin/bash

  reserve the node exclusively
      SLURM:       srun --exclusive --pty /bin/bash

  dependency
      PBS/Torque:  #PBS -d <jobid>
      SLURM:       #SBATCH -d after:<jobid>
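
As an illustration of the mapping above, here is a minimal sketch of a Torque batch script and its SLURM equivalent; the job name, account, node counts, and executable are placeholders.

A PBS/Torque script:

    #PBS -N myjob
    #PBS -A projectx
    #PBS -q pi
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR
    NPROCS=`wc -l < $PBS_NODEFILE`
    mpirun -np $NPROCS ./my_executable

and its SLURM counterpart:

    #!/bin/sh
    #SBATCH --job-name=myjob
    #SBATCH --account=projectx
    #SBATCH --partition=pi
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=01:00:00
    # SLURM starts the job in the submit directory, so no cd is needed
    srun ./my_executable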

SLURM Environment Variables

Slurm Job Environment Variables

Slurm Variable Name      Description                                   Example value         PBS/Torque analog
--------------------     -------------------------------------------   -------------------   ------------------
$SLURM_JOB_ID            Job ID                                        5741192               $PBS_JOBID
$SLURM_JOBID             Deprecated; same as $SLURM_JOB_ID
$SLURM_JOB_NAME          Job name                                      myjob                 $PBS_JOBNAME
$SLURM_SUBMIT_DIR        Submit directory                              /project/charmonium   $PBS_O_WORKDIR
$SLURM_JOB_NODELIST      Nodes assigned to the job                     pi1[01-05]            cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST       Host submitted from                           lattice.fnal.gov      $PBS_O_HOST
$SLURM_JOB_NUM_NODES     Number of nodes allocated to the job          2                     $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE      Number of cores per node                      8,3                   $PBS_NUM_PPN
$SLURM_NTASKS            Total number of cores for the job             11                    $PBS_NP
$SLURM_NODEID            Index of the node, relative to the            0                     $PBS_O_NODENUM
                         nodes assigned to the job
$SLURM_LOCALID           Index of the core within the node             4                     $PBS_O_VNODENUM
$SLURM_PROCID            Index of the task relative to the job         0                     $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID     Job array index                               0                     $PBS_ARRAYID
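
A minimal sketch of a batch script that echoes a few of these variables into its output file (the partition and node counts are illustrative):

    #!/bin/sh
    #SBATCH --job-name=envtest
    #SBATCH --partition=pi
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16

    echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) submitted from ${SLURM_SUBMIT_HOST}"
    echo "Running on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}"
    echo "Total tasks: ${SLURM_NTASKS}; submit directory: ${SLURM_SUBMIT_DIR}"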

Binding and Distribution of Tasks
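
srun controls how tasks are pinned to cores and spread across the allocated nodes through its --cpu-bind and --distribution (-m) options. A minimal sketch, with an illustrative partition, task count, and executable name:

    [@lattice ~]$ srun --partition=pi --nodes=2 --ntasks-per-node=16 \
                       --cpu-bind=cores --distribution=block:cyclic ./my_executable

Here block:cyclic places consecutive tasks on the same node (block over nodes) while cycling them over the sockets within each node; see "man srun" for the full set of binding and distribution choices.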

More Information

usqcd-webmaster@usqcd.org