
Dispatch priority under Slurm

The Software Program Committee allocates resources for each program year. Each project is granted a certain number of hours on various resources, such as CPU core-hours, GPU-hours, or time on an Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage project access to partitions and job dispatch priority. This is all part of maintaining fair-share usage of allocated resources.

Slurm prioritization

The only prioritization managed by Slurm is the dispatch, or scheduling, priority. All users submit their jobs to Slurm to run on a particular resource, such as a partition. On a billable (allocated) partition, projects that still have allocated time available should run before those that do not, regardless of whether the allocation is Type A, B, or C. An unallocated project is said to be running opportunistically. On a non-billable (unallocated) cluster, all projects are treated equally.

Partitions at Fermilab

We currently have a single partition within the Fermilab Lattice QCD Computing Facility: the LQ1 cluster's 'lq1csl' CPU computing partition, which is billable against an allocation. Limits are in place to ensure that at least two (in some cases three) projects are active on the partition at any given time.

LQ1 cluster - Submit host: lq.fnal.gov

Name    Description        Billable  TotalNodes  MaxNodes  MaxTime     DefaultTime
lq1csl  LQ1 CPU resources  Yes       183         88        1-00:00:00  8:00:00
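The partition limits above can also be inspected directly from the submit host with Slurm's standard scontrol command; a minimal sketch (the exact fields printed depend on the Slurm version installed):

```shell
# Show the configuration of the lq1csl partition, including the
# MaxNodes, MaxTime, and DefaultTime limits set by the admins.
scontrol show partition lq1csl
```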

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (or Quality of Service) configuration. Admins assign parameters to a QoS that are used to manage dispatch priority and resource use limits. Additional limits can be defined at the Account or Partition level.

Name    Description                Priority  GrpTRES  MaxWall   MaxJobsPU  MaxSubmitPA
admin   admin testing              600       -        -         -          -
test    quick tests of scripts     500       cpu=32   00:30:00  1          3
normal  Normal QoS (default)       300       -        -         -          250
opp     unallocated/opportunistic  0         -        08:00:00  -          125
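The QoS parameters in the table can be listed on the cluster with Slurm's sacctmgr command; a sketch using the same columns as above (these are standard sacctmgr format fields):

```shell
# List the defined QoS and the limits shown in the table above.
sacctmgr show qos format=Name,Priority,GrpTRES,MaxWall,MaxJobsPU,MaxSubmitPA
```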

The default QoS for all allocated projects using a billable partition is called normal. The default QoS for all jobs on a non-billable partition is called opp. Jobs in the opp QoS are all dispatched at the same priority but will not start while normal jobs are waiting in the queue. Both QoS run with a default wall-time of 8 hours; normal jobs may request up to the partition MaxTime of 24 hours, while opp jobs are capped at a MaxWall of 8 hours.

We have defined a test QoS for users to run small test jobs to verify that their scripts work and their programs run as expected. These test jobs run at a relatively high priority so that they start as soon as nodes are available. Any user may have no more than three test jobs submitted and no more than one running at any given time. Test jobs are limited to 30 minutes of wall-time and just two nodes (limit cpu=32).
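Selecting the test QoS is done with sbatch's standard --qos option; a minimal sketch, where the script name is a placeholder and the resource requests simply match the test limits above:

```shell
# Submit a short two-node job under the 'test' QoS on the lq1csl partition.
# 'my_test_script.sh' is a placeholder for your own batch script.
sbatch --qos=test --partition=lq1csl --nodes=2 --time=00:30:00 my_test_script.sh
```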

The billable partitions also have a QoS named opp for opportunistic, or unallocated, running. This QoS has a priority of 0 (zero) and a wall-time limit of just 8 hours. Opportunistic jobs run only when nodes would otherwise sit idle. When a project uses up all of the hours allocated to it for the program year, its jobs are limited to the opp QoS.

Slurm commands to see current priorities

To see the list of jobs currently in queue by partition, visit our cluster status web page. Click on the "Start Time" column header to sort the table by start time. For running jobs, this is the actual time that the jobs started. Following that are the Pending jobs in the predicted order they will start.

From a command line, Slurm's 'squeue' command lists the jobs that are queued, including running jobs as well as those waiting to be started, aka dispatched. By changing the format of the command's output, one can display additional information, such as:

  • Start time - actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • The job's calculated dispatch priority, in real time

The following is just a sample output. Use your project name after the "-A" option to get a listing of jobs for your account.

[amitoj@lq ~]$ squeue -o "%.8a %.8u %.6i %.12j %.12P %.8q %.6Q %.2t %.s %.10S %.10l %R" --sort='-p' -A chiqcd
 ACCOUNT     USER  JOBID         NAME    PARTITION      QOS PRIORI ST  START_TIME TIME_LIMIT NODELIST(REASON)
  chiqcd  genwang  33013   48.0.00002       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33014   48.4.00402       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33015   48.8.00802       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33016  48.12.01202       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33017   48.0.00003       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33018   48.4.00403       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33019   48.8.00803       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
  chiqcd  genwang  33020  48.12.01203       lq1csl   normal 119474 PD  2020-03-04   18:00:00 (AssocGrpNodeLimit)
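To see how a pending job's dispatch priority is composed, Slurm's standard sprio command breaks it down into its weighted factors (age, fair-share, QoS, and so on); squeue's --start option shows expected start times. A sketch, reusing the job ID and account from the sample output above:

```shell
# Long-format breakdown of the priority factors for one pending job.
sprio -l -j 33013

# Expected start times for the pending jobs of one account.
squeue --start -A chiqcd
```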

Fermi National Accelerator Laboratory
Managed by Fermi Research Alliance, LLC
for the U.S. Department of Energy Office of Science