
Dispatch priority under Slurm

The Software Program Committee allocates resources during each program year. Each project is allocated a certain amount of time on the various resources, such as CPU core hours, GPU hours, or time on an Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage project access to partitions and job dispatch priority. This is all part of maintaining Fair Share usage of allocated resources.

Slurm prioritization

The only prioritization managed by Slurm is the dispatch, or scheduling, priority. All users submit their jobs to be run by Slurm on a particular resource, such as a partition. On a billable (allocated) partition, projects that still have allocated time available should run before those that do not have an allocation, regardless of whether the allocation is Type A, B, or C. An unallocated project is said to be running opportunistically.
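Every job is therefore submitted against a project account and a partition, and the account determines whether the job is treated as allocated or opportunistic. A minimal submission might look like the following sketch; the project name and job script are placeholders:

  # placeholder account and job script; the options shown are standard sbatch flags
  sbatch -A myproject -p lq1csl --nodes=4 --time=04:00:00 run_job.sh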

Partitions at Fermilab

We currently have a single partition within the Fermilab Lattice QCD Computing Facility. The LQ1 cluster has the 'lq1csl' CPU computing partition, which is billable against an allocation. Limits are in place to make sure that at least two (in some cases three) projects can be active at any given time.

LQ1 cluster - Submit host: lq.fnal.gov

Name    Description          Billable  TotalNodes  MaxNodes  MaxTime     DefaultTime
lq1csl  LQ1 CPU CascadeLake  Yes       183         64        1-00:00:00  8:00:00
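The partition limits in the table can also be read directly from Slurm on the submit host; for example:

  scontrol show partition lq1csl   # shows MaxNodes, MaxTime, DefaultTime, and other settings
  sinfo -p lq1csl                  # node counts and current node states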

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (Quality of Service) configuration. Admins assign parameters to each QoS that are used to manage dispatch priority and resource-use limits. Additional limits can be defined at the Account or Partition level.

Name    Description                Priority  GrpTRES  MaxWall   MaxJobsPU  MaxSubmitPA
admin   admin testing              600
test    quick tests of scripts     500       cpu=80   00:30:00  1          3
normal  Normal QoS (default)       250                                     125
opp     unallocated/opportunistic  10                 08:00:00             125
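The current QoS definitions can be listed directly with Slurm's sacctmgr command; the exact columns shown depend on the Slurm version and the format requested:

  sacctmgr show qos   # add format=Name,Priority,MaxWall,... to select specific columns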

The default QoS for all allocated projects is called normal. The default QoS for all projects without a current allocation is called opp (opportunistic). Jobs running in the opp QoS are all dispatched at the same priority but will not start if there are normal jobs waiting in the queue. Both of these QoS run with a default wall-time limit of 8 hours. The normal QoS has a MaxWall limit of 24 hours.
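To check which QoS values your own account may use, and which is the default, a sacctmgr query along the following lines should work (the field names can vary slightly between Slurm versions):

  sacctmgr show assoc where user=$USER format=Account,User,Partition,QOS,DefaultQOS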

We have defined a test QoS for users to run small test jobs to see that their scripts work and their programs run as expected. These test jobs run at a relatively high priority so that they will start as soon as nodes are available. Any user can have no more than three jobs submitted and no more than one job running at any given time. Test jobs are limited to 30 minutes of wall-time and just two nodes (GrpTRES limit cpu=80).
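A batch script for a quick test run might look like the following sketch; the account name and program are placeholders:

  #!/bin/bash
  #SBATCH --account=myproject   # placeholder project name
  #SBATCH --partition=lq1csl
  #SBATCH --qos=test            # high-priority test QoS
  #SBATCH --nodes=2             # stays within the cpu=80 GrpTRES limit
  #SBATCH --time=00:30:00       # test QoS MaxWall
  srun ./my_test_program        # placeholder executable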

We also have a QoS named opp for opportunistic or unallocated running. This QoS has a priority of 10 and a wall-time limit of just 8 hours. Opportunistic jobs will only run when there are nodes sitting idle. When a project uses up all of the hours it was allocated for the program year, its jobs are limited to the opp QoS.
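An opportunistic submission looks the same as an allocated one, with the opp QoS requested explicitly (or picked up as the account's default); for example, with placeholder names:

  sbatch -A myproject -p lq1csl --qos=opp --time=08:00:00 run_job.sh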

Slurm commands to see current priorities

To see the list of jobs currently in queue by partition, visit our cluster status web page. Click on the "Start Time" column header to sort the table by start time. For running jobs, this is the actual time that the jobs started. Following that are the Pending jobs in the predicted order they will start.

From the command line, Slurm's 'squeue' command lists the jobs that are queued. It includes running jobs as well as those waiting to be started, i.e. dispatched. By changing the format of the command's output, one can get a lot of additional information, such as:

  • Start time - actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Calculated dispatch real-time priority of the job

The following is just a sample output. Use your project name after the "-A" option to get a listing of jobs for your account.

--(kschu@lq)-(~)--
--(130)> squeue -o "%.8a %.8u %.6i %.12j %.12P %.8q %.6Q %.2t %.s %.10S %.10l %R" --sort='-p' -A mslight
 ACCOUNT     USER  JOBID         NAME    PARTITION      QOS PRIORI ST  START_TIME TIME_LIMIT NODELIST(REASON)
 mslight  bazavov 105194     t144b67a       lq1csl   normal 170116  R  2020-11-23    9:00:00 lq1wn[018,021,023,043,045,073-074,107,165-166,168-170]
 mslight  bazavov 105195     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105196     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105197     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105198     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
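The calculated dispatch priority of a pending job can also be broken down into its weighted factors with Slurm's sprio command, and squeue can report the scheduler's predicted start times; for example:

  sprio -l -u $USER        # long format: per-factor priority breakdown for your pending jobs
  squeue --start -u $USER  # estimated start times for your pending jobs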
