
Dispatch priority under Slurm

The Software Program Committee allocates resources during each program year. Each project is allocated a certain amount of time on the various resources, such as CPU core hours, GPU hours, or time on an Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage project access to partitions and job dispatch priority. This is all part of maintaining Fair Share usage of allocated resources.

Slurm prioritization

The only prioritization managed by Slurm is the dispatch, or scheduling, priority. All users submit their jobs to be run by Slurm on a particular resource, such as a partition. On a billable (allocated) partition, projects that still have allocated time available should run before those that do not have an allocation, regardless of whether the allocation is Type A, B, or C. An unallocated project is said to be running opportunistically.
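Every job is therefore submitted against a project account and a partition, and the account determines whether the job is treated as allocated or opportunistic. A minimal submission might look like the following sketch; the project name and job script are placeholders:

  # placeholder account and job script; the options shown are standard sbatch flags
  sbatch -A myproject -p lq1csl --nodes=4 --time=04:00:00 run_job.sh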

Partitions at Fermilab

We currently have a single partition within the Fermilab Lattice QCD Computing Facility. The LQ1 cluster has the 'lq1csl' CPU computing partition, which is billable against an allocation. Limits are in place to make sure that at least two (in some cases three) projects can be active at any given time.

LQ1 cluster - Submit host: lq.fnal.gov

Name    Description          Billable  TotalNodes  MaxNodes  MaxTime     DefaultTime
lq1csl  LQ1 CPU CascadeLake  Yes       183         64        1-00:00:00  8:00:00
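The partition limits in the table can also be read directly from Slurm on the submit host; for example:

  scontrol show partition lq1csl   # shows MaxNodes, MaxTime, DefaultTime, and other settings
  sinfo -p lq1csl                  # node counts and current node states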

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (Quality of Service) configuration. Admins assign parameters to each QoS that are used to manage dispatch priority and resource-use limits. Additional limits can be defined at the Account or Partition level.

Name    Description                Priority  GrpTRES  MaxWall   MaxJobsPU  MaxSubmitPA
admin   admin testing              600
test    quick tests of scripts     500       cpu=80   00:30:00  1          3
normal  Normal QoS (default)       250                                     125
opp     unallocated/opportunistic  10                 08:00:00             125
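The current QoS definitions can be listed directly with Slurm's sacctmgr command; the exact columns shown depend on the Slurm version and the format requested:

  sacctmgr show qos   # add format=Name,Priority,MaxWall,... to select specific columns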

The default QoS for all allocated projects is called normal. The default QoS for all projects without a current allocation is called opp (opportunistic). Jobs running in the opp QoS are all dispatched at the same priority but will not start if there are normal jobs waiting in the queue. Both of these QoS run with a default wall-time limit of 8 hours. The normal QoS has a MaxWall limit of 24 hours.
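To check which QoS values your own account may use, and which is the default, a sacctmgr query along the following lines should work (the field names can vary slightly between Slurm versions):

  sacctmgr show assoc where user=$USER format=Account,User,Partition,QOS,DefaultQOS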

We have defined a test QoS for users to run small test jobs to see that their scripts work and their programs run as expected. These test jobs run at a relatively high priority so that they will start as soon as nodes are available. Any user can have no more than three jobs submitted and no more than one job running at any given time. Test jobs are limited to 30 minutes of wall-time and just two nodes (GrpTRES limit cpu=80).
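A batch script for a quick test run might look like the following sketch; the account name and program are placeholders:

  #!/bin/bash
  #SBATCH --account=myproject   # placeholder project name
  #SBATCH --partition=lq1csl
  #SBATCH --qos=test            # high-priority test QoS
  #SBATCH --nodes=2             # stays within the cpu=80 GrpTRES limit
  #SBATCH --time=00:30:00       # test QoS MaxWall
  srun ./my_test_program        # placeholder executable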

We also have a QoS named opp for opportunistic or unallocated running. This QoS has a priority of 10 and a wall-time limit of just 8 hours. Opportunistic jobs will only run when there are nodes sitting idle. When a project uses up all of the hours it was allocated for the program year, its jobs are limited to the opp QoS.
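An opportunistic submission looks the same as an allocated one, with the opp QoS requested explicitly (or picked up as the account's default); for example, with placeholder names:

  sbatch -A myproject -p lq1csl --qos=opp --time=08:00:00 run_job.sh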

Slurm commands to see current priorities

To see the list of jobs currently in queue by partition, visit our cluster status web page. Click on the "Start Time" column header to sort the table by start time. For running jobs, this is the actual time that the jobs started. Following that are the Pending jobs in the predicted order they will start.

From the command line, Slurm's 'squeue' command lists the jobs that are queued. It includes running jobs as well as those waiting to be started, i.e. dispatched. By changing the format of the command's output, one can get a lot of additional information, such as:

  • Start time - actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Calculated dispatch real-time priority of the job

The following is just a sample output. Use your project name after the "-A" option to get a listing of jobs for your account.

--(kschu@lq)-(~)--
--(130)> squeue -o "%.8a %.8u %.6i %.12j %.12P %.8q %.6Q %.2t %.s %.10S %.10l %R" --sort='-p' -A mslight
 ACCOUNT     USER  JOBID         NAME    PARTITION      QOS PRIORI ST  START_TIME TIME_LIMIT NODELIST(REASON)
 mslight  bazavov 105194     t144b67a       lq1csl   normal 170116  R  2020-11-23    9:00:00 lq1wn[018,021,023,043,045,073-074,107,165-166,168-170]
 mslight  bazavov 105195     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105196     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105197     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
 mslight  bazavov 105198     t144b67a       lq1csl   normal 168838 PD         N/A    9:00:00 (Dependency)
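The calculated dispatch priority of a pending job can also be broken down into its weighted factors with Slurm's sprio command, and squeue can report the scheduler's predicted start times; for example:

  sprio -l -u $USER        # long format: per-factor priority breakdown for your pending jobs
  squeue --start -u $USER  # estimated start times for your pending jobs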
