FNAL - LQCD Documentation

Quick Links

  1. Submitting and Monitoring a Batch Job
  2. Example PBS Batch Script
  3. Project Accounting
  4. Queue Policy and Queue Names
  5. Managing Data Files in Batch Scripts
  6. Determining job priorities

1. Submitting and Monitoring a Batch Job

User jobs run in batch mode on the cluster's worker nodes. The batch system is the TORQUE Resource Manager, a.k.a. PBS or OpenPBS. A batch script describes the sequence of programs to be run. Batch scripts are ordinary shell scripts, written for bash, sh or csh, with additional PBS option lines, prefixed by #PBS, which the shell treats as comments. PBS options may be specified at the beginning of a batch script, or they may be given as command line options to the qsub command when the script is submitted to the batch system. Available options are described in the qsub man page.
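
For example, the resource request and project name used by the example script in Section 2,

#PBS -l nodes=3,walltime=00:01:00
#PBS -A projectName

could equally well be supplied on the command line when the script is submitted ("projectName" here is a placeholder for a valid project name):

$ qsub -l nodes=3,walltime=00:01:00 -A projectName batch-example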

The example batch script below (Section 2) specifies the required number of nodes and walltime on the #PBS -l line, so they need not be repeated as arguments to the qsub command. Use qsub to submit the example script to PBS as follows:

$ qsub batch-example
8795.lqcd.fnal.gov

Use the qstat command to view queued jobs.

$ qstat
Job id        Name          User    Time Use  S  Queue
8793.lqcd     fullqcd       steve   0         Q  workq
8795.lqcd     cpi-example   jim     00:00:01  R  workq
8796.lqcd     onia          jim     00:00:16  R  workq
8797.lqcd     bigJob        don     00:00:16  R  workq

Status "R" for job cpi-example indicates it is running. Status "Q" indicates a job is queued and waiting to run. See the qstat man pages for a description of all the command line options available.

A graphical display of cluster status and running jobs is available via the web for all the Fermilab LQCD clusters here.

2. Example PBS Batch Script

The example bash script below runs the MPI example program cpi (see: /usr/local/mvapich/examples/cpi.c) in parallel on the J/Psi cluster. The source for this script is here: batch-example and output from a run is here: cpi-example.out.

In the example shown below, Lines 2 - 9 are PBS options.

Option "-S" selects bash as the batch job shell.

Option "-N" is the name this job will have in the queue.

Option "-j oe" redirects stream stderr to stdout.

Option "-o" sets the name of the file containing stdout from the job. Job output will be moved to this file when the batch job finishes.

Option "-m n" disables email notification of job status by PBS.

In line 7, "-l nodes=3,walltime=00:01:00" lists the resources this job is requesting. A job requiring more than one node must specify a value for nodes either in the batch file or when the job is submitted. A batch job must also specify the maximum walltime it expects to take to complete. A job exceeding the requested time limit will be terminated.

The "-A projectName" option is required and specifies the name of the project to be charged for cluster resource usage. Replace "projectName" with a vaid project name that you belong to. The list of current and valid project names is available here.

See the qsub documentation for descriptions of the environment variables such as PBS_O_WORKDIR and PBS_NODEFILE that PBS sets for batch jobs.

1 #!/bin/bash
2 #PBS -S /bin/bash
3 #PBS -N cpi-example
4 #PBS -j oe
5 #PBS -o ./cpi-example.out
6 #PBS -m n
7 #PBS -l nodes=3,walltime=00:01:00
8 #PBS -A projectName
9 #PBS -q ds
10 cd ${PBS_O_WORKDIR}

11 # print identifying info for this job
12 echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"

13 export OFED_DIR=/usr/local/scidac-ofed-1.4.2/install
14 export QMP_DIR=$OFED_DIR/qmp
15 export QLA_DIR=$OFED_DIR/qla
16 export QIO_DIR=$OFED_DIR/qio
17 export QDP_DIR=$OFED_DIR/qdp
18 export MPI_DIR=/usr/local/mvapich-new
19 export GCC_DIR=/usr/local/gcc-4.3.2
20 export MILC_DIR=/usr/local/MILC

21 # determine number of cores per host
22 coresPerNode=`cat /proc/cpuinfo | grep -c processor`
23 # count the number of nodes listed in PBS_NODEFILE
24 nNodes=$[`cat ${PBS_NODEFILE} | wc --lines`]
25 (( nNodes= 0 + nNodes ))
26 (( nCores = nNodes * coresPerNode ))
27 echo "NODEFILE nNodes=$nNodes ($nCores cores):"
28 cat ${PBS_NODEFILE}

29 # Always stage (copy) any large input files from the cluster head node
30 # or Lustre to your job's control worker node.
31 # All worker nodes have attached disk storage in /scratch
32 # Copy of user data from Lustre as shown below is commented since there are no files to transfer.
33 # cp /lqcdproj/myDataArea/myInputDataFile /scratch

34 echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
35 #The directory the job was submitted from is $PBS_O_WORKDIR.
36 echo ${PBS_O_WORKDIR}

37 cp ${MPI_DIR}/examples/cpi ${PBS_O_WORKDIR}
38 application=${PBS_O_WORKDIR}/cpi

39 echo
40 cpus=$nCores
41 echo "=== Run MPI application on $nCores cores (e.g. 32 cores per node on the ds cluster) ==="
42 $MPI_DIR/bin/mpirun_rsh -np $nCores $application

43 # Always copy any large result files you want to keep back
44 # to the head node or Lustre before exiting your script. The /scratch area on the
45 # workers is wiped clean between jobs.
46 #
47 # Copy below is commented since there are no files to transfer
48 # cp /scratch/myOutputDataFile /lqcdproj/myDataArea

49 exit

3. Project Accounting

Effective Wednesday, August 24th, 2005, all authorized users who submit batch jobs to the Fermilab Lattice QCD clusters are required to specify their project name. An example of a command line job submission would be:

$ qsub -A project_name ...

or in a PBS job script file:

#!/bin/bash
#PBS -A project_name

mpirun -np $cpus $application
echo

exit

The PBS batch queue system will not accept jobs that are missing the -A project option:

$ qsub -l nodes=2 pbs.script
qsub: Invalid Account

Attempts to submit a job with an invalid project name will likewise be rejected:

$ qsub -A invalid_project_name ...
qsub: Invalid Account

The allocation table lists all current and valid project names. Project names must appear in a job submission exactly as listed in the allocation table (they are case sensitive).

4. Queue Policy and Queue Names

Users should submit jobs on the appropriate head node into one of the following 13 queues, depending upon which cluster is required and which run policy fits their needs:

Head Node      Cluster   Normal   Background   Test
pi0.fnal.gov   pi0       pi       low_pi       test_pi
pi0.fnal.gov   pi0g      gpu      -            test_gpu
bc1.fnal.gov   Bc        bc       low_bc       test_bc
ds1.fnal.gov   Dsg       gpu      -            test_gpu
ds1.fnal.gov   Ds        ds       low_ds       test_ds

Normal: Normal queues should be used for all standard running. Jobs in these queues run at normal priority.

Background: The background queue should be used for background jobs that will be run when unused nodes are available. Such jobs will be scheduled at the lowest priority and will run only when there are no other jobs in the corresponding queue.

Test: The test queues should be used for high priority, short jobs that require modest resources. Test queue jobs have the highest priority and will run as soon as nodes are available. At most two test queue jobs may execute at a time per user per cluster. All test queue jobs have a maximum walltime of 1 hour and a limit of 64 cores (i.e. 2 nodes on Bc or Ds); on the pi0g and Dsg clusters the limit is 2 nodes.
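
For example, a short debugging run on the Ds cluster could be submitted to its test queue as follows, staying within the walltime and node limits above ("projectName" is again a placeholder for a valid project name):

$ qsub -q test_ds -l nodes=2,walltime=00:30:00 -A projectName batch-example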

We regularly adjust queue limits in response to reasonable user requests. Please send such requests to lqcd-admin@fnal.gov. Use the following command to see a particular queue's settings:

$ qmgr -c "list queue test"
Queue test
    queue_type = Execution
    Priority = 4
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 2
    resources_max.nodect = 8
    resources_max.walltime = 01:00:00
    resources_default.walltime = 01:00:00
    resources_assigned.nodect = 0
    max_user_run = 1
    enabled = True
    started = True

See the pbs_server_attributes man page documentation for a description of the various server attributes.

5. Managing Data Files in Batch Scripts

The recommended way to read a large data file from within a batch job is to first stage (copy) the file to the local disk attached to the worker node before your application opens it. Each worker node has a staging area called /scratch. Note that the file is copied to disk only on the first node of a multi-node job (i.e. the one running the batch script). The user must perform additional copies if a file needs to be distributed to all worker nodes of a job, as sketched below.
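
The sketch below shows one way to perform such a distribution from within a batch script. It assumes password-less scp between the worker nodes of a job and reuses the placeholder file name from the example script; adapt both to your setup:

# Copy a (placeholder) input file from Lustre to /scratch on every
# worker node assigned to this job.
for node in `sort -u ${PBS_NODEFILE}`; do
    scp /lqcdproj/myDataArea/myInputDataFile ${node}:/scratch/
done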

Applications should write their output files to /scratch. Before exiting, scripts should copy any output files to be kept back to the head node or Lustre, as shown in line 48 of the example above.

NOTE: All files on /scratch are deleted after the current job ends and before the next job begins. The worker nodes on the various clusters have scratch disk space available as follows: pi0 (436G), pi0g (436G), Bc (211G), Dsg (440G) and Ds (210G).

6. Determining job priorities

Torque, a.k.a. PBS or OpenPBS, is the batch queue system that accepts, authenticates and queues submitted jobs. The Maui scheduler works in concert with Torque to schedule the running of queued jobs. As currently configured, the Maui scheduler calculates scheduling priorities as follows:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY

The weights are:

ACCOUNTWEIGHT   1000
CLASSWEIGHT     1000

Note: CLASS is a synonym for queue (-q on the qsub command). The account priorities are:

ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2

Note: cdms and xqcd have exceeded their allocation and get a low priority.
The following are priorities of the various queues.

CLASSCFG[test]    PRIORITY=300
CLASSCFG[workq]   PRIORITY=200
CLASSCFG[low]     PRIORITY=100

"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job which is eligible to run; i.e. a job not precluded from running because of node limits. The node limit is the total number of nodes, both in use (by jobs in R state) and requested (by jobs in Q state). Node limits can be exhuasted by a single user or by a project. If a project exhausts the node limits, then all users under the same project are affected. Please read the node usage limits section for a detailed explanation on node limits.

NOTE: As of Dec 2007, non-root users are not authorized to run the diagnose command and should use /usr/local/bin/diagnose_priority instead. If the command produces no output, that is not an error; it simply means there are no idle jobs in the queue. If there is an error running the command, please email us at lqcd-admin@fnal.gov.

For example:

[@lqcd ~]# /usr/local/maui/bin/diagnose -p
diagnosing job priority information (partition: ALL)

Job         PRIORITY*   Cred(Accnt:Class)
  Weights    --------      1( 1000: 1000)

103808        300000    100.0(100.0:200.0)
103809        300000    100.0(100.0:200.0)
103647        200000    100.0(100.0:100.0)
103648        200000    100.0(100.0:100.0)

All of the jobs have the default account priority of 100. Two jobs are in the workq queue/class (priority=200) and two jobs are in the low queue/class (priority=100). For job 103808 the calculation is:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
= 1000 * 100 + 1000 * 200 = 300000

/usr/local/maui/bin/diagnose -q shows the reason jobs are held. For example:

[@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT partition ALL)

job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)

Another useful command to check the status of your job is

[@lqcd ~]# /usr/local/maui/bin/checkjob job_number

Should you need further assistance in interpreting queue behavior, please email us at lqcd-admin@fnal.gov.

Node usage limits: Each project and each user is restricted in the number of nodes it may use per cluster. There are two types of node limits per project and per user: soft and hard. The soft limits apply when there are competing jobs in the queue. Once the soft limits have been reached, and if resources are still available, the hard limits apply. If a project reaches a limit, all users in that project are constrained by it; if an individual user reaches a limit, only that user is affected. For example, suppose the soft and hard node limits per user are 10 and 32 on a 32-node cluster. The scheduler first applies the 10-node limit to every user's jobs. After scheduling all jobs in the queue, if there are still available resources, the scheduler will launch additional jobs up to the 32-node per-user limit. When unused nodes are available on a cluster, these node restrictions are not enforced. The node restrictions for each cluster are summarized in the table below:

Cluster   Soft Limit   Hard Limit
Ds        96           256
Bc        64           224
pi        32           214