FNAL - LQCD Documentation

Quick Links

  1. Submitting and Monitoring a Batch Job
  2. Example PBS Batch Script
  3. Project Accounting
  4. Queue Policy and Queue Names
  5. Managing Data Files in Batch Scripts
  6. Determining job priorities

1. Submitting and Monitoring a Batch Job

User jobs are run in batch mode on the cluster's worker nodes. The batch system is the TORQUE Resource Manager, also known as PBS or OpenPBS. A batch script describes the sequence of programs to be run. Batch scripts are ordinary shell scripts, written for bash, sh or csh, with additional PBS option lines prefixed by #PBS, which the shell treats as comments. PBS options may be specified at the beginning of a batch script or as command line options to the qsub command when the script is submitted to the batch system. Available options are described in the man page for qsub.

The example batch script below specifies the required number of nodes and walltime with the #PBS -l line, hence, they need not be repeated as arguments to the qsub command. Use qsub to submit the example script to PBS as follows:

$ qsub batch-example
8795.lqcd.fnal.gov

Use the qstat command to view queued jobs.

$ qstat
8793.lqcd fullqcd steve 0 Q workq
8795.lqcd cpi-example jim 00:00:01 R workq
8796.lqcd onia jim 00:00:16 R workq
8797.lqcd bigJob don 00:00:16 R workq

Status "R" for job cpi-example indicates that it is running. Status "Q" indicates that a job is queued and waiting to run. See the qstat man pages for a description of all the available command line options.
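
When many jobs are queued, a per-state tally can be easier to read than the raw listing. The following is a minimal sketch, assuming the six-column layout shown above with the state letter in the fifth column. The function name count_jobs_by_state is illustrative and is not part of PBS.

```shell
# Tally jobs per state from qstat-style lines read on stdin.
# Assumption: the state is a single capital letter in column 5, as in
# the listing above; lines that do not match (headers, rules) are skipped.
count_jobs_by_state() {
    awk '$5 ~ /^[A-Z]$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the qstat listing above.
sample_qstat='8793.lqcd fullqcd steve 0 Q workq
8795.lqcd cpi-example jim 00:00:01 R workq
8796.lqcd onia jim 00:00:16 R workq
8797.lqcd bigJob don 00:00:16 R workq'

printf '%s\n' "$sample_qstat" | count_jobs_by_state
```

On a head node the same function could be fed live data with "qstat | count_jobs_by_state".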

A graphical display of cluster status and running jobs is available via the web for the jpsi and kaon clusters.

2. Example PBS Batch Script

The example bash script below runs the MPI example program cpi (see: /usr/local/mvapich/examples/cpi.c) in parallel. The source for this script is here: batch-example and output from a run is here: cpi-example.out.

In the example shown below, Lines 2 - 9 are PBS options.

Option "-S" selects bash as the batch job shell.

Option "-N" is the name this job will have in the queue.

Option "-j oe" redirects stream stderr to stdout.

Option "-o" sets the name of the file containing stdout from the job. Job output will be moved to this file when the batch job finishes.

Option "-m n" disables email notification of job status by PBS.

In line 7, "-l nodes=3,walltime=00:01:00" lists the resources this job is requesting. A job requiring more than one node must specify a value for nodes either in the batch file or when the job is submitted. A batch job must also specify the maximum walltime it expects to need. A job that exceeds the requested time limit will be terminated.

The "-A projectName" option is required and specifies the name of the project to be charged for cluster resource usage. Replace "projectName" with a valid project name that you belong to.

See the qsub documentation for descriptions of the environment variables such as PBS_O_WORKDIR and PBS_NODEFILE that PBS sets for batch jobs.

#! /bin/bash
#PBS -S /bin/bash
#PBS -N cpi-example
#PBS -j oe
#PBS -o ./cpi-example.out
#PBS -m n
#PBS -l nodes=3,walltime=00:01:00
#PBS -A projectName
#PBS -q jpsi
cd ${PBS_O_WORKDIR}

# print identifying info for this job
echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"

export OFED_DIR=/usr/local/scidac-ofed-1.4.2/install
export QMP_DIR=$OFED_DIR/qmp
export QLA_DIR=$OFED_DIR/qla
export QIO_DIR=$OFED_DIR/qio
export QDP_DIR=$OFED_DIR/qdp
export MPI_DIR=/usr/local/mvapich-new
export GCC_DIR=/usr/local/gcc-4.3.2
export MILC_DIR=/usr/local/MILC

# determine number of cores per host (one "processor" entry per core)
coresPerNode=$(grep -c '^processor' /proc/cpuinfo)
# count the number of nodes listed in PBS_NODEFILE
nNodes=$(wc -l < "${PBS_NODEFILE}")
(( nNodes = 0 + nNodes ))
(( nCores = nNodes * coresPerNode ))
echo "NODEFILE nNodes=$nNodes ($nCores cores):"
cat ${PBS_NODEFILE}

# Always use fcp or rsync to stage any large input files from the cluster head node
# to your job's control worker node.
# All worker nodes have attached disk storage in /scratch
# Copy below is commented since there are no files to transfer
# fcp -c rsh -p cluster-head-node.fnal.gov:/data/raid1/data/myDataArea/myInputDataFile /scratch

echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
#The directory the job was submitted from is $PBS_O_WORKDIR.
echo ${PBS_O_WORKDIR}

cp ${MPI_DIR}/examples/cpi ${PBS_O_WORKDIR}
application=${PBS_O_WORKDIR}/cpi

echo
# use every core on every allocated node
cpus=$nCores
echo "=== Run MPI application on $cpus cores ($coresPerNode per node) ==="
$MPI_DIR/bin/mpirun_rsh -np $nCores $application

# Always use fcp to copy any large result files you want to keep back
# to the head node before exiting your script. The /scratch area on the
# workers is wiped clean between jobs.
#
# Copy below is commented since there are no files to transfer
# fcp -c rsh -p /scratch/myOutputDataFile cluster-head-node.fnal.gov:/data/raid1/data/myDataArea

exit
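
The script above derives nNodes by counting the lines in PBS_NODEFILE. That count equals the number of hosts only when each host appears once. TORQUE typically repeats a host name once per requested processor when a ppn value is given, so treat the one-line-per-host layout as an assumption to verify by inspecting the node file. A hedged sketch that counts distinct hosts instead is shown here; count_unique_hosts is a hypothetical helper, the host names are made up, and the temporary file stands in for a real node file.

```shell
# Count distinct host names in a node file, ignoring repeated entries.
count_unique_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Stand-in for ${PBS_NODEFILE}: three hosts, two of them listed twice.
nodefile=$(mktemp)
printf 'node01\nnode01\nnode02\nnode03\nnode03\n' > "$nodefile"

nNodes=$(count_unique_hosts "$nodefile")
echo "distinct hosts: $nNodes"
rm -f "$nodefile"
```

Inside a batch job the call would be count_unique_hosts "${PBS_NODEFILE}".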
  

3. Project Accounting

Effective Wednesday, August 24th, 2005, all authorized users who submit PBS jobs to the Fermilab Lattice QCD clusters will be required to specify their project name. An example of a command line job submission would be:

$ qsub -A project_name ...

or in a PBS job script file:

#! /bin/bash
#PBS -A project_name

mpirun -np $cpus $application
echo

exit

The PBS batch queue system will not honor jobs that are missing the -A project option.

$ qsub -l nodes=2 pbs.script
qsub: Invalid Account

All attempts to submit a job with the wrong project name will be rejected.

$ qsub -A invalid_project_name ...
qsub: Invalid Account
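
Because a missing project option is only reported at submission time, it can be convenient to check a script beforehand. The sketch below only tests whether a "#PBS -A" line with a non-empty name is present; it cannot tell whether the name is an accepted project, and it does not cover options given on the qsub command line. The helper name has_project_option is an assumption made for this illustration.

```shell
# Succeed if the script contains a "#PBS -A <name>" line with a non-empty name.
has_project_option() {
    grep -Eq '^#PBS[[:space:]]+-A[[:space:]]+[^[:space:]]+' "$1"
}

# Demonstrate on a temporary script, first without and then with the option.
script=$(mktemp)
printf '#! /bin/bash\n#PBS -N demo\n' > "$script"
if has_project_option "$script"; then first=ok; else first=missing; fi

printf '#PBS -A project_name\n' >> "$script"
if has_project_option "$script"; then second=ok; else second=missing; fi

echo "before: $first, after: $second"
rm -f "$script"
```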

The following table, in no particular order, lists all the accepted project names. All project names should appear in a job submission the same way as listed in the table below (case sensitive).

4. Queue Policy and Queue Names

Users should submit jobs on the appropriate head node into one of the following six queues, depending upon which cluster is required and which run policy fits their need:

                           Queues
Head Node        Cluster   Normal   Background   Test
jpsi1.fnal.gov   J/Psi     jpsi     low_jpsi     test_jpsi
kaon1.fnal.gov   kaon      kaon     low_kaon     test_kaon

Normal: Normal queues should be used for all standard running. Jobs in these queues run at normal priority.
Background: The background queue should be used for background jobs that will be run when unused nodes are available. Such jobs will be scheduled at the lowest priority and will run only when there are no other jobs in the corresponding queue.
Test: The test queues should be used for high priority, short jobs that require modest resources. Test queue jobs have the highest priority and will run as soon as nodes are available. A maximum of 2 test queue jobs may execute at a time, but only one per user per cluster. All test queue jobs have a maximum walltime limit of 1 hour, and a limit of 64 cores (CPUs). That is 8 nodes on Jpsi and 16 nodes on Kaon.
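
The test queue limits above (at most 64 cores and 1 hour of walltime) can be checked with a little arithmetic before submitting. This is a sketch only: the per-node core counts of 8 for Jpsi and 4 for Kaon are inferred from the "8 nodes on Jpsi and 16 nodes on Kaon" statement, and both function names are hypothetical.

```shell
# Convert HH:MM:SS to seconds (awk reads "08" as decimal 8, not octal).
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# fits_test_queue NODES CORES_PER_NODE WALLTIME
# Succeeds when the request is within 64 cores and 01:00:00.
fits_test_queue() {
    cores=$(( $1 * $2 ))
    secs=$(walltime_to_seconds "$3")
    [ "$cores" -le 64 ] && [ "$secs" -le 3600 ]
}

fits_test_queue 8 8 00:30:00  && echo "8 Jpsi nodes, 30 min: fits"
fits_test_queue 17 4 00:30:00 || echo "17 Kaon nodes, 30 min: too many cores"
fits_test_queue 4 4 01:00:01  || echo "4 Kaon nodes, 1 h 1 s: walltime too long"
```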

Node usage limits: Each cluster scheduler restricts the number of nodes a single user may occupy. There are two per-user node limits, soft and hard. The soft limit applies when there are competing jobs in the queue. Once the soft limit has been reached, and if resources are still available, the hard limit applies. For example, suppose the soft and hard node limits per user are 10 and 32 on a 32-node cluster. The scheduler first applies the 10-node limit to each user's jobs. If nodes remain available after all queued jobs have been scheduled, the scheduler launches additional jobs up to the 32-node limit per user. When unused nodes are available on any cluster, these node restrictions will not be enforced. The node restrictions for each cluster are summarized in the table below:

Cluster   Soft Limit   Hard Limit
kaon      400          1200
jpsi      256          900
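
The interaction of soft and hard limits can be illustrated with a toy two-pass model. This is not how the scheduler is implemented: it treats nodes as freely divisible rather than belonging to whole jobs, ignores priorities, and uses the 32-node, soft 10, hard 32 figures from the example in the text. The user names and the function name allocate_nodes are invented for the illustration.

```shell
# Toy model: stdin lines are "user requested_nodes".
# Pass 1 grants each user up to the soft limit; pass 2 hands any
# leftover nodes to users who still want more, up to the hard limit.
# $1 = total nodes, $2 = soft limit, $3 = hard limit
allocate_nodes() {
    awk -v total="$1" -v soft="$2" -v hard="$3" '
        function min(a, b) { return a < b ? a : b }
        { user[NR] = $1; want[NR] = $2 + 0 }
        END {
            free = total + 0
            for (i = 1; i <= NR; i++) {        # pass 1: soft limit
                got[i] = min(min(want[i], soft + 0), free)
                free -= got[i]
            }
            for (i = 1; i <= NR; i++) {        # pass 2: hard limit
                extra = min(min(want[i], hard + 0) - got[i], free)
                got[i] += extra
                free -= extra
            }
            for (i = 1; i <= NR; i++) print user[i], got[i]
        }'
}

# alice asks for 20 nodes and bob for 8 on a 32-node cluster.
printf 'alice 20\nbob 8\n' | allocate_nodes 32 10 32
```

In this run alice is first held to 10 nodes, bob receives his 8, and the 14 nodes still free let alice reach her full request of 20.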

We regularly adjust queue limits in response to reasonable user requests. Please send requests to lqcd-admin@fnal.gov.

Use the following command to see a particular queue's setting:

$ qmgr -c "list queue test"
Queue test
queue_type = Execution
Priority = 4
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 2
resources_max.nodect = 8
resources_max.walltime = 01:00:00
resources_default.walltime = 01:00:00
resources_assigned.nodect = 0
max_user_run = 1
enabled = True
started = True
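
To pull a single value out of that listing, for instance inside a submission wrapper, the "name = value" lines can be split with awk. The sketch below assumes the output format shown above; queue_setting is a hypothetical helper, and the sample text is a subset of the listing.

```shell
# Print the value of one attribute from "qmgr -c 'list queue ...'" output
# read on stdin. Leading whitespace before the attribute name is ignored.
queue_setting() {
    awk -F' = ' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }
        $1 == key { print $2 }'
}

# Subset of the listing above, used here instead of a live qmgr call.
sample_qmgr='Queue test
queue_type = Execution
max_running = 2
resources_max.nodect = 8
resources_max.walltime = 01:00:00
max_user_run = 1'

printf '%s\n' "$sample_qmgr" | queue_setting resources_max.walltime
```

On a head node the equivalent would be: qmgr -c "list queue test" | queue_setting resources_max.walltime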

5. Managing Data Files in Batch Scripts

The recommended way to read a large data file from within a batch job is to first stage the file to the local disk attached to the worker node before your application opens it. Each worker node has a staging area called /scratch. The fcp command is the recommended way to copy files between the head node's filesystems and the worker nodes. The first commented-out fcp command in the example script above (before the MPI run) shows how to use fcp to copy files to the workers. Note that the file is copied to disk on only the first node of a multinode job (i.e. the one running the batch script). The user must perform additional copies if it is necessary to distribute a file to all worker nodes of a job.

Applications should write their output files to /scratch. Before exiting, scripts should copy all output files that are to be kept back to the head node, as shown by the commented-out fcp command at the end of the example script above.

All files on /scratch are deleted after the current job ends and before the next job begins.
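
Distributing a staged file to the other workers of a multinode job amounts to one copy per distinct host in the node file. The following dry-run sketch only prints the commands it would run. It assumes that fcp accepts a host:/scratch destination between workers in the same form the example script uses for the head node, which should be confirmed before use. The helper name and host names are hypothetical.

```shell
# Print one fcp command per distinct host in a node file, skipping the
# host that already holds the file.
# $1 = node file, $2 = name of this host, $3 = file under /scratch
print_distribute_commands() {
    sort -u "$1" | while read -r host; do
        if [ "$host" != "$2" ]; then
            echo "fcp -c rsh -p $3 $host:/scratch"
        fi
    done
}

# Stand-in for ${PBS_NODEFILE}; node01 plays the node running the script.
nodefile=$(mktemp)
printf 'node01\nnode02\nnode03\nnode02\n' > "$nodefile"

cmds=$(print_distribute_commands "$nodefile" node01 /scratch/myInputDataFile)
echo "$cmds"
rm -f "$nodefile"
```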

6. Determining job priorities

Torque, also known as PBS or OpenPBS, is the batch queue system which accepts, authenticates and submits jobs to the various queues. The Maui scheduler works in concert with Torque to schedule the running of queued jobs. As currently configured, the Maui scheduler calculates scheduling priorities as follows:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY

The weights are:

ACCOUNTWEIGHT   1000
CLASSWEIGHT 100

Note: CLASS is a synonym for queue (-q on the qsub command) and the priorities are:

ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2

Note: cdms and xqcd have exceeded their allocation and get a low priority.

CLASSCFG[test]        PRIORITY=300
CLASSCFG[workq] PRIORITY=200
CLASSCFG[low] PRIORITY=100

"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job that is eligible to run, i.e. a job not precluded from running because of node limits. The node limit is the total number of nodes, both in use (by jobs in R state) and requested (by jobs in Q state). Node limits can be exhausted by a single user or by a project. If a project exhausts the node limits, then all users under the same project are affected.

NOTE: As of 12/4/2007 non-root users are not authorized to run the diagnose command and should use /usr/local/bin/diagnose_priority instead.

For example:

[root@lqcd ~]# /usr/local/maui/bin/diagnose -p

diagnosing job priority information (partition: ALL)

Job PRIORITY* Cred(Accnt:Class)
Weights -------- 1( 1000: 100)

103808 120000 100.0(100.0:200.0)
103809 120000 100.0(100.0:200.0)
103647 110000 100.0(100.0:100.0)
103648 110000 100.0(100.0:100.0)

All of the jobs have the default account priority of 100. Two jobs are in the workq queue/class (weight 100, priority 200) and two are in the low queue/class (priority 100). For job 103808 the calculation is:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
1000*100 + 100*200 = 120000
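
The same formula can be checked with shell arithmetic for any account and queue combination. The sketch uses the weights and priorities listed earlier in this section; the function name maui_priority is illustrative and is not a Maui command.

```shell
# Weights as configured in this section.
ACCOUNTWEIGHT=1000
CLASSWEIGHT=100

# maui_priority ACCOUNT_PRIORITY CLASS_PRIORITY
maui_priority() {
    echo $(( $ACCOUNTWEIGHT * $1 + $CLASSWEIGHT * $2 ))
}

maui_priority 100 200   # default account, workq queue -> 120000
maui_priority 100 100   # default account, low queue   -> 110000
maui_priority 1 300     # cdms account, test queue     -> 31000
```

The first two results match the PRIORITY column of the diagnose output above. The third shows how a low account priority outweighs even the highest queue priority.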

/usr/local/maui/bin/diagnose -q shows the reason jobs are held. For example:

[root@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT partition ALL)

job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)
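
For long listings, the blocked jobs can be summarized per user. This sketch assumes every relevant line has the "job ... violates ... for user NAME ..." shape shown above; lines about other credential types are ignored. The function name blocked_jobs_by_user is an assumption for the example.

```shell
# Count blocked jobs per user from "diagnose -q" style lines on stdin.
blocked_jobs_by_user() {
    awk '$1 == "job" && $3 == "violates" {
             for (i = 4; i < NF; i++)
                 if ($i == "user") n[$(i + 1)]++
         }
         END { for (u in n) print u, n[u] }' | sort
}

# Sample lines copied from the output above.
sample_diag='job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)'

printf '%s\n' "$sample_diag" | blocked_jobs_by_user
```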

Another useful command to check the status of your job is

/usr/local/maui/bin/checkjob job_number

Should you need further assistance in interpreting queue behavior, please do not hesitate to email us at lqcd-admin@fnal.gov.
