Quick Links
- Submitting and Monitoring a Batch Job
- Example PBS Batch Script
- Project Accounting
- Queue Policy and Queue Names
- Managing Data Files in Batch Scripts
- Determining job priorities
User jobs run in batch mode on the cluster's worker nodes. The batch
system is the TORQUE Resource Manager, also known as PBS or OpenPBS. A
batch script describes the sequence of programs to be run. Batch scripts
are ordinary shell scripts, written for bash, sh, or csh, with additional
PBS option lines prefixed by #PBS, which the shell treats as comments.
PBS options may be given at the beginning of a batch script or as
command-line options to the qsub command when the script is submitted to
the batch system. The available options are described in the Unix man
page for qsub.
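Because #PBS lines are ordinary shell comments, they can be extracted from a script with standard text tools. The sketch below is only an illustration: the function name list_pbs_options is hypothetical and is not part of PBS. It prints the options embedded in a batch script, which can be handy for checking what a script requests before submitting it.

```shell
# list_pbs_options FILE
# Hypothetical helper (not a PBS command): print the option part of every
# line in FILE that starts with "#PBS", one per line, in file order.
list_pbs_options() {
    local script=$1
    # Match "#PBS" followed by whitespace and strip that prefix,
    # leaving only the option text.
    sed -n 's/^#PBS[[:space:]]\{1,\}//p' "$script"
}
```

For the example script shown later in this section, `list_pbs_options batch-example` would list the -S, -N, -j, -o, -m, -l, -A and -q options.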
The example batch script below specifies the required number of nodes and
the walltime on its #PBS -l line, so they need not be repeated as
arguments to the qsub command. Use qsub to submit the example script to
PBS as follows:
$ qsub batch-example
8795.lqcd.fnal.gov
Use the qstat command to view queued jobs.
$ qstat
8793.lqcd   fullqcd       steve   0          Q   workq
8795.lqcd   cpi-example   jim     00:00:01   R   workq
8796.lqcd   onia          jim     00:00:16   R   workq
8797.lqcd   bigJob        don     00:00:16   R   workq
Status "R" for job cpi-example indicates that it is running.
Status "Q" indicates that a job is queued and waiting to run. See the
qstat man page for a description of all the available command-line
options.
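As a minimal sketch, qstat output can be filtered by job state with awk. The helper name jobs_in_state is hypothetical, and the filter assumes the column layout of the sample output above (job id, name, user, time used, state, queue); a qstat that prints different columns would need a different field number.

```shell
# jobs_in_state STATE
# Hypothetical helper: read qstat-style lines on stdin and print the job id
# and job name of every job whose state column (field 5) equals STATE.
# Assumed column layout: jobid  name  user  time-used  state  queue
jobs_in_state() {
    awk -v state="$1" '$5 == state { print $1, $2 }'
}

# Typical use on a head node (requires qstat, so it is left commented here):
# qstat | jobs_in_state R
```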
A graphical display of cluster status and running jobs is available
via the web for the jpsi and kaon clusters.
The example bash script below runs the MPI example
program cpi (see: /usr/local/mvapich/examples/cpi.c)
in parallel. The source for this script is here: batch-example and output from a
run is here: cpi-example.out.
In the example shown below, Lines 2 - 9 are PBS options.
Option "-S" selects bash as
the batch job shell.
Option "-N" is the name this
job will have in the queue.
Option "-j oe" redirects
stream stderr to stdout.
Option "-o" sets the name of
the file containing stdout from the job. Job output will be moved to
this file when the batch job finishes.
Option "-m n" disables email
notification of job status by PBS.
In line 7, "-l nodes=3,walltime=00:01:00"
lists the resources this job requests. A job requiring more than
one node must specify a value for nodes, either in the batch file or
when the job is submitted. A batch job must also specify the maximum
walltime it expects to need. A job that exceeds the requested
time limit will be terminated.
The "-A projectName" option
is required and specifies the name of the project to be charged for
cluster resource usage. Replace "projectName" with a valid project name
that you belong to.
Option "-q" selects the queue the
job is submitted to, here jpsi.
See the qsub documentation for
descriptions of the environment variables such as PBS_O_WORKDIR
and PBS_NODEFILE that PBS sets for batch
jobs.
#! /bin/bash
#PBS -S /bin/bash
#PBS -N cpi-example
#PBS -j oe
#PBS -o ./cpi-example.out
#PBS -m n
#PBS -l nodes=3,walltime=00:01:00
#PBS -A projectName
#PBS -q jpsi
cd ${PBS_O_WORKDIR}
# print identifying info for this job
echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"
export OFED_DIR=/usr/local/scidac-ofed-1.4.2/install
export QMP_DIR=$OFED_DIR/qmp
export QLA_DIR=$OFED_DIR/qla
export QIO_DIR=$OFED_DIR/qio
export QDP_DIR=$OFED_DIR/qdp
export MPI_DIR=/usr/local/mvapich-new
export GCC_DIR=/usr/local/gcc-4.3.2
export MILC_DIR=/usr/local/MILC
# determine number of cores per host
coresPerNode=$(grep -c '^processor' /proc/cpuinfo)
# count the number of nodes listed in PBS_NODEFILE
nNodes=$(wc -l < "${PBS_NODEFILE}")
(( nCores = nNodes * coresPerNode ))
echo "NODEFILE nNodes=$nNodes ($nCores cores):"
cat ${PBS_NODEFILE}
# Always use fcp or rsync to stage any large input files from the cluster head node
# to your job's control worker node.
# All worker nodes have attached disk storage in /scratch
# Copy below is commented since there are no files to transfer
# fcp -c rsh -p cluster-head-node.fnal.gov:/data/raid1/data/myDataArea/myInputDataFile /scratch
echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
#The directory the job was submitted from is $PBS_O_WORKDIR.
echo ${PBS_O_WORKDIR}
cp ${MPI_DIR}/examples/cpi ${PBS_O_WORKDIR}
application=${PBS_O_WORKDIR}/cpi
echo
cpus=$nCores
echo "=== Run MPI application on $cpus cores ==="
$MPI_DIR/bin/mpirun_rsh -np $nCores $application
# Always use fcp to copy any large result files you want to keep back
# to the head node before exiting your script. The /scratch area on the
# workers is wiped clean between jobs.
#
# Copy below is commented since there are no files to transfer
# fcp -c rsh -p /scratch/myOutputDataFile cluster-head-node.fnal.gov:/data/raid1/data/myDataArea
exit
Effective Wednesday, August 24th, 2005, all authorized
users who submit PBS jobs to the Fermilab Lattice QCD clusters will be
required to specify their project name. An example of a command line
job submission would be:
$ qsub -A project_name ...
or in a PBS job script file:
#! /bin/bash
#PBS -A project_name
mpirun -np $cpus $application
echo
exit
The PBS batch queue system rejects jobs that are
missing the -A project option.
$ qsub -l nodes=2 pbs.script
qsub: Invalid Account
Any attempt to submit a job with an invalid project name
is also rejected.
$ qsub -A invalid_project_name ...
qsub: Invalid Account
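A wrapper script could catch a missing -A option before calling qsub. The sketch below is one possible pre-submit check; the function name has_account_option is hypothetical. It only inspects command-line arguments in the separated form "-A name", so it does not see an -A option set on a #PBS line inside the script, and it cannot tell whether the project name is valid.

```shell
# has_account_option ARGS...
# Hypothetical pre-submit check: succeed (status 0) if the argument list
# contains "-A" followed by a non-empty project name, otherwise fail.
has_account_option() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-A" ] && [ -n "${2:-}" ]; then
            return 0
        fi
        shift
    done
    return 1
}

# Example use in a wrapper (qsub call left commented):
# has_account_option "$@" || { echo "missing -A project_name" >&2; exit 1; }
# qsub "$@"
```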
The following table, in no particular order, lists all
the accepted project names. All project names should appear in a job
submission the same way as listed in the table below (case sensitive).
Users should submit jobs on the appropriate head node
into one of the following queues, depending upon which cluster is
required and which run policy fits their need:
| Head Node | Cluster | Normal queue | Background queue | Test queue |
|---|---|---|---|---|
| jpsi1.fnal.gov | J/Psi | jpsi | low_jpsi | test_jpsi |
| kaon1.fnal.gov | kaon | kaon | low_kaon | test_kaon |
Normal: Normal queues should be used for all standard
running. Jobs in these queues run at normal priority.
Background: The background queue should be used for background
jobs that will be run when unused nodes are available. Such jobs will
be scheduled at the lowest priority and will run only when there are no
other jobs in the corresponding queue.
Test: The test queues should be used for high-priority,
short jobs that require modest resources. Test queue jobs have the
highest priority and will run as soon as nodes are available. A maximum
of 2 test queue jobs may execute at a time, but only one per user per
cluster. All test queue jobs have a maximum walltime limit of 1 hour
and a limit of 64 cores (CPUs), which corresponds to 8 nodes on J/Psi
and 16 nodes on Kaon.
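The test queue limits can be checked with simple arithmetic before submitting. The sketch below uses hypothetical helper names. The cores-per-node values in the usage comments (8 for J/Psi, 4 for Kaon) are inferred from the statement above that 64 cores is 8 nodes on J/Psi and 16 nodes on Kaon.

```shell
# walltime_to_seconds HH:MM:SS
# Print the walltime as a number of seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# fits_test_queue NODES CORES_PER_NODE WALLTIME
# Hypothetical check against the test queue limits quoted above:
# at most 64 cores and at most 1 hour (3600 s) of walltime.
fits_test_queue() {
    local max_cores=64 max_seconds=3600
    [ "$(( $1 * $2 ))" -le "$max_cores" ] || return 1
    [ "$(walltime_to_seconds "$3")" -le "$max_seconds" ] || return 1
    return 0
}

# fits_test_queue 8 8 01:00:00   -> succeeds (64 cores, 1 hour)
# fits_test_queue 16 4 00:30:00  -> succeeds (64 cores, 30 minutes)
# fits_test_queue 9 8 00:10:00   -> fails (72 cores)
```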
Node usage limits: Each cluster scheduler restricts the number of nodes
a single user may use. There are two types of per-user node limits, soft
and hard. The soft limits apply when there are competing jobs in the
queue. Once the soft limits have been reached, the hard limits apply if
resources are still available. For example, suppose the soft and hard
node limits per user are 10 and 32 on a 32-node cluster. The scheduler
first applies the 10-node limit to each user's jobs. After scheduling
all jobs in the queue, if resources are still available, the scheduler
launches additional jobs up to the 32-node limit per user. When unused
nodes are available on any cluster, these node restrictions will not be
enforced. The node restrictions for each cluster
are summarized in the table below:
| Cluster | Soft Limit | Hard Limit |
|---|---|---|
| kaon | 400 | 1200 |
| jpsi | 256 | 900 |
We regularly adjust queue limits in response to
reasonable user requests. Please send requests to lqcd-admin@fnal.gov.
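The interplay of soft and hard node limits can be illustrated with a little arithmetic. The sketch below is a simplified model of the two-pass behavior described above, not Maui's actual algorithm (which schedules whole jobs rather than individual nodes). All function names are hypothetical.

```shell
# min_int A B: print the smaller of two integers.
min_int() {
    if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# soft_pass REQUESTED SOFT
# Nodes a user may hold while the soft limit applies.
soft_pass() {
    min_int "$1" "$2"
}

# hard_pass REQUESTED GRANTED HARD FREE
# Nodes a user may hold after leftover FREE nodes are handed out,
# capped by the hard limit and by what the user actually requested.
hard_pass() {
    local requested=$1 granted=$2 hard=$3 free=$4
    local cap extra
    cap=$(min_int "$requested" "$hard")
    extra=$(min_int "$(( cap - granted ))" "$free")
    echo "$(( granted + extra ))"
}

# Limits from the example in the text: soft 10, hard 32; user requests 20.
soft_pass 20 10        # prints 10
hard_pass 20 10 32 14  # 14 nodes left over: prints 20
hard_pass 20 10 32 4   # only 4 nodes left over: prints 14
```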
Use the following command to see a particular queue's
setting:
$ qmgr -c "list queue test"
Queue test
    queue_type = Execution
    Priority = 4
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 2
    resources_max.nodect = 8
    resources_max.walltime = 01:00:00
    resources_default.walltime = 01:00:00
    resources_assigned.nodect = 0
    max_user_run = 1
    enabled = True
    started = True
The recommended way to read a large data file from
within a batch job is to
first stage the file to the local disk attached to the worker node before
your application opens it. Each worker node has a staging area called /scratch.
The fcp command is the recommended way to copy files
between the head node's filesystems and the worker nodes. The
commented-out fcp command that precedes the application run in the
example script above shows how to use fcp to copy files to the workers.
Note that the file is copied to disk on only the first node of a
multinode job (i.e. the one running the batch script).
The user must perform additional copies if it is necessary to
distribute a file to all worker nodes of a job.
Applications should write their output files to /scratch.
Before exiting, scripts should copy all output files to be kept
back to the head node, as shown by the commented-out fcp command near
the end of the example script above.
All files on /scratch are deleted after the current job ends and
before the next job begins.
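Since /scratch is wiped between jobs, a failed copy is easy to miss until the application fails or results are lost. The sketch below wraps the copy step in a small function that reports failures. The function name stage_file and the COPY_CMD variable are hypothetical. The default fcp invocation mirrors the commented-out commands in the example script, and it can be overridden (for example with "rsync -a" or "cp -p") where fcp is not available.

```shell
# Copy command used for staging; override COPY_CMD to use another tool.
COPY_CMD=${COPY_CMD:-"fcp -c rsh -p"}

# stage_file SOURCE DEST
# Hypothetical helper: copy SOURCE to DEST. On failure, print a message to
# stderr and return a non-zero status so the caller can stop before the
# application runs on missing input or exits without saving its output.
stage_file() {
    local src=$1 dest=$2
    # COPY_CMD is intentionally unquoted so that its options split into words.
    if ! $COPY_CMD "$src" "$dest"; then
        echo "staging failed: $src -> $dest" >&2
        return 1
    fi
}

# Example use inside a batch script (paths taken from the example above):
# stage_file cluster-head-node.fnal.gov:/data/raid1/data/myDataArea/myInputDataFile /scratch || exit 1
# stage_file /scratch/myOutputDataFile cluster-head-node.fnal.gov:/data/raid1/data/myDataArea || exit 1
```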
Torque, also known as PBS or OpenPBS, is the batch queue system that accepts, authenticates, and submits jobs to the various queues. The Maui scheduler works in concert with Torque to schedule the jobs in the queues.
The Maui scheduler, as currently configured, calculates scheduling priorities as follows:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
The weights are:
ACCOUNTWEIGHT 1000
CLASSWEIGHT 100
Note: CLASS is a synonym for queue (-q on the qsub command) and the priorities are:
ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2
Note: cdms and xqcd have exceeded their allocation and get a low priority.
CLASSCFG[test] PRIORITY=300
CLASSCFG[workq] PRIORITY=200
CLASSCFG[low] PRIORITY=100
"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job which is eligible to run; i.e. a job not precluded from
running because of node limits. The node limit covers the total number of nodes, both in use (by jobs in R state) and requested (by jobs in Q state). Node
limits can be exhausted by a single user or by a project. If a project exhausts the node limits, then all users under the same project are affected.
NOTE: As of 12/4/2007 non-root users are not authorized to run the diagnose command and should use /usr/local/bin/diagnose_priority instead.
For example:
[root@lqcd ~]# /usr/local/maui/bin/diagnose -p
diagnosing job priority information (partition: ALL)
Job        PRIORITY*   Cred(Accnt:Class)
Weights    --------       1( 1000:  100)
103808       120000   100.0(100.0:200.0)
103809       120000   100.0(100.0:200.0)
103647       110000   100.0(100.0:100.0)
103648       110000   100.0(100.0:100.0)
All of the jobs have the default account priority of 100. Two jobs are in the workq queue/class (weight=100, priority=200) and two jobs are in the low queue/class (priority=100). For job 103808 the calculation is:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY = 1000*100 + 100*200 = 120000
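The same calculation can be reproduced for any account and queue combination. The sketch below is only a restatement of the formula in shell arithmetic, using the weights and priorities quoted in this section. The function name maui_priority is hypothetical.

```shell
# Weights quoted above from the Maui configuration.
ACCOUNTWEIGHT=1000
CLASSWEIGHT=100

# maui_priority ACCOUNTPRIORITY CLASSPRIORITY
# Print ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY.
maui_priority() {
    echo "$(( $ACCOUNTWEIGHT * $1 + $CLASSWEIGHT * $2 ))"
}

maui_priority 100 200   # default account, workq queue: prints 120000
maui_priority 100 100   # default account, low queue:   prints 110000
maui_priority 1 300     # cdms account, test queue:     prints 31000
```

With these weights the account term dominates: a job from an over-allocation account such as cdms ranks below a default-account job in the low queue, even when it is submitted to the test queue.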
/usr/local/maui/bin/diagnose -q shows the reason jobs are held. For example:
[root@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT partition ALL)
job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)
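When the list of blocked jobs is long, it can be filtered for a single user. The sketch below assumes the line format of the sample output above ("job <id> violates ... for user <name> (...)"). The helper name blocked_jobs_for_user is hypothetical.

```shell
# blocked_jobs_for_user USER
# Hypothetical helper: read "diagnose -q" style lines on stdin and print
# the ids of blocked jobs that belong to USER.
blocked_jobs_for_user() {
    awk -v user="$1" '
        $1 == "job" {
            # Find the word that follows "user" and compare it to USER.
            for (i = 3; i < NF; i++) {
                if ($i == "user" && $(i + 1) == user) { print $2; break }
            }
        }'
}

# Example use (requires permission to run diagnose, so left commented):
# /usr/local/maui/bin/diagnose -q | blocked_jobs_for_user syoon
```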
Another useful command to check the status of your job is /usr/local/maui/bin/checkjob job_number
Should you need further assistance in interpreting queue behavior please do not hesitate to email us at lqcd-admin@fnal.gov