Example PBS Batch Script
User jobs are run in batch mode on the cluster's compute nodes.
The batch system is the
TORQUE Resource Manager or OpenPBS.
A batch script is written to describe the sequence of programs that are to be run.
Batch scripts are ordinary shell scripts, written for bash, sh,
or csh,
with additional PBS option lines, prefixed by #PBS,
which the shell treats as comments.
PBS options may be specified at the beginning of a batch script or they may be
specified as command line options to the qsub command
when the script is submitted to the batch system.
Available options are described in the unix man documentation for
qsub.
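As an illustration of the command line form, the sketch below prints a qsub command carrying the same options that the example script in this section sets with #PBS lines. It only echoes the command so it can be inspected before use. The function name build_qsub_command is hypothetical, and "projectName" is a placeholder for a real project name.

```shell
#!/bin/bash
# Sketch: print the qsub command line equivalent to the #PBS lines of the
# example script. "projectName" is a placeholder, not a real project.
build_qsub_command() {
    local script="$1"
    local opts="-S /bin/bash -N cpi-example -j oe -o ./cpi-example.out"
    opts="$opts -m n -l nodes=3,walltime=00:01:00 -A projectName -q pion"
    # Echo instead of executing, so the command can be checked first.
    echo "qsub $opts $script"
}

build_qsub_command batch-example
```

Running the printed command by hand should be equivalent to submitting a script that contains the corresponding #PBS lines.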
The example bash script below runs the MPI example program cpi
(see:
/usr/local/mvapich/example/cpi.c) in parallel.
The source for this script is here:
batch-example and output from a run is here:
cpi-example.out.
In the example shown below, lines 2 - 9 are PBS options.
Option "-S" selects bash as
the batch job shell.
Option "-N" is the name this job will have
in the queue.
Option "-j oe" merges the stderr stream into stdout.
Option "-o" sets the name of the file containing stdout from the job.
Job output will be moved to this file when the batch job finishes.
Option "-m n" disables email notification of job status by PBS.
In line 7, "-l nodes=3,walltime=00:01:00" lists
the resources this job is requesting. A job requiring more than one node
must specify a value for nodes either in the batch file or when the job is
submitted. A batch job must also specify the maximum
walltime it expects to take to complete. A job exceeding the requested time limit will be terminated.
The "-A projectName" option is required and specifies the
name of the project to be charged for cluster resource usage.
Replace "projectName" with a valid project name that you belong to.
The "-q" option in line 9 selects the queue the job is submitted to.
See the qsub documentation for descriptions of the environment variables
such as PBS_O_WORKDIR and PBS_NODEFILE
that PBS sets for batch jobs.
#!/bin/bash
#PBS -S /bin/bash
#PBS -N cpi-example
#PBS -j oe
#PBS -o ./cpi-example.out
#PBS -m n
#PBS -l nodes=3,walltime=00:01:00
#PBS -A projectName
#PBS -q pion
cd "${PBS_O_WORKDIR}"
# print identifying info for this job
echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started $(date) jobid ${PBS_JOBID}"
# count the number of nodes listed in PBS_NODEFILE
nodes=$(( $(wc -l < "${PBS_NODEFILE}") ))
echo "Job allocated $nodes nodes"
# Always use rcp or rsync to stage any large input files from the head node (lqcd)
# to your job's control worker node.
# All worker nodes have attached disk storage in /scratch
#
# Copy below is commented since there are no files to transfer
# rcp -p lqcd:/data/raid1/data/myDataArea/myInputDataFile /scratch
echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
application="${PBS_O_WORKDIR}/cpi"
echo
cpus=$nodes
echo "=== Run MPI application on $cpus cpus (1 cpu per node) ==="
mpirun -np "$cpus" "$application"
# Always use rcp or rsync to copy any large result files you want to keep back
# to the head node before exiting your script. The /scratch area on the
# workers is wiped clean between jobs.
#
# There were no output files created in this example.
# rcp -p /scratch/myOutputDataFile lqcd:/data/raid1/data/myDataArea
Queue Policy and Queue Names
On kaon1.fnal.gov, users should submit jobs to one of the following six queues, depending
on which cluster and which run policy are required:
kaon
low_kaon
test_kaon
pion
low_pion
test_pion
The "pion" and "kaon" queues should be used for all standard running. Jobs in
these queues run at normal priority. A given user may use a combined total of 300
nodes across any number of jobs on Kaon and Pion. When unused nodes are
available on either cluster, these 300-node restrictions will not be enforced.
The "low_kaon" and "low_pion" queues should be used for background jobs that
will be run when unused nodes are available. Such jobs will be scheduled at
the lowest priority and will run only when there are no other standard jobs in
the corresponding queue.
The "test_kaon" and "test_pion" queues should be used for high priority, short
jobs that require modest resources. "test" queue jobs have the highest
priority and will run as soon as nodes are available. A maximum of 2
"test_kaon" and 2 "test_pion" jobs may execute at a time, but only one per
user per cluster. "test_kaon" jobs have a maximum walltime limit of 1 hour,
and a node count limit of 16 (64 cores). "test_pion" jobs have a maximum
walltime limit of 1 hour, and a node count limit of 64 (64 cores).
We regularly adjust queue limits in response to reasonable user requests.
Please send requests to "lqcd-admin@fnal.gov".
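The test queue limits quoted above can be checked before submission. The following is a minimal sketch, assuming the limits stay at 16 nodes for "test_kaon", 64 nodes for "test_pion", and 1 hour of walltime for both; since limits are adjusted from time to time, the qmgr output is the authority. The function names walltime_to_seconds and fits_test_queue are hypothetical helpers, not part of PBS.

```shell
#!/bin/bash
# Convert a walltime of the form HH:MM:SS to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Return 0 if the request fits the test queue limits quoted in the text,
# 1 if it exceeds them, and 2 if the queue is not a known test queue.
fits_test_queue() {
    local queue="$1" nodes="$2" walltime="$3" max_nodes
    case "$queue" in
        test_kaon) max_nodes=16 ;;
        test_pion) max_nodes=64 ;;
        *) return 2 ;;
    esac
    [ "$nodes" -le "$max_nodes" ] &&
        [ "$(walltime_to_seconds "$walltime")" -le 3600 ]
}
```

For example, `fits_test_queue test_kaon 16 01:00:00` succeeds, while a request for 17 nodes on "test_kaon" fails.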
The corresponding queues on lqcd.fnal.gov are:
workq
low
test
Use the following command to see a particular queue's settings:
$ qmgr -c "list queue test"
Queue test
queue_type = Execution
Priority = 4
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 2
resources_max.nodect = 8
resources_max.walltime = 01:00:00
resources_default.walltime = 01:00:00
resources_assigned.nodect = 0
max_user_run = 1
enabled = True
started = True
Managing Data Files in Batch Scripts
The recommended way to read a large data file from within a batch job is to
first stage the file to local disk attached to the worker
before your application opens the file.
Each worker has a staging area called /scratch.
The rcp command is the recommended way to copy files between
the head node's filesystems and the worker nodes. The commented-out rcp command
before the mpirun line in the example above shows how to use rcp to copy files
to the workers. Note that the file
is copied to disk on only the first node of a multinode job
(i.e. the one running the batch script).
The user must perform additional copies if it is necessary to distribute
a file to all worker nodes of a job.
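One way to perform the additional copies is to loop over the distinct hosts listed in PBS_NODEFILE. The sketch below only prints one rcp command per worker; remove the echo to run the copies. It assumes that rcp copies between two remote hosts are permitted on the cluster, which is not confirmed here, and the function name stage_to_all_nodes is hypothetical.

```shell
#!/bin/bash
# Print one rcp command per distinct worker node listed in a PBS node file.
# Usage: stage_to_all_nodes <source> <node-file>
# Assumption: rcp between two remote hosts is allowed on this cluster.
stage_to_all_nodes() {
    local src="$1" nodefile="$2" host
    # A node file may list a host more than once; visit each host once.
    sort -u "$nodefile" | while read -r host; do
        [ -n "$host" ] || continue
        # Dry run: drop the echo to actually perform the copy.
        echo rcp -p "$src" "${host}:/scratch"
    done
}
```

Inside a batch script this might be called as `stage_to_all_nodes lqcd:/data/raid1/data/myDataArea/myInputDataFile "${PBS_NODEFILE}"`.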
Applications should write their output files to /scratch.
Before exiting, scripts should copy all output files to be kept
back to the head node, as shown by the commented-out rcp command at the end of the example above.
All files on /scratch are deleted after the current job ends
and before the next job begins.
Project Accounting
Effective Wednesday, August 24th, 2005, all authorized users who
submit PBS jobs
to the Fermilab Lattice QCD cluster will be required to specify their
project name. An example of a command line job submission would be:
$ qsub -A project_name ...
or in a PBS job script file:
#! /bin/bash
#PBS -A project_name
mpirun -np $cpus $application
echo
exit
The PBS batch queue
system rejects jobs that are missing the -A project option:
$ qsub -l nodes=2 pbs.script
qsub: Invalid Account
All attempts to submit a job with the wrong project name will be rejected.
$ qsub -A invalid_project_name ...
qsub: Invalid Account
As of July 1 2006, the following table, in no particular order, lists all the accepted
project names. All project names should appear in a job submission the same way as listed in
the table below (case sensitive).
| Project |
| --- |
| charmonium |
| dynchiral |
| fermimilcheavylight |
| hasenfratz |
| hpqcd |
| latticesusy |
| lqcdadmin |
| milclat |
| mixbk |
| nplqcd |
| usertest |
| xqcd |
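A script can check a project name against the table above before calling qsub. This is a sketch only: the list is copied from the table as of July 1 2006 and may be out of date, and is_accepted_project is a hypothetical helper. The comparison is exact, so it is case sensitive like the batch system's own check.

```shell
#!/bin/bash
# Project names accepted as of July 1 2006, copied from the table above.
ACCEPTED_PROJECTS="charmonium dynchiral fermimilcheavylight hasenfratz hpqcd
latticesusy lqcdadmin milclat mixbk nplqcd usertest xqcd"

# Return 0 if the name matches an accepted project exactly (case sensitive).
is_accepted_project() {
    local name="$1" p
    for p in $ACCEPTED_PROJECTS; do
        [ "$p" = "$name" ] && return 0
    done
    return 1
}
```

For example, `is_accepted_project hpqcd` succeeds, while `is_accepted_project HPQCD` fails.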
For questions or concerns send mail to
lqcd-admin@fnal.gov
Submitting and Monitoring a Batch Job
Batch scripts are submitted for execution using the PBS qsub command.
See the unix man page and the PBS documentation for
a description of all of the command line
options for qsub.
The example batch script specifies the required number of nodes and
walltime with the #PBS -l line, hence,
they need not be repeated as arguments to the qsub command.
Use qsub to submit the example script to PBS:
$ qsub batch-example
8795.lqcd.fnal.gov
Use the qstat command to view queued jobs.
$ qstat
8793.lqcd fullqcd steve 0 Q workq
8795.lqcd cpi-example jim 00:00:01 R workq
8796.lqcd onia jim 00:00:16 R workq
8797.lqcd bigJob don 00:00:16 R workq
Status "R" for job cpi-example indicates it is running. Status "Q" indicates
a job is queued and waiting to run. See the qstat man pages for
a description of all the command line options available.
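The job listing can also be filtered in a script. The sketch below assumes the six-column layout shown in the sample output above (job id, name, user, time, state, queue); real qstat output may include header lines or differ between versions. The function name jobs_in_state is hypothetical.

```shell
#!/bin/bash
# Print the ids of jobs owned by a user that are in a given state, reading
# qstat-style lines from standard input.
# Assumed columns: job id, name, user, time used, state, queue.
jobs_in_state() {
    local user="$1" state="$2"
    awk -v u="$user" -v s="$state" '$3 == u && $5 == s { print $1 }'
}
```

A typical use would be `qstat | jobs_in_state "$USER" R` to list your own running jobs.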
A graphical display of cluster status and running jobs
is available via the web cluster status display.
Determining job priorities
The maui scheduler as currently configured calculates scheduling
priority as follows:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
The weights are:
ACCOUNTWEIGHT 1000
CLASSWEIGHT 100
note: CLASS is a synonym for queue (-q on the qsub command)
and the priorities are:
ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2
note: cdms and xqcd have exceeded their allocation and get a low priority.
CLASSCFG[test] PRIORITY=300
CLASSCFG[workq] PRIORITY=200
CLASSCFG[low] PRIORITY=100
"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job
which is eligible to run; i.e. is not precluded from running because of limits.
See diagnose -q below.
For example:
[root@lqcd ~]# /usr/local/maui/bin/diagnose -p
diagnosing job priority information (partition: ALL)
Job PRIORITY* Cred(Accnt:Class)
Weights -------- 1( 1000: 100)
103808 120000 100.0(100.0:200.0)
103809 120000 100.0(100.0:200.0)
103647 110000 100.0(100.0:100.0)
103648 110000 100.0(100.0:100.0)
All of the jobs have the default account priority=100.
Two jobs are in the workq queue/class (wt=100, priority=200) and two jobs
are in the low queue/class (priority=100).
For job 103808 the calculation is:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
1000*100 + 100*200 = 120000
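The same arithmetic can be reproduced in a few lines of shell. This sketch uses the weights and priorities listed above; maui_priority is a hypothetical helper, not a Maui command, and the actual scheduler configuration may include other factors.

```shell
#!/bin/bash
# Weights as listed in the scheduler configuration above.
ACCOUNTWEIGHT=1000
CLASSWEIGHT=100

# Usage: maui_priority <account-priority> <class-priority>
maui_priority() {
    local account_priority="$1" class_priority="$2"
    echo $(( ACCOUNTWEIGHT * account_priority + CLASSWEIGHT * class_priority ))
}

maui_priority 100 200   # default account in workq: prints 120000
maui_priority 100 100   # default account in low:   prints 110000
maui_priority 1 300     # cdms account in test:     prints 31000
```

The first two results match the PRIORITY column of the diagnose -p output above. The third shows that a low account priority outweighs even the highest class priority, because ACCOUNTWEIGHT is ten times CLASSWEIGHT.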
"/usr/local/maui/bin/diagnose -q" shows the reason jobs are held:
For example:
[root@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT partition ALL)
job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)
Another useful command to check the status of your job is
/usr/local/maui/bin/checkjob job_number