Quick Links
- Submitting and Monitoring a Batch Job
- Example PBS Batch Script
- Project Accounting
- Queue Policy and Queue Names
- Managing Data Files in Batch Scripts
- Determining job priorities
User jobs are run in batch mode on the cluster's worker nodes. The batch system is the TORQUE Resource Manager, a.k.a. PBS or OpenPBS. A batch script describes the sequence of programs to be run. Batch scripts are ordinary shell scripts, written for bash, sh or csh, with additional PBS option lines prefixed by #PBS, which the shell treats as comments. PBS options may be specified at the beginning of a batch script or as command line options to the qsub command when the script is submitted to the batch system. Available options are described in the Unix man page for qsub.
The example batch script below (Section 2) specifies the required number of nodes and walltime with the #PBS -l line, so they need not be repeated as arguments to the qsub command. Use qsub to submit the example script to PBS as follows:
$ qsub batch-example
8795.lqcd.fnal.gov
Use the qstat command to view queued jobs.
$ qstat
8793.lqcd  fullqcd      steve  0         Q  workq
8795.lqcd  cpi-example  jim    00:00:01  R  workq
8796.lqcd  onia         jim    00:00:16  R  workq
8797.lqcd  bigJob       don    00:00:16  R  workq
Status "R" for job cpi-example indicates it is running.
Status "Q" indicates a job is queued and waiting to run. See the qstat
man pages for a description of all the command line options
available.
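As a minimal sketch of how a script might read a job's status from this listing, the helper below extracts the state column for one job id. The function name job_state is hypothetical, and the sketch assumes the six-column layout shown above (job id, name, user, time, state, queue); the sample rows are copied from the listing rather than read from a live qstat.

```shell
#!/bin/bash
# Print the state column (Q, R, ...) for one job from qstat-style text on stdin.
# Assumption: each job row has six columns: id, name, user, time, state, queue.
# Returns non-zero when the job id is not present in the input.
job_state() {
    local jobid="$1"
    awk -v id="$jobid" '$1 == id { print $5; found = 1 } END { exit found ? 0 : 1 }'
}

# Sample rows taken from the listing above.
sample='8793.lqcd fullqcd steve 0 Q workq
8795.lqcd cpi-example jim 00:00:01 R workq'

printf '%s\n' "$sample" | job_state 8795.lqcd
```

On the cluster the same helper would presumably be fed from the real command, for example `qstat | job_state 8795.lqcd`.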
A graphical display of cluster status and running jobs is available via the web for all the Fermilab LQCD clusters here.
The example bash script below runs the MPI example program cpi (see: /usr/local/mvapich/examples/cpi.c) in parallel on the J/Psi cluster. The source for this script is here: batch-example and output from a run is here: cpi-example.out.
In the example shown below, lines 2-9 are PBS options.
Option "-S" selects bash as the batch job shell.
Option "-N" sets the name this job will have in the queue.
Option "-j oe" merges the job's stderr stream into stdout.
Option "-o" sets the name of the file containing stdout from the job. Job output will be moved to this file when the batch job finishes.
Option "-m n" disables email notification of job status by PBS.
In line 7, "-l nodes=3,walltime=00:01:00"
lists the resources this job is requesting. A job requiring more than
one node must specify a value for nodes either in the batch file or
when the job is submitted. A batch job must also specify the maximum
walltime it expects to take to complete. A job exceeding the requested
time limit will be terminated.
The "-A projectName" option is required and specifies the name of the project to be charged for cluster resource usage. Replace "projectName" with a valid project name that you belong to. The list of current and valid project names is available here.
Option "-q" in line 9 selects the queue the job is submitted to, here jpsi.
See the qsub
documentation for
descriptions of the environment variables such as PBS_O_WORKDIR
and PBS_NODEFILE that PBS sets for batch
jobs.
 1  #!/bin/bash
 2  #PBS -S /bin/bash
 3  #PBS -N cpi-example
 4  #PBS -j oe
 5  #PBS -o ./cpi-example.out
 6  #PBS -m n
 7  #PBS -l nodes=3,walltime=00:01:00
 8  #PBS -A projectName
 9  #PBS -q jpsi
10  cd ${PBS_O_WORKDIR}
11  # print identifying info for this job
12  echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"
13  export OFED_DIR=/usr/local/scidac-ofed-1.4.2/install
14  export QMP_DIR=$OFED_DIR/qmp
15  export QLA_DIR=$OFED_DIR/qla
16  export QIO_DIR=$OFED_DIR/qio
17  export QDP_DIR=$OFED_DIR/qdp
18  export MPI_DIR=/usr/local/mvapich-new
19  export GCC_DIR=/usr/local/gcc-4.3.2
20  export MILC_DIR=/usr/local/MILC
21  # determine number of cores per host
22  coresPerNode=`cat /proc/cpuinfo | grep -c processor`
23  # count the number of nodes listed in PBS_NODEFILE
24  nNodes=$[`cat ${PBS_NODEFILE} | wc --lines`]
25  (( nNodes= 0 + nNodes ))
26  (( nCores = nNodes * coresPerNode ))
27  echo "NODEFILE nNodes=$nNodes ($nCores cores):"
28  cat ${PBS_NODEFILE}
29  # Always stage (copy) any large input files from the cluster head node
30  # or Lustre to your job's control worker node.
31  # All worker nodes have attached disk storage in /scratch
32  # Copy of user data from Lustre as shown below is commented since there are no files to transfer.
33  # cp /lqcdproj/myDataArea/myInputDataFile /scratch
34  echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
35  #The directory the job was submitted from is $PBS_O_WORKDIR.
36  echo ${PBS_O_WORKDIR}
37  cp ${MPI_DIR}/examples/cpi ${PBS_O_WORKDIR}
38  application=${PBS_O_WORKDIR}/cpi
39  echo
40  cpus=$nodes
41  echo "=== Run MPI application on $nCores cores (for e.g. 8 cores per node on the jpsi cluster) ==="
42  $MPI_DIR/bin/mpirun_rsh -np $nCores $application
43  # Always copy any large result files you want to keep back
44  # to the head node or Lustre before exiting your script. The /scratch area on the
45  # workers is wiped clean between jobs.
46  #
47  # Copy below is commented since there are no files to transfer
48  # cp /scratch/myOutputDataFile /lqcdproj/myDataArea
49  exit
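The core count computed in lines 22 to 26 of the example can be sketched in isolation as follows. This is an illustration only: the function name count_cores is hypothetical, the cores per node is passed in as an argument instead of being read from /proc/cpuinfo, and a temporary file with made-up host names stands in for ${PBS_NODEFILE}.

```shell
#!/bin/bash
# Count the lines of a PBS-style node file and multiply by the cores per node.
# Assumption: the node file lists one node per line, as in the example job.
count_cores() {
    local nodefile="$1" cores_per_node="$2"
    local n_nodes
    n_nodes=$(wc -l < "$nodefile")
    echo $(( n_nodes * cores_per_node ))
}

# Stand-in for ${PBS_NODEFILE}: three hypothetical host names, one per line.
nodefile=$(mktemp)
printf '%s\n' node01 node02 node03 > "$nodefile"
count_cores "$nodefile" 8
rm -f "$nodefile"
```

With three nodes and 8 cores per node, as on the jpsi cluster, this prints 24, which is the value the example script would pass to mpirun_rsh as -np.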
Effective Wednesday, August 24th, 2005, all authorized users who submit batch jobs to the Fermilab Lattice QCD clusters are required to specify their project name. An example of a command line job submission would be:
$ qsub -A project_name ...
or in a PBS job script file:
#!/bin/bash
#PBS -A project_name
mpirun -np $cpus $application
echo
exit
The PBS batch queue system will not honor jobs that are
missing the -A project option.
$ qsub -l nodes=2 pbs.script
qsub: Invalid Account
All attempts to submit a job with the wrong project name
will be rejected.
$ qsub -A invalid_project_name ...
qsub: Invalid Account
The allocation table lists all current and valid project names. Project names must appear in a job submission exactly as listed in the allocation table (case sensitive).
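Because a submission without a project is rejected, it can be convenient to check a script before calling qsub. The sketch below is one possible check, not part of the batch system: the function name has_account_line is hypothetical, and it only looks for a "#PBS -A" line inside the script, so it does not cover a project given on the qsub command line and does not verify that the name is valid.

```shell
#!/bin/bash
# Succeed when the batch script contains a "#PBS -A <name>" option line.
# Assumption: the project is given inside the script, not on the qsub command line.
has_account_line() {
    grep -Eq '^#PBS[[:space:]]+-A[[:space:]]+[^[:space:]]+' "$1"
}

# Demonstration on a throwaway script file.
script=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -A project_name' 'exit' > "$script"
if has_account_line "$script"; then
    echo "project line found"
else
    echo "missing #PBS -A line"
fi
rm -f "$script"
```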
Users should submit jobs on the appropriate head node
into one of the following 11 queues, depending upon which cluster is
required, and which run policy fits their need:
| Head Node | Cluster | Normal Queue | Background Queue | Test Queue |
|---|---|---|---|---|
| ds1.fnal.gov | Dsg | gpu | | test_gpu |
| ds1.fnal.gov | Ds | ds | low_ds | test_ds |
| jpsi1.fnal.gov | Jpsi | jpsi | low_jpsi | test_jpsi |
| kaon1.fnal.gov | kaon | kaon | low_kaon | test_kaon |
Normal: Normal queues should be used for all standard running. Jobs in these queues run at normal priority.
Background: The background queue should be used for background jobs that will be run when unused nodes are available. Such jobs will be scheduled at the lowest priority and will run only when there are no other jobs in the corresponding queue.
Test: The test queues should be used for high priority, short jobs that require modest resources. Test queue jobs have the highest priority and will run as soon as nodes are available. A maximum of 2 test queue jobs may execute at a time, but only two per user per cluster. All test queue jobs have a maximum walltime limit of 1 hour and a limit of 64 cores (i.e., 2 nodes on Ds, 8 nodes on Jpsi and 16 nodes on Kaon); on the Dsg cluster the limit is 2 nodes.
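The test queue limits can be checked with a little arithmetic before submitting. The sketch below assumes the 64-core and 1-hour limits quoted above; the function names are hypothetical, and the cores per node must be supplied by the caller (8 on Jpsi according to the example script; the node counts above imply 32 on Ds and 4 on Kaon). The separate 2-node limit on Dsg is not covered.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds; 10# forces base 10 so "08" is not read as octal.
walltime_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed when a request fits the test queue limits quoted above:
# at most 64 cores in total and at most 1 hour of walltime.
fits_test_queue() {
    local nodes="$1" cores_per_node="$2" walltime="$3"
    (( nodes * cores_per_node <= 64 )) && (( $(walltime_seconds "$walltime") <= 3600 ))
}

fits_test_queue 8 8 00:30:00 && echo "8 Jpsi nodes for 30 minutes: fits"
fits_test_queue 9 8 00:30:00 || echo "9 Jpsi nodes for 30 minutes: too large"
```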
We regularly adjust queue limits in response to
reasonable user requests. Please send such requests to lqcd-admin@fnal.gov. Use the
following command to see a particular queue's
settings:
$ qmgr -c "list queue test"
Queue test
    queue_type = Execution
    Priority = 4
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 2
    resources_max.nodect = 8
    resources_max.walltime = 01:00:00
    resources_default.walltime = 01:00:00
    resources_assigned.nodect = 0
    max_user_run = 1
    enabled = True
    started = True
The recommended way to read a large data file from within a batch job is to first stage (copy) the file to the local disk attached to the worker node before your application opens the file. Each worker node has a staging area called /scratch. Note that the file is copied to disk on only the first node of a multinode job (i.e. the one running the batch script). The user must perform additional copies if it is necessary to distribute a file to all worker nodes of a job.
Applications should write their output files to /scratch. Before exiting, scripts should copy all output files to be kept back to the head node as shown in line 48 of the example above.
NOTE: All files on /scratch are deleted after the current job ends and before the next job begins. Dsg, D/s, J/Psi and kaon cluster worker nodes have 440G, 210G, 132G and 83G of scratch disk space available, respectively.
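The stage-in and stage-out pattern described here might be sketched as below. The helper names stage_in and stage_out are hypothetical, temporary directories stand in for the project area and /scratch so the sketch can run anywhere, and a tr command stands in for the real application. It copies to the first node only, as noted above.

```shell
#!/bin/bash
# Copy one input file into the scratch area and print the path of the local copy.
stage_in() {
    local src="$1" scratch="$2"
    cp -- "$src" "$scratch"/ && echo "$scratch/${src##*/}"
}

# Copy a result file from scratch back to permanent storage.
stage_out() {
    local result="$1" dest_dir="$2"
    cp -- "$result" "$dest_dir"/
}

# Temporary directories stand in for the project area and for /scratch.
project=$(mktemp -d)
scratch=$(mktemp -d)
echo "input data" > "$project/myInputDataFile"

local_input=$(stage_in "$project/myInputDataFile" "$scratch")
# Stand-in for the application: read the staged copy, write output to scratch.
tr a-z A-Z < "$local_input" > "$scratch/myOutputDataFile"
stage_out "$scratch/myOutputDataFile" "$project"

cat "$project/myOutputDataFile"
rm -rf "$project" "$scratch"
```

In a real job the second argument of stage_in would be /scratch, and stage_out would run as the last step before exit, since scratch files do not survive the job.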
Torque, a.k.a. PBS or OpenPBS, is the batch queue system which accepts, authenticates and submits jobs to the various queues. The Maui scheduler works in concert with Torque to schedule the running of jobs in queue. The Maui scheduler as currently configured calculates scheduling priorities as follows:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
The weights are:
ACCOUNTWEIGHT 1000
CLASSWEIGHT 1000
Note: CLASS is a synonym for queue (-q on the qsub
command) and the priorities are:
ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2
Note: cdms and xqcd have exceeded their allocation and
get a low priority.
The following are priorities of the various queues.
CLASSCFG[test] PRIORITY=300
CLASSCFG[workq] PRIORITY=200
CLASSCFG[low] PRIORITY=100
"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job which is eligible to run, i.e. a job not precluded from running because of node limits. The node limit is the total number of nodes, both in use (by jobs in R state) and requested (by jobs in Q state). Node limits can be exhausted by a single user or by a project. If a project exhausts the node limits, then all users under the same project are affected. Please read the node usage limits section for a detailed explanation of node limits.
NOTE: As of Dec 2007, non-root users are not authorized to run the diagnose command and should use /usr/local/bin/diagnose_priority instead. If the command produces no output, that is not an error; it indicates that there are no idle jobs in queue.
For example:
[@lqcd ~]# /usr/local/maui/bin/diagnose -p
diagnosing job priority information (partition: ALL)
Job        PRIORITY*   Cred(Accnt:Class)
Weights    --------       1( 1000: 1000)
103808       300000   100.0(100.0:200.0)
103809       300000   100.0(100.0:200.0)
103647       200000   100.0(100.0:100.0)
103648       200000   100.0(100.0:100.0)
All of the jobs have the default account priority of 100. Two jobs are in the workq queue/class (priority=200) and two jobs are in the low queue/class (priority=100). For job 103808 the calculation is:
ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
1000 * 100 + 1000 * 200 = 300000
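The same arithmetic can be reproduced for any account and queue combination. The sketch below hard-codes the two weights of 1000 listed above and takes the account and class priorities as arguments; the function name job_priority is hypothetical, and the real scheduler may apply further factors that are not described on this page.

```shell
#!/bin/bash
# Priority as described above: ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY.
# Assumption: both weights are 1000, as in the configuration quoted in the text.
job_priority() {
    local account_priority="$1" class_priority="$2"
    local account_weight=1000 class_weight=1000
    echo $(( account_weight * account_priority + class_weight * class_priority ))
}

job_priority 100 200   # default account, workq queue
job_priority 100 100   # default account, low queue
job_priority 1 300     # cdms account, test queue
```

The first two calls print 300000 and 200000, matching the diagnose output for jobs 103808 and 103647. The third prints 301000, which illustrates that the queue priority dominates the result once an account's priority has been lowered.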
/usr/local/maui/bin/diagnose -q shows the
reason jobs are held. For example:
[@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT partition ALL)
job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)
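When many jobs are blocked, it may help to filter this report for a single user. The sketch below is a hypothetical helper, blocked_jobs_for_user, that assumes report lines of the form shown above ("job <id> violates ... for user <name> ..."); the sample text is copied from the example output rather than produced by a live diagnose run.

```shell
#!/bin/bash
# Print the ids of blocked jobs that belong to one user.
# Assumption: each report line starts with "job <id>" and contains "user <name>".
blocked_jobs_for_user() {
    local user="$1"
    awk -v u="$user" '$1 == "job" {
        for (i = 2; i < NF; i++)
            if ($i == "user" && $(i + 1) == u) print $2
    }'
}

# Sample lines taken from the example output above.
report='job 112841 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone (R: 1, U: 40)'

printf '%s\n' "$report" | blocked_jobs_for_user syoon
```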
Another useful command to check the status of your job is
[@lqcd ~]# /usr/local/maui/bin/checkjob job_number
Should you need further assistance in interpreting queue behavior, please email us at lqcd-admin@fnal.gov.
Node usage limits:
Each project and each user is restricted in the number of nodes it may use per cluster. There are two types of node limits per project and user, soft and hard. The soft limits apply when there are competing jobs in queue. Once the soft limits have been reached, and if there are available resources, the hard limits apply. If a project reaches its limits then all users in the project are limited by them. If a user reaches the limits then only that user is affected. For example, suppose the soft and hard node limits per user are 10 and 32 on a 32-node cluster. In this case the scheduler will apply the 10 node limit to all jobs per user. After scheduling all jobs in queue, if there are still available resources, the scheduler will launch additional jobs up to the 32 node limit per user. When unused nodes are available on any cluster, these node restrictions will not be enforced. The node restrictions for each cluster are summarized in the table below:
| Cluster | Soft Limit | Hard Limit |
|---|---|---|
| kaon | 64 | 250 |
| Jpsi | 216 | 900 |
| Ds | 96 | 256 |