FNAL - LQCD Documentation

News Updated: 05/01/08

Example PBS Batch Script

User jobs are run in batch mode on the cluster's compute nodes. The batch system is the TORQUE Resource Manager or OpenPBS. A batch script describes the sequence of programs to be run. Batch scripts are ordinary shell scripts, written for bash, sh or csh, with additional PBS option lines prefixed by #PBS, which the shell treats as comments. PBS options may be specified at the beginning of a batch script, or as command line options to the qsub command when the script is submitted to the batch system. Available options are described in the man page for qsub.

The example bash script below runs the MPI example program cpi (see: /usr/local/mvapich/examples/cpi.c) in parallel. The source for this script is here: batch-example and output from a run is here: cpi-example.out.

In the example shown below, lines 2 - 9 are PBS options.

Option "-S" selects bash as the batch job shell.

Option "-N" is the name this job will have in the queue.

Option "-j oe" redirects stream stderr to stdout.

Option "-o" sets the name of the file containing stdout from the job. Job output will be moved to this file when the batch job finishes.

Option "-m n" disables email notification of job status by PBS.

In line 7, "-l nodes=3,walltime=00:01:00" lists the resources this job is requesting. A job requiring more than one node must specify a value for nodes either in the batch file or when the job is submitted. A batch job must also specify the maximum walltime it expects to need. A job exceeding the requested time limit will be terminated.

The "-A projectName" option is required and specifies the name of the project to be charged for cluster resource usage. Replace "projectName" with a valid project name that you belong to.

In line 9, option "-q" names the queue the job is submitted to (pion in this example). See Queue Policy and Queue Names.

See the qsub documentation for descriptions of the environment variables such as PBS_O_WORKDIR and PBS_NODEFILE that PBS sets for batch jobs.

  1. #! /bin/bash
  2. #PBS -S /bin/bash
  3. #PBS -N cpi-example
  4. #PBS -j oe
  5. #PBS -o ./cpi-example.out
  6. #PBS -m n
  7. #PBS -l nodes=3,walltime=00:01:00
  8. #PBS -A projectName
  9. #PBS -q pion
  10. cd ${PBS_O_WORKDIR}
  11. # print identifying info for this job
  12. echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"
  13. # count the number of nodes listed in PBS_NODEFILE
  14. nodes=$(( $(wc -l < "${PBS_NODEFILE}") ))
  15. echo "Job allocated $nodes nodes"
  16. # Always use rcp or rsync to stage any large input files from the head node (lqcd)
  17. # to your job's control worker node.
  18. # All worker nodes have attached disk storage in /scratch
  19. #
  20. # Copy below is commented since there are no files to transfer
  21. # rcp -p lqcd:/data/raid1/data/myDataArea/myInputDataFile /scratch
  22. echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
  23. application=${PBS_O_WORKDIR}/cpi
  24. echo
  25. cpus=$nodes
  26. echo "=== Run MPI application on $cpus cpus (1 cpu per node) ==="
  27. mpirun -np $cpus $application
  28. # Always use rcp or rsync to copy any large result files you want to keep back
  29. # to the head node before exiting your script. The /scratch area on the
  30. # workers is wiped clean between jobs.
  31. #
  32. # There were no output files created in this example.
  33. # rcp -p /scratch/myOutputDataFile lqcd:/data/raid1/data/myDataArea

Queue Policy and Queue Names

On kaon1.fnal.gov, users should submit jobs to one of the following six queues, depending upon which cluster is required, and which run policy:

    kaon
    low_kaon
    test_kaon
    pion
    low_pion
    test_pion

The "pion" and "kaon" queues should be used for all standard running. Jobs in these queues run at normal priority. A given user may use a combined total of 300 nodes across any number of jobs on Kaon and Pion. When unused nodes are available on either cluster, these 300-node restrictions will not be enforced.

The "low_kaon" and "low_pion" queues should be used for background jobs that will be run when unused nodes are available. Such jobs will be scheduled at the lowest priority and will run only when there are no other standard jobs in the corresponding queue.

The "test_kaon" and "test_pion" queues should be used for high priority, short jobs that require modest resources. "test" queue jobs have the highest priority and will run as soon as nodes are available. A maximum of 2 "test_kaon" and 2 "test_pion" jobs may execute at a time, but only one per user per cluster. "test_kaon" jobs have a maximum walltime limit of 1 hour, and a node count limit of 16 (64 cores). "test_pion" jobs have a maximum walltime limit of 1 hour, and a node count limit of 64 (64 cores).

We regularly adjust queue limits in response to reasonable user requests. Please send requests to "lqcd-admin@fnal.gov".

The corresponding queues on lqcd.fnal.gov are:

    workq
    low
    test

Use the following command to see a particular queue's settings:

$ qmgr -c "list queue test"
Queue test
        queue_type = Execution
        Priority = 4
        total_jobs = 0
        state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
        max_running = 2
        resources_max.nodect = 8
        resources_max.walltime = 01:00:00
        resources_default.walltime = 01:00:00
        resources_assigned.nodect = 0
        max_user_run = 1
        enabled = True
        started = True
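
To check a single limit before sizing a job, the qmgr listing can be filtered. The following is a minimal sketch, not part of TORQUE: queue_setting is a hypothetical helper name, and it assumes the "name = value" layout shown in the listing above.

```shell
#!/bin/bash
# Print the value of one setting from "qmgr -c 'list queue NAME'" output
# read on stdin. queue_setting is a hypothetical helper, not a TORQUE command.
queue_setting() {
    local key="$1"
    # Match lines of the form "  key = value" and print everything after "= ".
    awk -v key="$key" '$1 == key && $2 == "=" { sub(/^[^=]*= /, ""); print }'
}

# Usage on a login node (requires qmgr on the PATH):
#   qmgr -c "list queue test" | queue_setting resources_max.nodect
```

A script could compare the printed node count or walltime against its own request before calling qsub.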

Managing Data Files in Batch Scripts

The recommended way to read a large data file from within a batch job is to first stage the file to local disk attached to the worker before your application opens the file. Each worker has a staging area called /scratch. The rcp command is the recommended way to copy files between the head node's filesystems and the worker nodes. Line 21 in the example above shows how to use rcp to copy files to the workers. Note that the file is copied to disk on only the first node of a multinode job (i.e. the one running the batch script). The user must perform additional copies if it is necessary to distribute a file to all worker nodes of a job.
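
The additional copies to the other nodes of a job could be scripted as below. This is a sketch under assumptions, not a site-provided tool: stage_to_all_nodes and the COPY_CMD override are hypothetical names, rcp between worker nodes is assumed to be permitted, and the host names in PBS_NODEFILE are assumed to match the output of hostname on the first node.

```shell
#!/bin/bash
# Copy one file from the first node to /scratch on every other node of the job.
# Arguments: file to copy, PBS node file, name of the local node (optional).
# COPY_CMD is a hypothetical override, e.g. COPY_CMD=echo for a dry run.
stage_to_all_nodes() {
    local file="$1" nodefile="$2" self="${3:-$(hostname)}" node
    # A node may be listed more than once in the node file, so visit each once.
    for node in $(sort -u "$nodefile"); do
        # Skip the local node: the file is already in its /scratch.
        [ "$node" = "$self" ] && continue
        ${COPY_CMD:-rcp} -p "$file" "${node}:/scratch" || return 1
    done
}

# Inside a batch script, after staging the file to the local /scratch:
#   stage_to_all_nodes /scratch/myInputDataFile "${PBS_NODEFILE}"
```

Running it first with COPY_CMD=echo prints the copy arguments for each node without transferring anything.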

Applications should write their output files to /scratch. Before exiting, scripts should copy all output files to be kept back to the head node as shown in line 33 of the example above.

All files on /scratch are deleted after the current job ends and before the next job begins.

Project Accounting

Effective Wednesday, August 24th, 2005, all authorized users who submit PBS jobs to the Fermilab Lattice QCD cluster will be required to specify their project name. An example of a command line job submission would be:

$ qsub -A project_name ...

or in a PBS job script file:

#! /bin/bash
#PBS -A project_name

mpirun -np $cpus $application
echo

exit

The PBS batch queue system will not honor jobs that are missing the -A project option.

$ qsub -l nodes=2 pbs.script
qsub: Invalid Account 

All attempts to submit a job with the wrong project name will be rejected.

$ qsub -A invalid_project_name ...
qsub: Invalid Account 

As of July 1 2006, the following table, in no particular order, lists all the accepted project names. All project names should appear in a job submission the same way as listed in the table below (case sensitive).

Project
charmonium
dynchiral
fermimilcheavylight
hasenfratz
hpqcd
latticesusy
lqcdadmin
milclat
mixbk
nplqcd
usertest
xqcd

For questions or concerns send mail to lqcd-admin@fnal.gov

Submitting and Monitoring a Batch Job

Batch scripts are submitted for execution using the PBS qsub command. See the unix man page and the PBS documentation for a description of all of the command line options for qsub. The example batch script specifies the required number of nodes and walltime with the #PBS -l line, hence, they need not be repeated as arguments to the qsub command. Use qsub to submit the example script to PBS:

$ qsub batch-example
8795.lqcd.fnal.gov
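
For a script that does not carry its own #PBS lines, the same request can be given on the qsub command line. The sketch below only prints such a command rather than submitting it. qsub_command is a hypothetical helper. The -l, -A and -q options are the ones used in the example script.

```shell
#!/bin/bash
# Print (do not run) a qsub command line that carries the resource request,
# project and queue. qsub_command is a hypothetical helper for illustration.
qsub_command() {
    local nodes="$1" walltime="$2" project="$3" queue="$4" script="$5"
    echo "qsub -l nodes=${nodes},walltime=${walltime} -A ${project} -q ${queue} ${script}"
}

# Same request as lines 7 - 9 of the example script:
qsub_command 3 00:01:00 projectName pion batch-example
```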

Use the qstat command to view queued jobs.

$ qstat
8793.lqcd      fullqcd          steve                   0 Q workq    
8795.lqcd      cpi-example      jim              00:00:01 R workq           
8796.lqcd      onia             jim              00:00:16 R workq           
8797.lqcd      bigJob           don              00:00:16 R workq

Status "R" for job cpi-example indicates it is running. Status "Q" indicates a job is queued and waiting to run. See the qstat man page for a description of all the command line options available.
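
A quick tally of running or queued jobs can be taken from the qstat listing. This is a minimal sketch: count_jobs_in_state is a hypothetical helper, and it assumes the job state is the fifth whitespace-separated column, as in the listing above.

```shell
#!/bin/bash
# Count the jobs in a given state ("R", "Q", ...) from qstat output on stdin.
# Assumes the state letter is the fifth whitespace-separated column.
count_jobs_in_state() {
    awk -v state="$1" '$5 == state { n++ } END { print n + 0 }'
}

# Usage on a login node (requires qstat on the PATH):
#   qstat | count_jobs_in_state R
```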

A graphical display of cluster status and running jobs is available via the web cluster status display.

Determining job priorities

The Maui scheduler as currently configured calculates scheduling priority as follows:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY

The weights are:

ACCOUNTWEIGHT   1000
CLASSWEIGHT     100
note: CLASS is a synonym for queue (-q on the qsub command)


and the priorities are:

ACCOUNTCFG[DEFAULT] PRIORITY=100
ACCOUNTCFG[cdms] PRIORITY=1
ACCOUNTCFG[xqcd] PRIORITY=2

note: cdms and xqcd have exceeded their allocation and get a low priority.

CLASSCFG[test]        PRIORITY=300
CLASSCFG[workq]       PRIORITY=200
CLASSCFG[low]         PRIORITY=100


"/usr/local/maui/bin/diagnose -p" shows the priority calculation for each job
which is eligible to run; i.e. is not precluded from running because of limits.
See diagnose -q below.

For example:

[root@lqcd ~]# /usr/local/maui/bin/diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(Accnt:Class)
     Weights   --------       1( 1000:  100)

103808                   120000   100.0(100.0:200.0)
103809                   120000   100.0(100.0:200.0)
103647                   110000   100.0(100.0:100.0)
103648                   110000   100.0(100.0:100.0)


All of the jobs have the default account priority=100. Two jobs are in the workq queue/class (weight=100, priority=200) and two jobs are in the low queue/class (weight=100, priority=100).

For job 103808 the calculation is:

ACCOUNTWEIGHT*ACCOUNTPRIORITY + CLASSWEIGHT*CLASSPRIORITY
 1000 * 100           + 100 * 200   = 120000
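
The same arithmetic can be reproduced for any account and queue. The sketch below hard-codes the weights listed above (ACCOUNTWEIGHT 1000, CLASSWEIGHT 100). job_priority is a hypothetical helper, not a Maui command.

```shell
#!/bin/bash
# Compute the scheduling priority from an account priority and a class
# (queue) priority, using the weights listed in this section.
# job_priority is a hypothetical helper for illustration.
job_priority() {
    local account_priority="$1" class_priority="$2"
    local account_weight=1000 class_weight=100
    echo $(( account_weight * account_priority + class_weight * class_priority ))
}

job_priority 100 200   # default account, workq queue: prints 120000
job_priority 100 100   # default account, low queue:   prints 110000
```

The two results match the PRIORITY column of the diagnose -p output for the workq and low jobs.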


"/usr/local/maui/bin/diagnose -q" shows the reason jobs are held:
For example:

[root@lqcd ~]# /usr/local/maui/bin/diagnose -q
Diagnosing blocked jobs (policylevel SOFT  partition ALL)

job 112841 violates active SOFT MAXJOB limit of 1 for user syoon  (R: 1, U: 4)
job 113723 violates active SOFT MAXJOB limit of 1 for user syoon  (R: 1, U: 4)
job 113786 violates active SOFT MAXJOB limit of 40 for user simone  (R: 1, U: 40)

Another useful command for checking the status of your job is /usr/local/maui/bin/checkjob job_number.

usqcd-webmaster@usqcd.org