FNAL - LQCD Documentation

FAQs

  1. What is the best way to contact the LQCD system administrators at Fermilab?
  2. Is there an email list to contact the other users?
  3. How do I access the archives of the mails sent to the email lists?
  4. How do I check my project allocation balance?
  5. Is there a command to check the status (running, free, busy etc.) of cluster worker nodes?
  6. I am unable to delete my job, and I am getting an email message every 10 seconds. What should I do?
  7. What storage space do I use?
  8. Why should I use fcp instead of rcp?
  9. Is one hour of utilization of a jpsi node the same as one hour of utilization of a kaon node?
  10. What are the batch queue scheduler terms soft and hard node limits?
  11. Is the kaon cluster login node kaon1.fnal.gov down? How do I access the kaon cluster?

  1. What is the best way to contact the LQCD system administrators at Fermilab?

    The best way to contact the LQCD system administrators is by sending an email to lqcd-admin@fnal.gov.

  2. Is there an email list to contact the other users?

    Yes, the email list lqcd-users@fnal.gov has been set up for that purpose, but public posting to this list is restricted.

  3. How do I access the archives of the mails sent to the email lists?

    The archives of the lqcd-admin and lqcd-users email lists are available at:
    http://listserv.fnal.gov/archives/lqcd-admin.html
    http://listserv.fnal.gov/archives/lqcd-users.html

  4. How do I check my project allocation balance?

    Run the following command on any cluster login node:

    [@ds1]$> /usr/local/bin/lquota
    As of Wed Nov 1 10:01, account lqcdproject used
    1230274 of 4245202 allocated nodehours, which is 29.0 %
    of total allocation

    Note: lquota prints the aggregate project allocation balance for all Fermilab LQCD clusters.
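
    If a script needs the remaining balance rather than the printed percentage, the lquota output can be parsed. The following is a minimal sketch that assumes the output keeps the wording of the sample above ("used N of M allocated nodehours"). The function name lquota_remaining is hypothetical and is not part of the LQCD tools.

```shell
# Hypothetical helper: read lquota-style text on stdin and print the
# used, allocated and remaining node-hours plus the percentage used.
# Assumes the wording "used <N> of <M> allocated" from the sample output.
lquota_remaining() {
    tr '\n' ' ' | awk '{
        for (i = 1; i <= NF - 3; i++) {
            if ($i == "used" && $(i + 2) == "of") {
                used = $(i + 1); alloc = $(i + 3)
                printf "used=%d allocated=%d remaining=%d percent=%.1f\n",
                       used, alloc, alloc - used, 100 * used / alloc
                exit
            }
        }
    }'
}

# Sample text copied from the output above. On a login node you would
# pipe the real command instead: /usr/local/bin/lquota | lquota_remaining
sample='As of Wed Nov 1 10:01, account lqcdproject used
1230274 of 4245202 allocated nodehours, which is 29.0 %
of total allocation'
printf '%s\n' "$sample" | lquota_remaining
```

    For the sample output this reports 3014928 node-hours remaining.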

  5. Is there a command to check the status (running, free, busy etc.) of cluster worker nodes?

    Yes, use the lqstat command. For example, on kaon1.fnal.gov, which is the head node for the kaon cluster:

    [user@kaon1]$ /usr/local/bin/lqstat -h

    Print the number of free, busy, down or offline
    nodes in the Fermilab LQCD kaon cluster.

    Usage: lqstat <PRINT-OPTION>

    PRINT-OPTION free,busy,down,offline,online

    ** no option will print the number of free nodes
    of all types. The type of nodes are

    kaon (AMD Opteron dual-processor dual-core, 64-bit, infiniband)

    The same lqstat command produces similar output when run on the login head nodes of the other clusters.

  6. I am unable to delete my job, and I am getting an email message every 10 seconds. What should I do?

    This often occurs due to one or more cluster worker nodes crashing as a result of hardware failure. When a cluster worker node crashes, the PBS server has no way to contact the PBS client on the failed node. The server attempts to kill or rerun the job every 10 seconds and almost always fails, generating an email message at every attempt. If you encounter this problem please contact the Fermilab LQCD system administrators by sending email to lqcd-admin@fnal.gov.

  7. What storage space do I use?

    The following is a summary of available storage space on the Fermilab LQCD clusters. All attempts have been made to keep this table current.

    Area               Description

    /project/xxxxx     Area typically used for approved projects.
                       Visible on all cluster worker nodes via Lustre file-system.
                       Backups nightly.
                       Suitable for output logs, meson correlators and other small data files.
                       NOT suitable for fields, e.g. configs, quark propagators.

    /home/<username>   Home area. Backups nightly.
                       Visible on all cluster worker nodes via NFS.
                       Not suitable for configs or props.
                       Can be used as "run" directory for light production or testing.
                       Quota of about 4 to 6 GB per home directory.

    /pnfs/lqcd         Enstore Tape storage.
                       Visible on cluster login head nodes only.
                       Ideal for permanent storage of parameter files and results.
                       Must use special copy command: 'dccp'

    /lqcdproj          Lustre storage. NO backups.
                       Visible on all cluster worker nodes.
                       Ideal for temporary storage (~month) of very large data files.
                       Disk space usage monitored and disk quotas enforced.

  8. Why should I use fcp instead of rcp?

    If your jobs need to copy data to/from areas in /project and you are currently using commands like

        rcp kaon1:/project/my.file /scratch/my.file

    or

        rcp jpsi1:/project/my.file /scratch/my.file

    please instead use

        fcp kaon1:/project/my.file /scratch/my.file

    or

        fcp jpsi1:/project/my.file /scratch/my.file

    The simplest invocation of fcp:

        fcp   src.file   kaon1:dst.file

    will use rcp to do the transfer and can only transfer a single file (that is, wildcards will not work). Further, switches cannot be passed to rcp.

    If you want to use wildcards, or add switches to rcp such as -r (recursive copy) or -p (preserve attributes), use instead the form:

        fcp  -c rcp  [switches] src dest:dst

    For example:

        fcp  -c rcp -r my_root_dir  kaon1:dest_dir/

    will do a recursive copy of my_root_dir to dest_dir/ on kaon1, and

        fcp  -c rcp file_spec_* kaon1:dest_dir/

    will transfer all files named file_spec_* to dest_dir on kaon1.

    You can specify any command with "-c", so if your scripts use rsync rather than rcp, try

       fcp  -c rsync  [rsync switches] src dst 

    "fcp" has the same command syntax as "rcp". Unlike rcp, fcp throttles access to individual file systems so that only a limited number of accesses are attempted at a time (currently set to 2 per filesystem). Your fcp command will block, waiting in line, until the file access is finished.

    Overall throughput from these disk areas is higher when we limit the number of simultaneous I/O transactions. When many transactions occur simultaneously, data throughput is limited by motion of the heads on the disk drives.

  9. Is one hour of utilization of a jpsi node the same as one hour of utilization of a kaon node?

    Projects are billed in equivalent jpsi core-hours. The conversion factors below are from the 2010 USQCD call for proposals. Charges are based on node hours actually used by a job and not the requested maximum walltime. The surest way to maximize your physics output per charge unit is to benchmark your code on both jpsi and kaon.

    USQCD has clusters with several kinds of nodes, from single-processor, single-core, to quad-processor, eight-core. The Scientific Program Committee will use the following table to convert:

    1 QCDOC node-hour = 0.24 Jpsi core-hour
    1 kaon core-hour  = 0.88 Jpsi core-hour
    1 7n core-hour    = 0.77 Jpsi core-hour
    1 J/psi core-hour = 1    Jpsi core-hour
    1 9q core-hour    = 1.94 Jpsi core-hour
    1 9g(GPU) hour    = 1    GPU-hour
    1 BG/P core-hour  = 0.54 Jpsi core-hour
    1 XT5 core-hour   = 0.50 Jpsi core-hour

    The above numbers are based on the average of asqtad and DWF fermion inverters. In the case of XT4 we used the average of asqtad and clover inverters. See http://www.usqcd.org/fnal/hardware.html for details.
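
    As a worked example of the table above, 1000 kaon core-hours correspond to 1000 x 0.88 = 880 Jpsi core-hours. The sketch below applies the same factors. The function name to_jpsi_core_hours and the short resource labels are illustrative assumptions, not an LQCD tool, and the 9g row is left out because GPU-hours are not converted to Jpsi core-hours.

```shell
# Hypothetical helper: convert hours on a given resource to Jpsi core-hours
# using the factors from the table above.
# Usage: to_jpsi_core_hours <resource> <hours>
to_jpsi_core_hours() {
    case $1 in
        qcdoc) factor=0.24 ;;   # per QCDOC node-hour
        kaon)  factor=0.88 ;;
        7n)    factor=0.77 ;;
        jpsi)  factor=1 ;;
        9q)    factor=1.94 ;;
        bgp)   factor=0.54 ;;   # BG/P
        xt5)   factor=0.50 ;;
        *) echo "unknown resource: $1" >&2; return 1 ;;
    esac
    awk -v h="$2" -v f="$factor" 'BEGIN { printf "%.2f\n", h * f }'
}

to_jpsi_core_hours kaon 1000   # 1000 kaon core-hours
to_jpsi_core_hours 9q 250      # 250 9q core-hours
```

    The two calls print 880.00 and 485.00 respectively.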

  10. What are the batch queue scheduler terms soft and hard node limits?

    Please refer to this section of our documentation.

  11. Is the kaon cluster login node kaon1.fnal.gov down? How do I access the kaon cluster?

    As of January 31, 2012, the Fermilab kaon cluster is accessible only through a gateway, lqcd.fnal.gov. We have removed kaon1.fnal.gov and kaon2.fnal.gov from public internet access so that they can continue to run an older version of Linux and still comply with our security requirements. Other than this change in access procedure, the Kaon cluster will operate as it has up to this date, and no changes should be necessary in your binaries or job scripts. Kaon will continue to operate without charges against your allocations. We do not anticipate operating Kaon past the end of this allocation year.

    To access the head node, kaon1, first use a Kerberized client such as ssh to log in to lqcd.fnal.gov. From lqcd.fnal.gov, you can then ssh to kaon1.

    Because Kaon is now only on a private network, we are no longer able to use the tape-based nightly backups of /home. However, we have put in place an alternative nightly backup scheme that writes backups to a RAID disk on another machine. On kaon1, your lqcd.fnal.gov home area, which IS backed up nightly to tape, is available at /lqcdhome. On lqcd.fnal.gov, your Kaon home area is available at /kaonhome.

    As with other facilities that use ssh gateway nodes, you may find it convenient to use ssh tunneling to access kaon1 via a single hop. An example command to run on your local machine to set up a tunnel is:

      ssh -X -f -N -L 2222:kaon1.fnal.gov:22 username@lqcd.fnal.gov
    

    where "username" is your Fermilab username. After this command returns, you can log in to kaon1 by using

      ssh -p 2222 username@localhost
    

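    Note that "2222" is the port number of the end of the tunnel on your local machine. You can use any unused port number from 1024 to 65535 for this port, since lower port numbers are reserved for privileged processes. The syntax is slightly different for the scp command, which takes an uppercase -P: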

      scp -P 2222 username@localhost:file file
      scp -P 2222 file username@localhost:file
    

    See also http://www.usqcd.org/fnal/transfer.html for a description of the "tunnel.pl" tool, which will set up ssh tunnels for you.
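
    To avoid retyping the port and username, the local end of the tunnel can be given a host alias in your ssh client configuration. The following is a minimal sketch, not a Fermilab-provided tool: the function name kaon_tunnel_stanza and the alias "kaon1-tunnel" are hypothetical, and the stanza uses only the standard ssh_config keywords Host, HostName, Port and User.

```shell
# Hypothetical helper: print an ssh_config stanza that points a host alias
# at the local end of the tunnel. It writes nothing to disk.
# Usage: kaon_tunnel_stanza <alias> <local-port> <username>
kaon_tunnel_stanza() {
    alias_name=$1
    port=$2
    user=$3
    # Reject anything that is not a plain decimal number.
    case $port in
        ''|*[!0-9]*) echo "port must be a number" >&2; return 1 ;;
    esac
    # Ports below 1024 are privileged; 65535 is the largest valid port.
    if [ "$port" -lt 1024 ] || [ "$port" -gt 65535 ]; then
        echo "port must be between 1024 and 65535" >&2
        return 1
    fi
    cat <<EOF
Host $alias_name
    HostName localhost
    Port $port
    User $user
EOF
}

kaon_tunnel_stanza kaon1-tunnel 2222 username
```

    If you append the printed stanza to ~/.ssh/config on your local machine, then while the tunnel is open "ssh kaon1-tunnel" and "scp kaon1-tunnel:file file" should reach kaon1 without the -p or -P switch.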

    If you have any questions, concerns, or issues with kaon, please send email to lqcd-admin@fnal.gov.

usqcd-webmaster@usqcd.org