FNAL - LQCD Documentation


Frequently Asked Questions

  1. What is the best way to contact the LQCD system administrators at Fermilab?
  2. Is there an email list to contact the other users?
  3. How do I access the archives of the mail sent to the email lists?
  4. How do I check my project allocation balance?
  5. Is there a command to check the status (running, free, busy, etc.) of cluster worker nodes?
  6. I am unable to delete my job and I am getting an email message every 10 seconds. What should I do?
  7. What storage space do I use?
  8. Is one hour of utilization of a jpsi node the same as one hour of utilization of a kaon node?
  9. What are the batch queue scheduler terms soft and hard node limits?
  10. How do I change my Kerberos password?
  11. Is there a way to copy more than one file to Enstore tape using wildcards?
  12. Is there a quick way to copy a file from Enstore tape to my local disk?
  13. Can I use Globus Online to transfer data to and from Lustre?
  14. How do I verify that files have been successfully transferred to tape (Enstore)?
  1. What is the best way to contact the LQCD system administrators at Fermilab?

    The best way to contact the LQCD system administrators is by sending an email to lqcd-admin@fnal.gov.

  2. Is there an email list to contact the other users?

    Yes, the email list lqcd-users@fnal.gov has been set up for that purpose, but public posting of emails to this list is restricted.

  3. How do I access the archives of the mail sent to the email lists?

    The archives of the lqcd-admin and lqcd-users email lists are available at:
    http://listserv.fnal.gov/archives/lqcd-admin.html
    http://listserv.fnal.gov/archives/lqcd-users.html
    Use the "Search the archives" link at the top of each page to find the email thread you are looking for.

  4. How do I check my project allocation balance?

    Run the following command on any cluster login node:

    [@ds1]$> /usr/local/bin/lquota
    Account usage for user johndoe since 2012-07-01 00:00:00, 16.8% of year

    Account      |   johndoe YTD   |   Project YTD   |   %   |  Allocation  |  Burn%  |  Yr.End%
    -------------+-----------------+-----------------+-------+--------------+---------+----------
    lqcdproject  |       257,956.1 |     2,088,779.9 |  12.3 |   11,790,000 |   17.7% |   104.9%
    unbilled     |             0.0 |             0.0 |     ? |              |         |
    %            | 100.0% billable | 100.0% billable |       |              |         |
    all          |       257,956.1 |     2,088,779.9 |  12.3 |              |         |
    -------------+-----------------+-----------------+-------+--------------+---------+----------
    (5 rows)

    The "johndoe YTD" column is the usage by this user; the "Project YTD" column is the usage by all users of the account; the "%" column is the fraction of the project usage due to this user. The first row is billable usage, the second row is unbilled usage, the third row is the percentage of total usage that is billable (row 1 / row 4), and the fourth row is the total usage across all users of the account. The project total is the total over all accounts associated with the user. "Allocation" is the current JPSI-normalized core-hour allocation for the account, "Burn%" is the percentage of the current allocation already used, and "Yr.End%" is the percentage of the allocation that would be used by project year end (2013-07-01 00:00:00), extrapolated at the current rate. Absence of data indicates no data; zeroes indicate minimal amounts.

    Note: lquota prints the aggregate project allocation balance for all Fermilab LQCD clusters.

  5. Is there a command to check the status (running, free, busy etc.) of cluster worker nodes?

    For example on kaon1.fnal.gov, which is the head-node for the KAON cluster:

    [user@kaon1]$ /usr/local/bin/lqstat -h

    Print the number of free, busy, down or offline
    nodes in the Fermilab LQCD kaon cluster.

    Usage: lqstat <PRINT-OPTION>

    PRINT-OPTION free,busy,down,offline,online

    ** no option will print the number of free nodes
    of all types. The type of nodes are

    kaon (AMD Opteron dual-processor dual-core, 64-bit, infiniband)
    The same lqstat command provides similar outputs when executed on the cluster login head nodes for the corresponding clusters.
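
    Based on the usage text above, for example (output omitted, since it depends on the current cluster state):

    [user@kaon1]$ /usr/local/bin/lqstat free      # print only the number of free nodes
    [user@kaon1]$ /usr/local/bin/lqstat busy      # print only the number of busy nodes
    [user@kaon1]$ /usr/local/bin/lqstat           # no option: free nodes of all types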

  6. I am unable to delete my job and I am getting an email message every 10 seconds. What should I do?

    This often occurs due to one or more cluster worker nodes crashing as a result of hardware failure. When a cluster worker node crashes, the PBS server has no way to contact the PBS client on the failed node. The server attempts to kill or rerun the job every 10 seconds and almost always fails, generating an email message at every attempt. If you encounter this problem please contact the Fermilab LQCD system administrators by sending email to lqcd-admin@fnal.gov.
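
    For reference, the standard TORQUE commands for listing and deleting your own jobs are shown below (the job ID is hypothetical); if qdel keeps failing and the email messages continue, that is the symptom described above.

    [user@ds1]$ qstat -u $USER      # list your jobs and their job IDs
    [user@ds1]$ qdel 123456         # request deletion of the job (hypothetical job ID)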

  7. What storage space do I use?

    The following is a summary of available storage space on the Fermilab LQCD clusters. All attempts have been made to keep this table current.

    Area                Description
    ------------------  -----------------------------------------------------------------
    /project/xxxxx      Area typically used for approved projects. Visible on all cluster
                        worker nodes via the NFS filesystem. Backups nightly. Suitable for
                        output logs, meson correlators and other small data files; NOT
                        suitable for fields, e.g. configs or quark propagators.
    /home/<username>    Home area. Backups nightly. Visible on all cluster worker nodes
                        via NFS. Not suitable for configs or props. Can be used as a "run"
                        directory for light production or testing. Quota of about 4 to
                        6 GB per home directory.
    /pnfs/lqcd          Enstore tape storage. Visible on cluster login head nodes only.
                        Ideal for permanent storage of parameter files and results. Must
                        use the special copy command 'dccp'.
    /lqcdproj           Lustre storage. NO backups. Visible on all cluster worker nodes.
                        Ideal for temporary storage (~1 month) of very large data files.
                        Disk space usage is monitored and disk quotas are enforced.
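
    As a rough sketch of typical placement (the project, directory and file names below are hypothetical), you might keep small results on /project, large temporary files on /lqcdproj, and archive permanent copies to tape with dccp from a login head node:

    [user@ds1]$ cp run.log /project/myproject/logs/             # small results; backed up nightly
    [user@ds1]$ cp l2464.cfg.1000 /lqcdproj/myproject/configs/  # large temporary data; no backups
    [user@ds1]$ dccp l2464.cfg.1000 /pnfs/lqcd/myproject/       # permanent tape copy (login nodes only)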

  8. Is one hour of utilization of a jpsi node the same as one hour of utilization of a kaon node?

    Projects are billed in equivalent jpsi core-hours. The conversion factors below are from the 2012 USQCD call for proposals. Charges are based on node hours actually used by a job and not the requested maximum walltime. The surest way to maximize your physics output per charge unit is to benchmark your code on both jpsi and kaon.

    USQCD has clusters with several kinds of nodes, from single-processor, single-core, to quad-processor, eight-core. The Scientific Program Committee will use the following table to convert:

    1 J/psi core-hour  = 1     Jpsi core-hour
    1 Ds    core-hour  = 1.33  Jpsi core-hour
    1 7n    core-hour  = 0.77  Jpsi core-hour
    1 9q    core-hour  = 2.2   Jpsi core-hour
    1 9g/Dsg(GPU) hour = 1     GPU-hour
    1 10q   core-hour  = 2.3   Jpsi core-hour
    1 BG/P  core-hour  = 0.54  Jpsi core-hour
    1 XT5   core-hour  = 0.50  Jpsi core-hour

    The above numbers are based on the average of asqtad and DWF fermion inverters. In the case of XT5 we used the average of asqtad (HISQ) and clover inverters. See http://lqcd.fnal.gov/performance.html for details.
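
    As a quick worked example (the core-hour count is illustrative only), a job that consumes 512 Ds core-hours would be charged 512 x 1.33 = 680.96 Jpsi core-hours:

    [user@ds1]$ echo "512 * 1.33" | bc -l      # 680.96 Jpsi core-hours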

  9. What are the batch queue scheduler terms soft and hard node limits?

    Please refer to this section of our documentation.

  10. How do I change my Kerberos password?

    Please refer to the "Changing your Kerberos password" section of our documentation.

  11. Is there a way to copy more than one file to Enstore tape using wildcards?

    I'm trying to issue this command:

    dccp -c -C 2000 filename.${i}* /pnfs/lqcd/myproject/subdir/

    No, dccp doesn't allow wildcards (see the 2nd bullet here). dccp works through a disk cache layer, meaning that files are initially copied to one of a large number of RAID disk arrays managed by the mass storage department, and then automatically migrated from there to tape. Generally the migrations occur very quickly, say within an hour.
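
    One common workaround (a sketch, reusing the options from the command above) is to let the shell expand the wildcard and invoke dccp once per matching file:

    for f in filename.${i}*; do
        dccp -c -C 2000 "$f" /pnfs/lqcd/myproject/subdir/
    done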

    There is a direct tape access command, encp, which also has the semantics of "cp" but which allows wildcards. The downside of encp is that the command has to block until a tape drive is allocated, the tape is mounted, and finally the tape is positioned. dccp writes, on the other hand, commence immediately. On reads, dccp, like encp, will need to read from tape and will block, unless the file is already resident on one of the disk cache pools. It is possible to pre-stage files to the disk pools with a dccp command, so if you want to read in several hundred files you can issue a pre-stage request an hour or so before you need them, and then after the delay use dccp commands to read the actual files.

    Our tape drives are faster than the network connections from ds1/jpsi1/lqcdsrm can feed (LTO4 drives are ~ 150 MB/sec, in the near future LQCD will use T10K drives at closer to 500 MB/sec). In part because of this, in general Fermilab prefers us to use dccp, since then finite resources (tape drives) are not partially idling waiting for data.

    As of Sept. 20, 2012 we are working on provisioning lqcdsrm.fnal.gov with a 10 GigE card, which will allow tape streaming at full rate and would be best for encp commands.

  12. Is there a quick way to copy a file from Enstore tape to my local disk?

    Your dccp command to copy a file from Enstore tape to your local disk will often appear to hang because it is waiting for the tape to be retrieved, mounted and read, which can take from several minutes to an hour or more. To avoid this, use dccp -P, which prestages the request (read requests only) and returns immediately. The dccp -P command stages the file from Enstore tape into a dCache "read pool"; the subsequent copy from the read pool to your local disk is far quicker than copying the file directly from tape. Once the file is in the dCache read pool, execute a second dccp command (without the -P option this time) to copy it from the read pool to your local disk. Use the dccp -P -t -1 command to query whether your file is in the dCache read pool.

    Execute the dccp -P command with the -t option as follows:

         dccp -P -t xxxx source [destination] 
     

    where xxxx is the number of seconds from now at which you would like the file to be available; 3600 is a typical value. If -t is not used, the default interval is zero, and, as explained above, a value of -1 returns the status of your file in the dCache "read pool".
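
    For example (the file path and local destination below are hypothetical), you could prestage a file about an hour before you need it, check whether it has reached the read pool, and then copy it to your local disk:

    dccp -P -t 3600 /pnfs/lqcd/myproject/subdir/prop.dat      # prestage: have the file ready within ~1 hour
    dccp -P -t -1 /pnfs/lqcd/myproject/subdir/prop.dat        # query: is the file in the dCache read pool?
    dccp /pnfs/lqcd/myproject/subdir/prop.dat ./prop.dat      # copy from the read pool to local disk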

  13. Can I use Globus Online to transfer data to and from Lustre?

    Yes, you can use Globus Online to transfer data to and from Lustre. Please follow the simple steps listed here.

  14. How do I verify that files have been successfully transferred to tape (Enstore)?

    The following commands check whether files have been transferred successfully to tape (Enstore). In the example below the commands were run on ds1.fnal.gov.

    [@ds1 $>] source /usr/local/etc/setups.sh
    [@ds1 $>] setup -q stken:x86_64 encp
    [@ds1 $>] en_check /pnfs/lqcd/test/testfile.tar ; echo $?
    1
    

    Results from en_check command (exit statuses):

    0 file is on tape
    1 file is not on tape
    

    There are further details for files packaged with Small File Aggregation (SFA), so it is better to use en_check, since it can be extended to perform extra checks for SFA files as well.
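
    As a sketch, you could wrap en_check in a loop to check every file in a (hypothetical) directory and report any file that is not yet on tape (exit status 1):

    for f in /pnfs/lqcd/myproject/subdir/*; do
        if ! en_check "$f" > /dev/null; then
            echo "NOT on tape yet: $f"
        fi
    done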

    Additional Documentation about Enstore commands:

    USQCD: http://www.usqcd.org/fnal/storage.html
    User's Guide: http://www.fnal.gov/docs/products/enstore/ [for transferring files via ENCP (on-site only), see section 6.4.3]

usqcd-webmaster@usqcd.org