- What is the best way to contact the LQCD system administrators at Fermilab?
The best way to contact the LQCD system administrators is by sending an email to
lqcd-admin@fnal.gov.
- Is there an email list to contact the other users?
Yes, the email list lqcd-users@fnal.gov
has been set up for that purpose.
- How do I access the archives of the mail sent to the email lists?
The archives of the lqcd-admin and lqcd-users email lists are available at:
http://listserv.fnal.gov/archives/lqcd-admin.html
http://listserv.fnal.gov/archives/lqcd-users.html
- How do I check my project allocation balance?
Run the following command on lqcd.fnal.gov:
[@lqcd]$> /usr/local/bin/lquota
As of Wed Nov 1 10:01, account lqcdproject used
1230274 of 4245202 allocated nodehours, which is 29.0 %
of total allocation
Note: lquota prints the project allocation balance for all Fermilab LQCD clusters.
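If you want the remaining balance rather than the amount used, the lquota report can be post-processed. The following is a minimal Python sketch, assuming the output keeps the wording shown in the sample above; the function name parse_lquota and the SAMPLE string are illustrative and are not part of lquota.

```python
import re

# Sample text copied from the lquota output shown above.
SAMPLE = (
    "As of Wed Nov 1 10:01, account lqcdproject used\n"
    "1230274 of 4245202 allocated nodehours, which is 29.0 %\n"
    "of total allocation\n"
)

def parse_lquota(text):
    """Return (used, allocated) node-hours parsed from lquota-style text.

    Assumes the 'used N of M allocated' wording of the sample output.
    """
    match = re.search(r"used\s+(\d+)\s+of\s+(\d+)\s+allocated", text)
    if match is None:
        raise ValueError("unrecognized lquota output")
    return int(match.group(1)), int(match.group(2))

used, allocated = parse_lquota(SAMPLE)
remaining = allocated - used
percent_used = 100.0 * used / allocated
print(f"{remaining} node-hours remaining ({percent_used:.1f} % used)")
# prints: 3014928 node-hours remaining (29.0 % used)
```

In practice the text would come from running /usr/local/bin/lquota on the head-node instead of from the SAMPLE string.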
- Is there a command to check the status (running, free, busy etc.) of cluster worker nodes?
On lqcd.fnal.gov, which is the head-node for the QCD cluster:
[@lqcd]$> /usr/local/bin/lqstat -h
Print the number of free, busy, down or offline
nodes in the Fermilab LQCD cluster.
Usage: lqstat [print-option]
PRINT-OPTION free,busy,down,offline,online
** no option will print the number of free nodes
of all types. There are two types of nodes
On kaon1.fnal.gov, which is the head-node for the KAON and PION (64-bit) clusters:
[@kaon2]$ /usr/local/bin/lqstat -h
Print the number of free, busy, down or offline
nodes in the Fermilab LQCD cluster.
Usage: lqstat [print-option]
PRINT-OPTION free,busy,down,offline,online
** no option will print the number of free nodes.
kaon (AMD Opteron dual-processor dual-core, 64-bit)
pion (Intel P4 single-processor, 64-bit)
printed in the following order
kaon pion
- I am unable to delete my job, and I am getting an email message every 10 seconds. Why?
This often occurs due to one or more cluster worker
nodes crashing as a result of hardware failure. When a cluster worker
node crashes, the PBS server has no way to contact the PBS client on
the failed node. The server attempts to kill or rerun the job every 10
seconds and almost always fails, generating an email message at every
attempt. If you encounter this problem please contact the Fermilab LQCD
system administrators by sending email to lqcd-admin@fnal.gov.
- What storage space do I use?
The following is a summary of available storage
space on the Fermilab LQCD clusters. All attempts have been made to
keep this table current.
| Area | Description |
|---|---|
| /project/xxxxx | Area typically used for approved projects. Visible on all cluster worker nodes via NFS. Backups nightly. Suitable for output logs, meson correlators and other small data files. NOT suitable for fields, e.g. configs, quark propagators. |
| /home/<username> | Home area. Backups nightly. Visible on all cluster worker nodes via NFS. Not suitable for configs or props. Can be used as "run" directory for light production or testing. Quota of about 4 GB per home directory. |
| /data/raidX | Raid storage. NO backups. Visible on all head-nodes ONLY. No quotas, unmanaged common area available to all users. Individual disks are subject to filling up. Must use rcp or rsync to copy data files from cluster worker nodes to head-nodes. Suitable for configurations and propagators. |
| /pnfs/volatile | Dcache storage. NO backups. Visible on all cluster worker nodes. Ideal for temporary storage (~month) of very large data files. Must use special copy command: 'dccp'. Files are deleted by a "Least Recently Used" policy when space is tight. |
- Why should I use fcp instead of rcp?
If your jobs need to copy data to or from areas in /data/raidX or /project, and you are currently using commands like
rcp kaon1:/data/raid4/my.file /scratch/my.file
or
rcp lqcd:/data/raid4/my.file /scratch/my.file
please instead use
fcp kaon1:/data/raid4/my.file /scratch/my.file
or
fcp lqcd:/data/raid4/my.file /scratch/my.file
The simplest invocation of fcp:
fcp src.file kaon1:dst.file
will use rcp to do the transfer and can only transfer a single file (that is, wildcards will not work). Further, switches cannot be passed to rcp.
If you want to use wildcards, or add switches to rcp such as -r (recursive copy) or -p (preserve attributes), use instead the form:
fcp -c rcp [switches] src dest:dst
For example:
fcp -c rcp -r my_root_dir kaon1:dest_dir/
will do a recursive copy of my_root_dir to dest_dir/ on kaon1, and
fcp -c rcp file_spec_* kaon1:dest_dir/
will transfer all files named file_spec_* to dest_dir on kaon1.
You can specify any command with "-c", so if your scripts use rsync rather than rcp, try
fcp -c rsync [rsync switches] src dst
"fcp" has the same command syntax as "rcp". Unlike rcp, fcp throttles access to individual file systems so
that only a limited number of accesses are attempted at a time (currently set to 2 per filesystem). Your
fcp command will block, waiting in line, until the file access is finished.
Overall throughput from these disk areas is higher when we limit the number of simultaneous I/O transactions.
When many transactions occur simultaneously, data throughput is limited by motion of the heads on the disk drives.
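The throttling idea can be illustrated with a counting semaphore. The sketch below is not how fcp is implemented (fcp coordinates transfers across many nodes, and its internals are not described here); it only shows, inside a single Python process, how a limit of 2 simultaneous accesses makes additional copies wait in line. The function throttled_copy and the sleep that stands in for a transfer are hypothetical.

```python
import threading
import time

MAX_CONCURRENT = 2  # the per-filesystem limit quoted above

slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = 0  # copies currently in progress
peak = 0    # highest number of copies seen in progress at once

def throttled_copy(name):
    """Stand-in for one transfer; blocks until a slot is free."""
    global active, peak
    with slots:  # wait in line until fewer than MAX_CONCURRENT copies run
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # placeholder for the real copy of `name`
        with lock:
            active -= 1

# Six hypothetical transfers started at the same time.
threads = [
    threading.Thread(target=throttled_copy, args=(f"file{i}",))
    for i in range(6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak simultaneous copies:", peak)  # never exceeds MAX_CONCURRENT
```

All six transfers complete, but at most two are ever in progress together, which is the behaviour that keeps the disk heads from thrashing.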
- Is one hour of utilization on a kaon node the same as one hour of utilization on a pion node?
No. Projects are billed in equivalent 6n node hours. The conversion factors below are from the 2008 USQCD call for proposals.
Charges are based on node hours actually used by a job and not the requested maximum walltime. The surest way to maximize
your physics output per charge unit is to benchmark your code on both kaon and pion.
USQCD has clusters with several kinds of nodes, from single-processor,
single-core, to dual-processor, quad-core. The Scientific Program
Committee will use the following table to convert:
1 QCDOC node-hour = 0.122 6n-equivalent node-hour
1 qcd node-hour = 0.498 6n-equivalent node-hour
1 pion node-hour = 0.683 6n-equivalent node-hour
1 6n node-hour = 1 6n-equivalent node-hour
1 kaon node-hour = 1.757 6n-equivalent node-hour
1 7n node-hour = 3.1 6n-equivalent node-hour
These numbers are based on the average of asqtad and DWF fermion
inverters. See http://lqcd.fnal.gov/performance.html
for details on clusters, including performance of the clover inverter.
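As a worked example of the conversion, the Python sketch below applies the factors from the table to a job. The job size (16 nodes for 4 hours of actual run time) and the function name charge are made up for illustration; only the conversion factors come from the table above.

```python
# Conversion factors copied from the table above
# (2008 USQCD call for proposals).
SIX_N_EQUIVALENT = {
    "QCDOC": 0.122,
    "qcd": 0.498,
    "pion": 0.683,
    "6n": 1.0,
    "kaon": 1.757,
    "7n": 3.1,
}

def charge(node_type, nodes, wall_hours):
    """Return the charge in 6n-equivalent node-hours for one job.

    wall_hours is the time actually used, not the requested walltime.
    """
    return SIX_N_EQUIVALENT[node_type] * nodes * wall_hours

# Hypothetical job: 16 nodes for 4 hours of actual run time.
for node_type in ("kaon", "pion"):
    print(node_type, round(charge(node_type, 16, 4.0), 3))
# prints:
# kaon 112.448
# pion 43.712
```

With these factors the same wall-clock time costs about 2.57 times more on kaon than on pion, so kaon is the better choice only if your code runs more than about 2.57 times faster there.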