- What is the best way to contact the
LQCD system administrators at Fermilab?
The best way to contact the LQCD system
administrators is by sending an email to lqcd-admin@fnal.gov.
- Is there an email list to contact
the other users?
Yes, the email list lqcd-users@fnal.gov has been
set up for that purpose, but public posting to this list is
restricted.
- How do I access the archives of the
mail sent to the email lists?
The archives of the lqcd-admin and lqcd-users email
lists are available at:
http://listserv.fnal.gov/archives/lqcd-admin.html
http://listserv.fnal.gov/archives/lqcd-users.html
- How do I check my project
allocation balance?
Run the following command on
any cluster login node:
[@ds1]$> /usr/local/bin/lquota
As of Wed Nov 1 10:01, account lqcdproject used
1230274 of 4245202 allocated nodehours, which is 29.0 %
of total allocation
Note:
lquota prints the aggregate project
allocation balance for all Fermilab LQCD clusters.
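If you want to track the remaining balance from a script, the report text can be parsed. The following is only a minimal sketch: the regular expression is an assumption derived from the single sample line shown above, and the helper name parse_lquota is hypothetical. The real lquota output may vary, so treat this as a starting point.

```python
import re

# Sample text copied from the lquota output shown above, joined onto one line.
SAMPLE = ("As of Wed Nov 1 10:01, account lqcdproject used "
          "1230274 of 4245202 allocated nodehours, which is 29.0 % "
          "of total allocation")

# Assumption: the pattern is inferred from that one sample line only.
PATTERN = re.compile(r"account\s+(\S+)\s+used\s+(\d+)\s+of\s+(\d+)\s+allocated")


def parse_lquota(text):
    """Return (account, used, allocated, remaining) from lquota-style text."""
    # Collapse line breaks and repeated spaces so wrapped output still matches.
    match = PATTERN.search(" ".join(text.split()))
    if match is None:
        raise ValueError("text does not look like the sample lquota output")
    account = match.group(1)
    used = int(match.group(2))
    allocated = int(match.group(3))
    return account, used, allocated, allocated - used


if __name__ == "__main__":
    account, used, allocated, remaining = parse_lquota(SAMPLE)
    print(f"{account}: {remaining} node-hours remaining "
          f"({100.0 * used / allocated:.1f} % used)")
```

On a login node you could feed the function the captured output of /usr/local/bin/lquota instead of SAMPLE.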
- Is there a command to check the
status (running, free, busy etc.) of cluster worker nodes?
Yes, use the lqstat command. For example, on kaon1.fnal.gov, which is the
head node for the KAON cluster:
[user@kaon1]$ /usr/local/bin/lqstat -h
Print the number of free, busy, down or offline
nodes in the Fermilab LQCD kaon cluster.
Usage: lqstat <PRINT-OPTION>
PRINT-OPTION free,busy,down,offline,online
** no option will print the number of free nodes
of all types. The type of nodes are
kaon (AMD Opteron dual-processor dual-core, 64-bit, infiniband)
The same lqstat
command produces similar output when run on the login
head node of each of the other clusters.
- I am unable to delete my job, and I
am getting an email message every 10 seconds. What should I do?
This often occurs due to one or more cluster worker
nodes crashing as a result of hardware failure. When a cluster worker
node crashes, the PBS server has no way to contact the PBS client on
the failed node. The server attempts to kill or rerun the job every 10
seconds and almost always fails, generating an email message at every
attempt. If you encounter this problem please contact the Fermilab LQCD
system administrators by sending email to lqcd-admin@fnal.gov.
- What storage space do I use?
The following is a summary of available storage
space on the Fermilab LQCD clusters. All attempts have been made to
keep this table current.
| Area | Description |
| --- | --- |
| /project/xxxxx | Area typically used for approved projects. Visible on all cluster worker nodes via Lustre file-system. Backups nightly. Suitable for output logs, meson correlators and other small data files. NOT suitable for fields, e.g. configs, quark propagators. |
| /home/<username> | Home area. Backups nightly. Visible on all cluster worker nodes via NFS. Not suitable for configs or props. Can be used as "run" directory for light production or testing. Quota of about 4 to 6 GB per home directory. |
| /pnfs/lqcd | Enstore Tape storage. Visible on cluster login head nodes only. Ideal for permanent storage of parameter files and results. Must use special copy command: 'dccp'. |
| /lqcdproj | Lustre storage. NO backups. Visible on all cluster worker nodes. Ideal for temporary storage (~month) of very large data files. Disk space usage monitored and disk quotas enforced. |
- Why should I use fcp instead of rcp?
If your jobs need to copy data to/from areas in
/project and you are currently using commands like
rcp kaon1:/project/my.file /scratch/my.file
or
rcp jpsi1:/project/my.file /scratch/my.file
please instead use
fcp kaon1:/project/my.file /scratch/my.file
or
fcp jpsi1:/project/my.file /scratch/my.file
The simplest invocation of fcp:
fcp src.file kaon1:dst.file
will use rcp to do the transfer and can only
transfer a single file (that is, wildcards will not work). Further,
switches cannot be passed to rcp.
If you want to use wildcards, or add switches to rcp
such as -r (recursive copy) or -p (preserve attributes), use instead
the form:
fcp -c rcp [switches] src dest:dst
For example:
fcp -c rcp -r my_root_dir kaon1:dest_dir/
will do a recursive copy of my_root_dir to dest_dir/
on kaon1, and
fcp -c rcp file_spec_* kaon1:dest_dir/
will transfer all files named file_spec_* to
dest_dir on kaon1.
You can specify any command with "-c", so if your
scripts use rsync rather than rcp, try
fcp -c rsync [rsync switches] src dst
"fcp" has the same command syntax as "rcp". Unlike
rcp, fcp throttles access to individual file systems so that only a
limited number of accesses are attempted at a time (currently set to 2
per filesystem). Your fcp command will block, waiting in line, until
the file access is finished.
Overall throughput from these disk areas is higher
when we limit the number of simultaneous I/O transactions. When many
transactions occur simultaneously, data throughput is limited by motion
of the heads on the disk drives.
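The throttling idea described above can be illustrated with a counting semaphore. This is a hedged sketch of the general technique only, not the actual fcp implementation, which is not shown here. The function name run_throttled and the simulated copy (a short sleep) are hypothetical stand-ins; only the limit of 2 comes from the text above.

```python
import threading
import time

MAX_CONCURRENT = 2  # the per-filesystem limit quoted above


def run_throttled(n_copies, copy_seconds=0.01, limit=MAX_CONCURRENT):
    """Run n_copies simulated copies; return the peak number active at once."""
    slots = threading.Semaphore(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def copy_one():
        with slots:  # blocks, "waiting in line", until a slot is free
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(copy_seconds)  # stand-in for the real file transfer
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=copy_one) for _ in range(n_copies)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]


if __name__ == "__main__":
    print("peak simultaneous copies:", run_throttled(6))
```

However many copies are requested, the semaphore keeps the number of simultaneous transfers at or below the limit, which is the behavior the paragraph above attributes to fcp.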
- Is one hour of utilization of a jpsi
node the same as one hour of utilization of a kaon node?
No. Projects are billed in equivalent jpsi core-hours, and one kaon
core-hour counts as 0.88 jpsi core-hours. The
conversion factors below are from the 2010 USQCD call for proposals.
Charges are based on node hours actually used by a job and not the
requested maximum walltime. The surest way to maximize your physics
output per charge unit is to benchmark your code on both jpsi and kaon.
USQCD has clusters with several kinds of nodes, from
single-processor, single-core to quad-processor, eight-core.
The Scientific Program
Committee will use the following table to convert:
1 QCDOC node-hour = 0.24 Jpsi core-hour
1 kaon core-hour = 0.88 Jpsi core-hour
1 7n core-hour = 0.77 Jpsi core-hour
1 J/psi core-hour = 1 Jpsi core-hour
1 9q core-hour = 1.94 Jpsi core-hour
1 9g(GPU) hour = 1 GPU-hour
1 BG/P core-hour = 0.54 Jpsi core-hour
1 XT5 core-hour = 0.50 Jpsi core-hour
The above numbers are based on the average of asqtad
and DWF fermion
inverters. In the case of XT4 we used the average of asqtad and clover
inverters. See http://www.usqcd.org/fnal/hardware.html
for details.
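As a worked example of the conversion, the sketch below turns usage on another machine into equivalent Jpsi core-hours. The factors are copied from the table above. The dictionary and function names are hypothetical, and the 9g GPU entry is left out because it is counted in GPU-hours rather than Jpsi core-hours.

```python
# Conversion factors copied from the table above
# (2010 USQCD call for proposals), in Jpsi core-hours per unit.
JPSI_CORE_HOURS_PER_UNIT = {
    "QCDOC node-hour": 0.24,
    "kaon core-hour": 0.88,
    "7n core-hour": 0.77,
    "J/psi core-hour": 1.0,
    "9q core-hour": 1.94,
    "BG/P core-hour": 0.54,
    "XT5 core-hour": 0.50,
}


def to_jpsi_core_hours(amount, unit):
    """Convert an amount of the given unit to equivalent Jpsi core-hours."""
    if unit not in JPSI_CORE_HOURS_PER_UNIT:
        raise KeyError(f"no conversion factor listed for {unit!r}")
    return amount * JPSI_CORE_HOURS_PER_UNIT[unit]


if __name__ == "__main__":
    # 1000 kaon core-hours expressed in Jpsi core-hours.
    print(f"{to_jpsi_core_hours(1000, 'kaon core-hour'):.1f}")
```

For instance, 1000 kaon core-hours correspond to 1000 x 0.88 = 880 Jpsi core-hours.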
- What are the batch queue scheduler
terms soft and hard node limits?
Please refer to this section of
our documentation.
- Is the kaon cluster login node kaon1.fnal.gov down? How do I access the kaon cluster?
As of Jan 31 2012, the Fermilab kaon cluster is now accessible only
through a gateway, lqcd.fnal.gov. We have removed kaon1.fnal.gov and kaon2.fnal.gov from public
internet access so that they can continue to run an older version of Linux and
still be in compliance with our security requirements. Other than this change
in access procedure, the Kaon cluster will operate as it has up to this date,
and no changes should be necessary in your binaries or job scripts. Kaon will
continue to operate without charges against your allocations. We do not
anticipate operating Kaon past the end of this allocation year.
To access the head node, kaon1, you should first use a kerberized client such
as ssh to login to lqcd.fnal.gov. From lqcd.fnal.gov, you can ssh to kaon1.
Because Kaon is now only on a private network, we will no longer be able to
use the tape-based nightly backups of /home. However, we have put in place an
alternative nightly backup scheme which will write backups to a RAID disk on
another machine. On kaon1, your lqcd.fnal.gov home area, which IS backed up
nightly to tape, is available at /lqcdhome. On lqcd.fnal.gov, your Kaon home
area is available at /kaonhome.
As with other facilities that use ssh gateway nodes, you may find it
convenient to use ssh tunneling to access kaon1 via a single hop. An example
command to use on your local machine to set up a tunnel is:
ssh -X -f -N -L 2222:kaon1.fnal.gov:22 username@lqcd.fnal.gov
where "username" is your Fermilab username. After this command returns, you
can login to kaon1 by using
ssh -p 2222 username@localhost
Note that "2222" is the port number of the end of the tunnel on your local
machine. You can use any unused, unprivileged port (1024 to 65535) for this. The syntax is
slightly different for the scp command, which takes the port with an uppercase -P:
scp -P 2222 username@localhost:file file
scp -P 2222 file username@localhost:file
See also
http://www.usqcd.org/fnal/transfer.html
for a description of the "tunnel.pl" tool which will set up ssh tunnels for
you.
If you have any questions, concerns, or issues with kaon, please
send email to lqcd-admin@fnal.gov.