Fermilab Lattice QCD Computing Hardware
Computing hardware used in lattice gauge theory calculations, such as
Fermilab's 127-node Myrinet cluster shown to the right, reached a
price/performance better than 50¢ per megaflop (MF). By comparison,
the VAX 11/780s on which the first numerical lattice calculations were
done 20 years ago cost approximately $1,000,000/MF, and Fermilab's
ACPMAPS computer in the early 1990s cost $100/MF.
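The improvement factors implied by these three figures can be worked out directly. The following is only an illustrative sketch: the dictionary and function names are invented for this example, and because the cluster figure is quoted as "better than" 50¢, the computed factors should be read as lower bounds.

```python
# Price per megaflop in US dollars, taken from the paragraph above.
# The cluster value is an upper bound on cost ("better than 50 cents").
COST_PER_MFLOP = {
    "VAX 11/780": 1_000_000.0,
    "ACPMAPS": 100.0,
    "Myrinet cluster": 0.50,
}


def improvement_factor(old: str, new: str) -> float:
    """Return how many times cheaper one megaflop is on `new` than on `old`."""
    return COST_PER_MFLOP[old] / COST_PER_MFLOP[new]


if __name__ == "__main__":
    for old in ("VAX 11/780", "ACPMAPS"):
        factor = improvement_factor(old, "Myrinet cluster")
        print(f"{old} -> Myrinet cluster: at least {factor:,.0f}x cheaper per MF")
```

On these numbers the cluster is at least 2,000,000 times cheaper per megaflop than the VAX 11/780 and at least 200 times cheaper than ACPMAPS.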
Our past and current production clusters are:
- 120-node cluster (qcd) (decommissioned April 2010) with single-socket 2.8 GHz Pentium 4 processors and a Myrinet fabric,
- 486-node cluster (pion) (decommissioned April 2010) with single-socket 3.2 GHz Pentium 640 processors and an Infiniband fabric,
- 600-node cluster (kaon) with dual-socket dual-core Opteron 270 (2.0 GHz) processors and a double-data-rate Infiniband fabric,
- 856-node cluster (jpsi) with dual-socket quad-core Opteron 2352 (2.1 GHz) processors and a double-data-rate Infiniband fabric,
- 420-node cluster (ds) with quad-socket eight-core Opteron 6128 (2.0 GHz) processors and a quad-data-rate Infiniband fabric, and
- 76-node cluster (dsg) with dual-socket four-core Intel Xeon E5630 processors, two NVIDIA Tesla M2050 GPUs per node and a quad-data-rate Infiniband fabric.
The Pentium processors on the qcd and pion clusters had an 800 MHz
front side bus; qcd used DDR memory and pion used DDR2 memory. The
Opteron processors on the kaon cluster also have 800 MHz front side
buses and use DDR memory. The processors on the jpsi cluster use
1066 MHz DDR memory, and the processors on the ds cluster use 1333 MHz
DDR memory. Pictured on the top left is one of the jpsi nodes.
The table below shows the measured performance of DWF
and asqtad inverters on all the Fermilab LQCD clusters. For qcd and pion,
the asqtad
numbers were taken on 64-node runs, 14^4 local lattice per node, and
the DWF numbers were taken on 64-node runs using Ls=16, averaging the
performance of 32x8x8x8 and 32x8x8x12 local lattice runs together. The
DWF and asqtad performance figures for kaon use 128-process
(32-node) runs, with 4 processes per node, one process per core. The
DWF and asqtad performance figures for jpsi use 128-process
(16-node) runs, with 8 processes per node, one process per core. The
DWF and asqtad performance figures for ds use 128-process
(4-node) runs, with 32 processes per node, one process per core.
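The run layouts above can be checked against the node hardware in the cluster list. The sketch below is a small consistency check, not part of any benchmark harness; the table and function names are hypothetical, and the sockets and cores per socket are read off the cluster descriptions earlier on this page.

```python
# Benchmark job layout: (nodes, processes per node), as quoted in the text above.
RUN_LAYOUT = {
    "kaon": (32, 4),
    "jpsi": (16, 8),
    "ds": (4, 32),
}

# Node hardware: (sockets per node, cores per socket), from the cluster list.
NODE_HARDWARE = {
    "kaon": (2, 2),
    "jpsi": (2, 4),
    "ds": (4, 8),
}


def total_processes(cluster: str) -> int:
    """Total MPI processes in the benchmark run for `cluster`."""
    nodes, procs_per_node = RUN_LAYOUT[cluster]
    return nodes * procs_per_node


def cores_per_node(cluster: str) -> int:
    """Physical cores per node, computed as sockets times cores per socket."""
    sockets, cores_per_socket = NODE_HARDWARE[cluster]
    return sockets * cores_per_socket


if __name__ == "__main__":
    for name, (nodes, procs_per_node) in RUN_LAYOUT.items():
        one_per_core = procs_per_node == cores_per_node(name)
        print(f"{name}: {total_processes(name)} processes on {nodes} nodes, "
              f"one process per core: {one_per_core}")
```

Each of the three layouts gives 128 processes in total, and in each case the processes per node match the core count per node, consistent with "one process per core".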
| Cluster | Processor | Nodes | DWF Performance (MFlops/node) | asqtad Performance (MFlops/node) |
|---------|-----------|-------|-------------------------------|----------------------------------|
| qcd | 2.8 GHz Single CPU Single Core P4E | 127 | 1400 | 1017 |
| pion | 3.2 GHz Single CPU Single Core Pentium 640 | 486 | 1729 | 1594 |
| kaon | 2.0 GHz Dual CPU Dual Core Opteron | 600 | 4703 | 3832 |
| jpsi | 2.1 GHz Dual CPU Quad Core Opteron | 856 | 10061 | 9563 |
| ds | 2 GHz Quad CPU Eight Core Opteron | 420 | 51520 | 50553 |
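Because the node sizes differ so much, it can help to normalize the per-node figures by core count. The sketch below is a rough illustration only: the per-core values are plain division of the table entries, not separately measured numbers, and the core counts are an assumption derived from the socket and core descriptions in the cluster list (one core for qcd and pion).

```python
# cluster: (cores per node, DWF MFlops/node, asqtad MFlops/node)
# Performance values come from the table; core counts are derived from
# the cluster list (sockets times cores per socket) and are an assumption here.
PERFORMANCE = {
    "qcd": (1, 1400, 1017),
    "pion": (1, 1729, 1594),
    "kaon": (4, 4703, 3832),
    "jpsi": (8, 10061, 9563),
    "ds": (32, 51520, 50553),
}


def per_core(cluster: str) -> tuple[float, float]:
    """Return (DWF, asqtad) performance in MFlops per core."""
    cores, dwf, asqtad = PERFORMANCE[cluster]
    return dwf / cores, asqtad / cores


if __name__ == "__main__":
    print(f"{'cluster':8} {'DWF/core':>10} {'asqtad/core':>12}")
    for name in PERFORMANCE:
        dwf, asqtad = per_core(name)
        print(f"{name:8} {dwf:10.1f} {asqtad:12.1f}")
```

For example, the ds DWF figure of 51520 MFlops/node works out to 1610 MFlops per core across 32 cores.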
The jpsi cluster uses PCI Express Infiniband network interface cards
in each node, 50 24-port leaf Infiniband switches and one 288-port
spine Infiniband switch. All nodes connect to the leaf switches, and
each leaf connects to the spine with 3:1 oversubscription (6 uplinks
per leaf). Each jpsi node achieves a peak (measured over MPI)
unidirectional bandwidth of 1315 MB/sec and bidirectional bandwidth of
2160 MB/sec. While jpsi uses double-data-rate Infiniband, the ds
cluster uses quad-data-rate Infiniband, with a measured maximum
unidirectional bandwidth of 2640 MB/sec and maximum bidirectional
bandwidth of 4980 MB/sec. Pictured on the right is a 288-port
Infiniband spine switch.
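The 3:1 oversubscription figure follows from the leaf switch port counts, and the measured bandwidths can be compared in the same way. The sketch below is illustrative and rests on one assumption not stated in the text: every leaf port that is not an uplink is available as a node-facing downlink. All names are invented for this example.

```python
LEAF_PORTS = 24        # ports per leaf switch (from the text)
UPLINKS_PER_LEAF = 6   # leaf-to-spine links per leaf (from the text)
LEAF_COUNT = 50        # number of leaf switches (from the text)
JPSI_NODES = 856       # nodes in the jpsi cluster (from the cluster list)

# Measured MPI bandwidths in MB/sec (from the text).
BANDWIDTH = {
    "jpsi": {"uni": 1315, "bi": 2160},
    "ds": {"uni": 2640, "bi": 4980},
}


def oversubscription(ports: int, uplinks: int) -> float:
    """Ratio of downlinks to uplinks on one leaf switch.

    Assumes every port that is not an uplink is a node-facing downlink.
    """
    downlinks = ports - uplinks
    return downlinks / uplinks


def node_port_capacity(leaves: int, ports: int, uplinks: int) -> int:
    """Total node-facing ports across all leaves, under the same assumption."""
    return leaves * (ports - uplinks)


def ds_over_jpsi(direction: str) -> float:
    """Ratio of ds to jpsi measured bandwidth for 'uni' or 'bi'."""
    return BANDWIDTH["ds"][direction] / BANDWIDTH["jpsi"][direction]


if __name__ == "__main__":
    print("oversubscription:", oversubscription(LEAF_PORTS, UPLINKS_PER_LEAF))
    print("node ports:", node_port_capacity(LEAF_COUNT, LEAF_PORTS, UPLINKS_PER_LEAF))
    print(f"ds/jpsi unidirectional: {ds_over_jpsi('uni'):.2f}")
    print(f"ds/jpsi bidirectional: {ds_over_jpsi('bi'):.2f}")
```

With 6 of 24 ports used as uplinks, the remaining 18 give the quoted 3:1 ratio, and 50 such leaves offer 900 node-facing ports, enough for the 856 jpsi nodes.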