USQCD Machine Performance



| Machine | Processor per node | Total nodes | Total cores | DWF per node (MFlops) | Clover per node (MFlops) | asqtad per node (MFlops) | Jpsi equivalence (Jpsi-core-hours) |
|---|---|---|---|---|---|---|---|
| jpsi | 2.1 GHz Dual CPU Quad Core Opteron | 856 | 6848 | 10061 | 7423 | 9563 | 1 |
| ds | 2.0 GHz Quad CPU Eight Core Opteron | 421 | 13472 | 51520 | 42048 | 50547 | 1.33 |
| bc | 2.8 GHz Quad CPU Eight Core Opteron | 224 | 7168 | 57408 | 46048 | 56224 | 1.48 |
| 9q | 2.4 GHz Dual CPU Quad Core Nehalem | 320 | 2560 | 19928 | 15056 | 18128 | 1.96 |
| 10q | 2.53 GHz Dual CPU Quad Core Nehalem | 224 | 1792 | 20408 | 15656 | 18046 | 2.00 |
| 12s | 2.0 GHz Dual CPU Eight Core Sandy Bridge | 212 | 3392 | 56500 | 32040 | 43740 | 2.44 |
| BlueGene/P | 850 MHz Quad Core PowerPC 850 | 1024 per rack | 4096 per rack | 2560 | 2511 | 2680 | 0.54 |
| Cray XT5 | 2.1 GHz Quad Core Opteron | 7832 | 31328 | - | 2232 | 2260 | 0.50 |

COMMENTS:

The table above shows the measured performance of DWF, anisotropic clover, and asqtad inverters on the jpsi, Ds, Bc, 9q and 10q clusters, on the ANL BG/P, and the ORNL XT5. All performance numbers are single precision unless otherwise noted.

The DWF, Clover and asqtad performance figures for jpsi, Ds, Bc, 9q and 10q used 128-process runs (16-node, 4-node, 4-node, 16-node, and 16-node respectively), with 8 or 32 processes per node, one process per core. DWF and Clover data were taken with Chroma, using its CG_INVERTER. Clover runs used 6^3x64 local (per core) lattices, and DWF runs used 14x7x7x16 local (per core) lattices with Ls=16. The asqtad runs used 14^4 local (per core) lattices.
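
As a quick check of the run geometry described above, the following sketch derives the node count of a 128-process run from the cores per node, and the number of sites in each local lattice. The cores-per-node values are read off the processor column of the table (Dual CPU Quad Core = 8, Quad CPU Eight Core = 32); the function names are illustrative and are not part of Chroma.

```python
from math import prod

# Cores per node, read off the "Processor per node" column of the table
# (Dual CPU Quad Core = 8, Quad CPU Eight Core = 32).
CORES_PER_NODE = {"jpsi": 8, "ds": 32, "bc": 32, "9q": 8, "10q": 8}

PROCESSES = 128  # total processes per run, one process per core


def nodes_needed(processes, cores_per_node):
    """Nodes required when every core runs exactly one process."""
    return processes // cores_per_node


def local_sites(dims, ls=1):
    """Sites in a local (per core) lattice; ls is the DWF fifth-dimension extent."""
    return prod(dims) * ls


node_counts = {m: nodes_needed(PROCESSES, c) for m, c in CORES_PER_NODE.items()}

clover_sites = local_sites((6, 6, 6, 64))        # 6^3x64
dwf_sites = local_sites((14, 7, 7, 16), ls=16)   # 14x7x7x16 with Ls=16
asqtad_sites = local_sites((14, 14, 14, 14))     # 14^4

print(node_counts)
print(clover_sites, dwf_sites, asqtad_sites)
```

With these inputs the Clover runs hold 13,824 sites per core, the DWF runs 175,616 five-dimensional sites per core, and the asqtad runs 38,416 sites per core.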

The DWF, Clover and asqtad performance figures for 12s are estimates taken from single node benchmarks and an assumed 0.9 scaling factor between single node (16 rank) and eight node (128 rank) runs.
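
The estimate for 12s amounts to a single multiplication, sketched below. The single-node benchmark values themselves are not given here, so the example only shows the arithmetic and, as an illustration, inverts the tabulated DWF figure to see what single-node number it would imply under the stated 0.9 factor. The function names are hypothetical.

```python
SCALING_FACTOR = 0.9  # assumed scaling from a single node (16 ranks) to eight nodes (128 ranks)


def estimate_multi_node(single_node_mflops, scaling=SCALING_FACTOR):
    """Per-node MFlops expected in the eight-node run, given a single-node benchmark."""
    return single_node_mflops * scaling


def implied_single_node(multi_node_mflops, scaling=SCALING_FACTOR):
    """Single-node MFlops that would yield the given multi-node estimate."""
    return multi_node_mflops / scaling


# Illustration only: the single-node number implied by the tabulated
# 12s DWF estimate of 56500 MFlops per node.
implied_dwf = implied_single_node(56500)
print(f"implied single-node DWF: {implied_dwf:.0f} MFlops")
```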

The BG/P asqtad result is the average of the performance of 6^4 and 8^4 local volumes, and is single precision. The DWF result is double precision, using 4^4 (Ls=16) local volumes. The Clover result used 4096 cores.

The XT5 performance figures are based on anisotropic Clover calculations on a 32^3x256 global volume run on 24 cores (Robert Edwards) and on HISQ runs on 64^3x128 lattices on 2k cores (Steve Gottlieb).

The final column of the table gives the Jpsi equivalence of each USQCD resource, that is, its performance relative to jpsi. For all machines except the Cray XT5, this is based on the ratio of the average of the asqtad and DWF performance to the same average on jpsi. For the XT5, which has no DWF figure, the average of the asqtad (HISQ) and Clover inverters is used instead.
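
The exact averaging behind the final column is not spelled out, so the following is only one plausible reading: average the two inverter figures per node, divide by the cores per node, and take the ratio to the same quantity on jpsi. The per-core normalization, and the use of the jpsi asqtad/Clover average as the XT5 reference, are assumptions. Under this reading the results land within a few percent of the published column but do not reproduce it exactly, so the published values presumably rest on details not stated here.

```python
# Per-node MFlops and cores per node, copied from the table above.
# None marks a figure the table does not report.
MACHINES = {
    "jpsi":       {"cores": 8,  "dwf": 10061, "clover": 7423,  "asqtad": 9563},
    "ds":         {"cores": 32, "dwf": 51520, "clover": 42048, "asqtad": 50547},
    "bc":         {"cores": 32, "dwf": 57408, "clover": 46048, "asqtad": 56224},
    "9q":         {"cores": 8,  "dwf": 19928, "clover": 15056, "asqtad": 18128},
    "10q":        {"cores": 8,  "dwf": 20408, "clover": 15656, "asqtad": 18046},
    "12s":        {"cores": 16, "dwf": 56500, "clover": 32040, "asqtad": 43740},
    "BlueGene/P": {"cores": 4,  "dwf": 2560,  "clover": 2511,  "asqtad": 2680},
    "Cray XT5":   {"cores": 4,  "dwf": None,  "clover": 2232,  "asqtad": 2260},
}


def per_core_average(machine, inverters):
    """Average per-node MFlops over the given inverters, divided by cores per node."""
    row = MACHINES[machine]
    return sum(row[i] for i in inverters) / len(inverters) / row["cores"]


def jpsi_equivalence(machine, inverters=("asqtad", "dwf")):
    """Per-core performance relative to jpsi (assumed interpretation)."""
    return per_core_average(machine, inverters) / per_core_average("jpsi", inverters)


ratios = {name: jpsi_equivalence(name) for name in MACHINES if name != "Cray XT5"}
# The XT5 has no DWF figure, so asqtad (HISQ) and Clover are averaged instead.
ratios["Cray XT5"] = jpsi_equivalence("Cray XT5", ("asqtad", "clover"))

for name, value in ratios.items():
    print(f"{name:12s} {value:.2f}")
```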