QCDSTREAM Benchmark

Introduction

QCDSTREAM was inspired by John McCalpin's STREAM sustainable memory bandwidth benchmark. Lattice QCD physics codes are typically limited by either memory bandwidth or floating-point throughput. Data access patterns, and the interaction of structure layout with cache line lengths and memory controller dynamics, can significantly affect the performance of these codes. QCDSTREAM is designed to investigate these effects, as well as to measure the performance boost obtained with SSE versions of complex matrix-vector and matrix-matrix operations (see also these SSE discussions).

Details

Memory Bandwidth

The STREAM "Copy" measurement reports the rate at which double precision values are moved between arrays. QCDSTREAM adds measurements of the rate of movement using floats and long doubles. The latter measurement uses inline SSE instructions to take advantage of the 128-bit wide SSE registers and cache-bypass write operations. Because it relies on SSE, this section of QCDSTREAM runs only on Pentium III, Pentium 4, and Athlon processors that support SSE.

Here is a sample extracted from a 1.7 GHz Pentium 4 Xeon (Foster) run:


  Function        Rate (MB/s)             Mean time               Min time        Max time
  --------        -----------             ---------               --------        --------
  Float Copy:     1234.0 +/-  0.9         0.0260 +/- 0.00002      0.0259          0.0260
  Double Copy:    1253.9 +/-  1.0         0.0256 +/- 0.00002      0.0255          0.0256
  SSE Copy:       2121.2 +/-  4.9         0.0151 +/- 0.00004      0.0151          0.0152

Unlike STREAM, which computes rates using the minimum time values, QCDSTREAM uses the mean time and estimates the uncertainty using the standard deviation.

Matrix-Vector

In the "MatVec" section, QCDSTREAM calculates sustained MFlop/sec during matrix-vector multiplies. The matrices are 3x3 complex, and the vectors are 3x1 complex. There are two stanzas, labelled MILC MatVec and SSE MatVec. The MILC version uses C-language code, and the SSE version uses inline GCC assembler.
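
For readers unfamiliar with the operation, the following is a minimal C sketch of a 3x3 complex matrix times 3x1 complex vector multiply. The function and type names mult_su3_mat_vec, su3_matrix and su3_vector appear in this document, but the struct layouts and field names below (complex_f, e, c, real, imag) are assumptions for illustration, not the actual MILC definitions. The layouts were chosen so that a matrix occupies 18 floats and a vector 6 floats, matching the pointer offsets used in the Sequential measurement.

```c
/* Hypothetical single-precision layouts: 18 floats per matrix, 6 per vector. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;
typedef struct { complex_f c[3]; } su3_vector;

/* c = a * b for a 3x3 complex matrix a and a 3x1 complex vector b. */
void mult_su3_mat_vec(const su3_matrix *a, const su3_vector *b, su3_vector *c)
{
    int i, j;

    for (i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (j = 0; j < 3; j++) {
            /* (ar + i*ai) * (br + i*bi) accumulated into row i */
            re += a->e[i][j].real * b->c[j].real
                - a->e[i][j].imag * b->c[j].imag;
            im += a->e[i][j].real * b->c[j].imag
                + a->e[i][j].imag * b->c[j].real;
        }
        c->c[i].real = re;
        c->c[i].imag = im;
    }
}
```

Mathematically, one such multiply needs 9 complex multiplications and 6 complex additions, or 66 real floating point operations. Whether QCDSTREAM uses exactly this count when converting times to MFlop/sec is not stated here.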

Here is a sample extracted from the same Xeon run:


  Function        Rate (MFlop/s)          Mean time               Min time        Max time
  --------        --------------          ---------               --------        --------
  MILC MatVec
    In Cache:      780.4 +/-  0.2         0.0188 +/- 0.00001      0.0188          0.0188
    Sequential:    668.7 +/-  0.6         0.0219 +/- 0.00002      0.0219          0.0220
    Strided:       157.3 +/-  0.0         0.0932 +/- 0.00003      0.0932          0.0933
    PF+Strided:    163.7 +/-  0.0         0.0896 +/- 0.00002      0.0896          0.0897
    Mapped:        154.4 +/-  0.0         0.0950 +/- 0.00002      0.0950          0.0950
    PF+Mapped:     151.7 +/-  0.0         0.0967 +/- 0.00003      0.0967          0.0967
  SSE MatVec
    In Cache:     1790.5 +/-  1.3         0.0082 +/- 0.00001      0.0082          0.0082
    Sequential:    920.0 +/-  1.5         0.0159 +/- 0.00003      0.0159          0.0160
    Strided:       167.3 +/-  0.0         0.0877 +/- 0.00001      0.0876          0.0877
    PF+Strided:    168.1 +/-  0.0         0.0873 +/- 0.00003      0.0872          0.0873
    Mapped:        190.2 +/-  0.0         0.0771 +/- 0.00002      0.0771          0.0771
    PF+Mapped:     183.9 +/-  0.0         0.0797 +/- 0.00001      0.0797          0.0798

Within each stanza, a number of measurements are listed. QCDSTREAM allocates memory to three float pointers, float *a, *b, *c. Typecasting with su3_matrix and su3_vector data types is used to form the arguments to the matrix-vector routine. The measurement types are as follows:

In Cache - the same matrix and vector are used repeatedly:
mult_su3_mat_vec((su3_matrix *)(a), (su3_vector *)(b), (su3_vector *)(c));
Sequential - matrices and vectors are pulled sequentially from the arrays:
mult_su3_mat_vec((su3_matrix *)(a+j*18), (su3_vector *)(b+j*6), (su3_vector *)(c+j*6));
where j is the loop index.
Strided - matrices and vectors are pulled from the arrays every STRIDE values:
mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));
where N is the array length.
PF+Strided - identical to Strided, except that the next arguments are prefetched before the calculation:
   prefetch_matrix(a+((j+1)*STRIDE)%N);
   prefetch_vector(b+((j+1)*STRIDE)%N);
   mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));
Here the prefetch macros are coded using inline GCC assembler and the prefetchnta instruction.
Mapped - matrices and vectors are accessed via an indirect map, map1[j]:
mult_su3_mat_vec((su3_matrix *)(a+map1[j]), (su3_vector *)(b+map1[j]), (su3_vector *)(c+map1[j]));
Note that (a+map1[j]) is equivalent to (&a[map1[j]]), the address of element map1[j] of a.
PF+Mapped - same as Mapped, except that the next arguments are prefetched.
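
The strided, prefetched, and mapped access patterns listed above can be sketched as generic loops in C. This is an illustration under stated assumptions, not the QCDSTREAM code: kernel_fn, run_pf_strided, run_mapped and add_first are hypothetical names, add_first is a trivial stand-in for the matrix-vector routine, and the GCC/Clang builtin __builtin_prefetch (with locality 0, meaning no temporal locality) is used in place of the inline assembler prefetchnta macros described above.

```c
#include <stddef.h>

/* Any routine that reads operands at a and b and writes a result at c. */
typedef void (*kernel_fn)(const float *a, const float *b, float *c);

/* Trivial stand-in kernel (hypothetical): adds the first float of each operand. */
void add_first(const float *a, const float *b, float *c)
{
    c[0] = a[0] + b[0];
}

/*
 * PF+Strided: operands are taken every `stride` floats, wrapping at n.
 * The operands of iteration j+1 are prefetched before iteration j runs.
 * Each prefetch touches one cache line only; the caller must make sure
 * every operand fits inside the arrays at the offsets visited.
 */
void run_pf_strided(kernel_fn kernel, const float *a, const float *b,
                    float *c, size_t n, size_t stride, size_t iters)
{
    size_t j;

    for (j = 0; j < iters; j++) {
        size_t cur = (j * stride) % n;
        size_t next = ((j + 1) * stride) % n;

        __builtin_prefetch(a + next, 0, 0);  /* read, non-temporal hint */
        __builtin_prefetch(b + next, 0, 0);
        kernel(a + cur, b + cur, c + cur);
    }
}

/* Mapped: the offset for iteration j is looked up in an index table. */
void run_mapped(kernel_fn kernel, const float *a, const float *b,
                float *c, const size_t *map1, size_t iters)
{
    size_t j;

    for (j = 0; j < iters; j++)
        kernel(a + map1[j], b + map1[j], c + map1[j]);
}
```

Dropping the two prefetch calls from run_pf_strided gives the plain Strided pattern, and adding a prefetch of the operands at map1[j+1] to run_mapped would give a PF+Mapped variant.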

Matrix-Half_Wilson_Vector

In the "MatHWVec" section, 3x3 complex matrices are multiplied by "Half Wilson" vectors, which are defined to be a pair of 3x1 complex vectors.
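
A minimal C sketch of this operation follows. The type half_wilson_vector, its field h, the helper mat_vec, and the function name mult_su3_mat_hwvec are assumptions made for illustration, as are the complex and matrix layouts; none are confirmed by this document. The sketch simply applies the same 3x3 complex matrix to each of the two 3x1 vectors in the pair.

```c
/* Hypothetical single-precision layouts. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;
typedef struct { complex_f c[3]; } su3_vector;
typedef struct { su3_vector h[2]; } half_wilson_vector;  /* 12 floats */

/* c = a * b for one 3x1 complex vector. */
static void mat_vec(const su3_matrix *a, const su3_vector *b, su3_vector *c)
{
    int i, j;

    for (i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (j = 0; j < 3; j++) {
            re += a->e[i][j].real * b->c[j].real
                - a->e[i][j].imag * b->c[j].imag;
            im += a->e[i][j].real * b->c[j].imag
                + a->e[i][j].imag * b->c[j].real;
        }
        c->c[i].real = re;
        c->c[i].imag = im;
    }
}

/* dest = a * src, applied to both vectors of the pair. */
void mult_su3_mat_hwvec(const su3_matrix *a, const half_wilson_vector *src,
                        half_wilson_vector *dest)
{
    mat_vec(a, &src->h[0], &dest->h[0]);
    mat_vec(a, &src->h[1], &dest->h[1]);
}
```

Because the matrix is loaded once and used for two vectors, each call does twice the arithmetic of a single matrix-vector multiply while reading only 6 more floats of input.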

A sample extracted from the same Xeon run:


  Function        Rate (MFlop/s)          Mean time               Min time        Max time
  --------        --------------          ---------               --------        --------
  MILC MatHWVec
    In Cache:      781.6 +/-  0.2         0.0375 +/- 0.00001      0.0375          0.0375
    Sequential:    708.8 +/-  0.3         0.0414 +/- 0.00002      0.0414          0.0414
    Strided:       237.0 +/-  0.0         0.1237 +/- 0.00002      0.1237          0.1238
    PF+Strided:    248.1 +/-  0.1         0.1182 +/- 0.00003      0.1182          0.1183
    Mapped:        227.4 +/-  0.0         0.1290 +/- 0.00002      0.1290          0.1290
    PF+Mapped:     226.9 +/-  0.0         0.1293 +/- 0.00003      0.1292          0.1293
  SSE MatHWVec
    In Cache:     2943.4 +/-  1.5         0.0100 +/- 0.00001      0.0100          0.0100
    Sequential:   1207.9 +/-  1.0         0.0243 +/- 0.00002      0.0243          0.0243
    Strided:       285.9 +/-  9.9         0.1026 +/- 0.00354      0.0984          0.1054
    PF+Strided:    292.4 +/-  0.1         0.1003 +/- 0.00002      0.1003          0.1004
    Mapped:        342.2 +/-  0.1         0.0857 +/- 0.00002      0.0857          0.0858
    PF+Mapped:     333.6 +/-  0.1         0.0879 +/- 0.00002      0.0879          0.0880

Matrix-Matrix

In the "MatMat" section, pairs of 3x3 complex matrices are multiplied.
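
For reference, a plain C version of the 3x3 complex matrix-matrix product might look like the sketch below. The function name mult_su3_mat_mat and the struct layouts are hypothetical and chosen for illustration; they are not taken from the QCDSTREAM or MILC sources.

```c
/* Hypothetical single-precision layouts: 18 floats per matrix. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;

/* c = a * b for 3x3 complex matrices; c must not alias a or b. */
void mult_su3_mat_mat(const su3_matrix *a, const su3_matrix *b, su3_matrix *c)
{
    int i, j, k;

    for (i = 0; i < 3; i++) {
        for (k = 0; k < 3; k++) {
            float re = 0.0f, im = 0.0f;
            for (j = 0; j < 3; j++) {
                /* accumulate a[i][j] * b[j][k] */
                re += a->e[i][j].real * b->e[j][k].real
                    - a->e[i][j].imag * b->e[j][k].imag;
                im += a->e[i][j].real * b->e[j][k].imag
                    + a->e[i][j].imag * b->e[j][k].real;
            }
            c->e[i][k].real = re;
            c->e[i][k].imag = im;
        }
    }
}
```

Mathematically this is 27 complex multiplications and 18 complex additions, or 198 real floating point operations per product, on 36 floats of input. The higher ratio of arithmetic to data moved, compared with the matrix-vector case, is one plausible reason the out-of-cache MatMat rates in the samples are higher than the MatVec rates.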

A sample extracted from the same Xeon run:


  Function        Rate (MFlop/s)          Mean time               Min time        Max time
  --------        --------------          ---------               --------        --------
  MILC MatMat
    In Cache:      838.5 +/-  0.1         0.0525 +/- 0.00001      0.0525          0.0525
    Sequential:    757.1 +/-  0.3         0.0581 +/- 0.00002      0.0581          0.0581
    Strided:       311.8 +/-  0.0         0.1411 +/- 0.00002      0.1411          0.1411
    PF+Strided:    318.1 +/-  2.7         0.1383 +/- 0.00118      0.1358          0.1389
    Mapped:        289.7 +/-  0.2         0.1519 +/- 0.00011      0.1518          0.1522
    PF+Mapped:     291.2 +/-  1.5         0.1511 +/- 0.00076      0.1507          0.1528
  SSE MatMat
    In Cache:     2562.1 +/-  1.0         0.0172 +/- 0.00001      0.0172          0.0172
    Sequential:   1382.8 +/-  1.2         0.0318 +/- 0.00003      0.0318          0.0319
    Strided:       328.9 +/-  0.1         0.1338 +/- 0.00004      0.1337          0.1338
    PF+Strided:    332.2 +/-  0.1         0.1324 +/- 0.00003      0.1324          0.1325
    Mapped:        409.3 +/-  0.1         0.1075 +/- 0.00003      0.1074          0.1076
    PF+Mapped:     402.1 +/-  0.0         0.1094 +/- 0.00001      0.1094          0.1094

Results

Sample results can be found as follows:

In general, prefetching is more effective on the Athlon MP than on the Xeon or Pentium III processors. The SSE routines give a much greater performance boost on the Xeon processors. Curiously, for the SSE routines on the Xeon processors, indirect (mapped) data access is faster than strided access. In the sample Xeon run above, for example, SSE MatVec reaches 190.2 MFlop/s with mapped access versus 167.3 MFlop/s with strided access.

The Code

The QCDSTREAM code is available here. For the curious, here's the source file and the README. To build, untar and use make all. Note that the SSE instructions will cause "illegal instruction" errors on non-patched 2.2.x kernels (2.4.x kernels all support SSE), and on early Athlon processors. See this discussion for details and a link to a patch for 2.2.x kernels.

A new version of QCDSTREAM code is available here. This version includes measurements of fully vectorized code, as well as measurements of SSE-assisted reads and writes.

Don Holmgren

Last Modified: 4th Sept 2003