QCDSTREAM Benchmark README
==========================

QCDSTREAM extends McCalpin's standard STREAM "Copy" benchmark (see
http://www.cs.virginia.edu/stream/) to include SSE-assisted memory movement.
It also investigates the performance of several very simple matrix algebra
primitives of importance to lattice QCD calculations.

The matrix algebra primitives all involve small complex matrices and vectors.
The matrices are 3x3, the vectors are 3x1, and the vector pairs ("half Wilson
vectors") are 3x2.

3x3 complex single precision matrices occupy 72 bytes of storage, a length
that interacts poorly with the cachelines in Pentium 4 processors.  Since
these processors use 64 byte cachelines, a 72 byte matrix always spans more
than one of them, so loading a single matrix is very inefficient: at least
128 bytes are brought into the L2 cache in order to work on a 72 byte
quantity.  For lattice QCD codes, one solution is to use data structures that
pack matrices together.  In this way the next matrix, which is usually the
one needed next as the code iterates over all sites in the lattice, is
partially preloaded.
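
The cacheline arithmetic above can be checked with a few lines of C.  This is
a minimal sketch and not part of QCDSTREAM: the names lines_touched and
bytes_fetched are hypothetical, and the 64 byte line size is simply the
figure quoted above.

```c
/* Sizes quoted in the text above. */
#define MATRIX_BYTES    72u   /* 3x3 complex single precision: 9 * 2 * 4 */
#define CACHELINE_BYTES 64u

/* Number of cachelines touched by `size` bytes starting at byte `offset`
 * (offset is measured from a cacheline boundary; size must be > 0). */
unsigned lines_touched(unsigned offset, unsigned size, unsigned line)
{
    unsigned first = offset / line;
    unsigned last  = (offset + size - 1u) / line;
    return last - first + 1u;
}

/* Bytes actually brought into cache to cover that range. */
unsigned bytes_fetched(unsigned offset, unsigned size, unsigned line)
{
    return lines_touched(offset, size, line) * line;
}
```

For a matrix starting on a cacheline boundary,
bytes_fetched(0, MATRIX_BYTES, CACHELINE_BYTES) is 128.  Eight matrices
packed contiguously from a boundary occupy 576 bytes, which is exactly nine
cachelines, so no fetched byte is wasted.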

The following memory movements are timed:

- copying a float array to a float array
- copying a double array to a double array (like STREAM "Copy")
- copying a long double array to a long double array, using SSE assist

The following matrix algebra kernels are timed:

- matrix-vector multiply
- matrix-vector-pair multiply
- matrix-matrix multiply
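
To make the first kernel concrete, here is a plain C sketch of a 3x3 complex
matrix times 3x1 complex vector product.  It is an illustration only, not the
MILC routine: the type names cplx, mat3 and vec3 and the function mat_vec are
hypothetical.

```c
/* Hypothetical types for illustration; not the MILC definitions. */
typedef struct { float re, im; } cplx;
typedef struct { cplx e[3][3]; } mat3;   /* 72 bytes */
typedef struct { cplx c[3]; }    vec3;   /* 24 bytes */

/* out = a * b.  `out` must not alias `b`. */
void mat_vec(const mat3 *a, const vec3 *b, vec3 *out)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            /* complex multiply-accumulate: (re,im) += a[i][j] * b[j] */
            re += a->e[i][j].re * b->c[j].re - a->e[i][j].im * b->c[j].im;
            im += a->e[i][j].re * b->c[j].im + a->e[i][j].im * b->c[j].re;
        }
        out->c[i].re = re;
        out->c[i].im = im;
    }
}
```

By the usual count (6 flops per complex multiply, 2 per complex add), one
such product costs 9*6 + 6*2 = 66 flops.  Whether QCDSTREAM uses exactly this
count for its MFlop/s figures is an assumption here.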

For each of these kernels, the following memory access patterns can be used in
the measurement:

- "in cache":  A[0] = B[0] * C[0], repeated
- "sequential": for (i=0; i<N; i++) A[i] = B[i] * C[i];
- "strided": for (i=0; i<N/stride; i++) {
               A[i*stride] = B[i*stride] * C[i*stride];
             }
- "mapped": for (i=0; i<N; i++) {
              A[map[i]] = B[map[i]] * C[map[i]];
            }

Optionally, in the strided or mapped cases, measurement with prefetching of
the next sites can be done.
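
The strided and mapped loops with prefetching of the next site might look
roughly like the sketch below.  It is not the QCDSTREAM source: plain floats
stand in for the matrices and vectors, the function names and the PREFETCH
macro are hypothetical, and the prefetch uses the GCC/Clang builtin
__builtin_prefetch (a no-op fallback is used on other compilers).

```c
#include <stddef.h>

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)   /* hint only; never faults */
#else
#define PREFETCH(p) ((void)0)
#endif

/* "PF+Mapped": while working on site map[i], prefetch site map[i+1]. */
void mapped_multiply_pf(float *a, const float *b, const float *c,
                        const size_t *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            PREFETCH(&b[map[i + 1]]);
            PREFETCH(&c[map[i + 1]]);
        }
        a[map[i]] = b[map[i]] * c[map[i]];
    }
}

/* "PF+Strided": same idea with a fixed stride (stride must be nonzero). */
void strided_multiply_pf(float *a, const float *b, const float *c,
                         size_t n, size_t stride)
{
    size_t count = n / stride;
    for (size_t i = 0; i < count; i++) {
        size_t k = i * stride;
        if (i + 1 < count) {
            PREFETCH(&b[k + stride]);
            PREFETCH(&c[k + stride]);
        }
        a[k] = b[k] * c[k];
    }
}
```

Prefetching only one site ahead is the simplest choice; how far ahead the
real benchmark prefetches is not stated here.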

Optionally, for each matrix algebra kernel, measurements using inlined
SSE-assisted code can be performed.

A sample, annotated run (1.7 GHz dual Xeon):

******************************
Function        Rate (MB/s)             Mean time               Min time        Max time
--------        -----------             ---------               --------        --------
Float Copy:     1229.8 +/-  3.7         0.0261 +/- 0.00008      0.0260          0.0263
Double Copy:    1270.1 +/-  1.4         0.0252 +/- 0.00003      0.0252          0.0253
SSE Copy:       2035.9 +/-  7.4         0.0158 +/- 0.00006      0.0157          0.0159
******************************
The middle line is the standard STREAM "Copy" (though the arrays are malloc'd
rather than statically allocated).  The SSE copy uses inline gcc assembler,
loading each long double into a 128-bit SSE register, then using a
cache-bypass write to send the value to the destination array.  Essentially
all of the performance increase with the SSE assist is due to the
cache-bypass.
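
The load-then-bypass-store idea can be sketched in C as follows.  Note the
substitutions: QCDSTREAM uses inline gcc assembler, whereas this sketch uses
the compiler intrinsics from <xmmintrin.h>, and it copies generic 16 byte
chunks rather than long doubles.  The function name stream_copy16 is
hypothetical.  When the compiler does not enable SSE, the sketch falls back
to memcpy.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Copy n_bytes from src to dst.  Both pointers must be 16-byte aligned and
 * n_bytes must be a multiple of 16. */
void stream_copy16(void *dst, const void *src, size_t n_bytes)
{
#if defined(__SSE__)
    const float *s = (const float *)src;
    float *d = (float *)dst;
    for (size_t i = 0; i < n_bytes / 16; i++) {
        __m128 v = _mm_load_ps(s + 4 * i);  /* aligned 128-bit load */
        _mm_stream_ps(d + 4 * i, v);        /* non-temporal (cache-bypass) store */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
#else
    memcpy(dst, src, n_bytes);              /* no SSE: ordinary copy */
#endif
}
```

The streaming store avoids reading the destination cacheline into cache
before overwriting it, which is consistent with the observation above that
the gain comes from the cache bypass rather than from the wider registers.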

******************************
Function        Rate (MFlop/s)          Mean time               Min time        Max time
--------        --------------          ---------               --------        --------
MILC MatVec
  In Cache:      775.5 +/-  0.6         0.0189 +/- 0.00002      0.0189          0.0190
  Sequential:    669.6 +/-  1.0         0.0219 +/- 0.00003      0.0219          0.0220
  Strided:       157.7 +/-  0.1         0.0930 +/- 0.00003      0.0929          0.0931
  PF+Strided:    164.2 +/-  0.1         0.0893 +/- 0.00003      0.0893          0.0894
  Mapped:        153.9 +/-  0.1         0.0953 +/- 0.00003      0.0952          0.0954
  PF+Mapped:     152.3 +/-  0.0         0.0963 +/- 0.00003      0.0962          0.0963
SSE MatVec
  In Cache:     1787.7 +/-  6.5         0.0082 +/- 0.00003      0.0082          0.0083
  Sequential:    922.7 +/-  1.1         0.0159 +/- 0.00002      0.0159          0.0159
  Strided:       166.3 +/-  0.0         0.0882 +/- 0.00003      0.0882          0.0882
  PF+Strided:    165.9 +/-  1.6         0.0884 +/- 0.00086      0.0859          0.0887
  Mapped:        189.9 +/-  0.1         0.0772 +/- 0.00006      0.0772          0.0774
  PF+Mapped:     185.6 +/-  0.1         0.0790 +/- 0.00002      0.0790          0.0790
******************************
The "MILC" label refers to the MILC lattice QCD code.  The C-language matrix
algebra routines are taken from that code (see the MILC homepage for more info:
http://physics.indiana.edu/~sg/milc.html).  For both the C and SSE-assisted
versions, performance in MFlop per second is measured for each of the data
access patterns.


Building QCDSTREAM requires a recent binutils (2.11 or newer) in order to
recognize the SSE operations.  Running with SSE enabled requires either a
2.4.x Linux kernel, or a 2.2.x kernel patched to support SSE.

QCDSTREAM may be invoked with arguments to control which measurements are
taken:
  --memory      Do memory tests
  --sse         Do SSE tests
  --prefetch    Do prefetch tests
  --mapped      Do mapped linear algebra timings
  --strided     Do strided linear algebra timings
  --incache     Do in-cache linear algebra timings
  --seq         Do sequential linear algebra timings
  --matvec      Do matrix-vector timings
  --mathwvec    Do matrix-half-wilson-vector timings
  --matmat      Do matrix-matrix timings
  --all         Do all measurements (memory + linear algebra)   [Default]
  --allmat      Do all linear algebra measurements

To run on a machine without an SSE-enabled kernel, use
  qcdstream --memory --allmat --prefetch --mapped --strided --incache --seq