QCDSTREAM Benchmark README
==========================

QCDSTREAM extends McCalpin's standard STREAM "Copy" benchmark (see
http://www.cs.virginia.edu/stream/) to include SSE-assisted memory
movement. It also investigates the performance of several very simple
matrix algebra primitives of importance to lattice QCD calculations.

The matrix algebra primitives all involve small complex matrices and
vectors. The matrices are 3x3, the vectors are 3x1, and the vector
pairs ("half Wilson vectors") are 3x2.

A 3x3 complex single-precision matrix occupies 72 bytes of storage
(9 complex elements of 8 bytes each), a length that interacts poorly
with the cachelines in Pentium 4 processors. Since these processors use
64-byte cachelines, a 72-byte matrix cannot fit in one cacheline, so a
memory load of one of these matrices is very inefficient: it brings
128 bytes into the L2 cache in order to work on a 72-byte quantity.

For lattice QCD codes, one solution is to use data structures that pack
matrices together. In this way the next matrix, which is usually the
next one used as the code iterates over all sites in the lattice, is
partially preloaded.

The following memory movements are timed:

- copying a float array to a float array
- copying a double array to a double array (like STREAM "Copy")
- copying a long double array to a long double array, using SSE assist

The following matrix algebra kernels are timed:

- matrix-vector multiply
- matrix-vector-pair multiply
- matrix-matrix multiply

For each of these kernels, the following memory access patterns can be
used in the measurement:

- "in cache": A[0] = B[0] * C[0], repeated
- "sequential": (for i=0; i