Here is a sample extracted from a 1.7 GHz Pentium 4 Xeon (Foster) run:
Function        Rate (MB/s)        Mean time            Min time  Max time
--------        -----------        ---------            --------  --------
Float Copy:     1234.0 +/- 0.9     0.0260 +/- 0.00002   0.0259    0.0260
Double Copy:    1253.9 +/- 1.0     0.0256 +/- 0.00002   0.0255    0.0256
SSE Copy:       2121.2 +/- 4.9     0.0151 +/- 0.00004   0.0151    0.0152

Unlike STREAM, which computes rates using the minimum time values, QCDSTREAM uses the mean time and estimates the uncertainty using the standard deviation.
MILC MatVec and SSE MatVec: the MILC version uses C-language code, and the SSE version uses inline GCC assembler.
Here is a sample extracted from the same Xeon run:
Function          Rate (MFlop/s)     Mean time            Min time  Max time
--------          --------------     ---------            --------  --------
MILC MatVec
  In Cache:        780.4 +/- 0.2     0.0188 +/- 0.00001   0.0188    0.0188
  Sequential:      668.7 +/- 0.6     0.0219 +/- 0.00002   0.0219    0.0220
  Strided:         157.3 +/- 0.0     0.0932 +/- 0.00003   0.0932    0.0933
  PF+Strided:      163.7 +/- 0.0     0.0896 +/- 0.00002   0.0896    0.0897
  Mapped:          154.4 +/- 0.0     0.0950 +/- 0.00002   0.0950    0.0950
  PF+Mapped:       151.7 +/- 0.0     0.0967 +/- 0.00003   0.0967    0.0967
SSE MatVec
  In Cache:       1790.5 +/- 1.3     0.0082 +/- 0.00001   0.0082    0.0082
  Sequential:      920.0 +/- 1.5     0.0159 +/- 0.00003   0.0159    0.0160
  Strided:         167.3 +/- 0.0     0.0877 +/- 0.00001   0.0876    0.0877
  PF+Strided:      168.1 +/- 0.0     0.0873 +/- 0.00003   0.0872    0.0873
  Mapped:          190.2 +/- 0.0     0.0771 +/- 0.00002   0.0771    0.0771
  PF+Mapped:       183.9 +/- 0.0     0.0797 +/- 0.00001   0.0797    0.0798

Within each stanza, a number of measurements are listed. QCDSTREAM allocates memory to three float pointers, float *a, *b, *c. Typecasting with the su3_matrix and su3_vector data types is used to form the arguments to the matrix-vector routine. The measurement types are as follows:
In Cache: the same operands are used on every call:
    mult_su3_mat_vec((su3_matrix *)(a), (su3_vector *)(b), (su3_vector *)(c));

Sequential: the operands advance through the arrays, where j is the loop index:
    mult_su3_mat_vec((su3_matrix *)(a+j*18), (su3_vector *)(b+j*6), (su3_vector *)(c+j*6));

Strided: the offsets are computed from STRIDE values, where N is the array length:
    mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));

PF+Strided: the operands of the next iteration are prefetched with the prefetchnta instruction:
    prefetch_matrix(a+((j+1)*STRIDE)%N);
    prefetch_vector(b+((j+1)*STRIDE)%N);
    mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));

Mapped: the offsets are taken from the index array map1[j]:
    mult_su3_mat_vec((su3_matrix *)(a+map1[j]), (su3_vector *)(b+map1[j]), (su3_vector *)(c+map1[j]));
Note that (a+map1[j]) is equivalent to &(a[map1[j]]).
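The access patterns above can be sketched as one self-contained loop. This is an illustration under stated assumptions, not the QCDSTREAM source: complex_f, the su3_matrix and su3_vector layouts (chosen to match the 18-float and 6-float offsets), the plain C mult_su3_mat_vec body, the pattern enum and run_pattern are all stand-ins, and GCC's __builtin_prefetch with locality 0 is swapped in for the prefetch_matrix and prefetch_vector routines, whose implementation the text does not show.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types (assumed layout): 18 floats per matrix, 6 per vector. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;
typedef struct { complex_f c[3]; } su3_vector;

/* Stand-in for the matrix-vector routine: out = m * v (complex 3x3). */
static void mult_su3_mat_vec(const su3_matrix *m, const su3_vector *v,
                             su3_vector *out)
{
    for (int i = 0; i < 3; i++) {
        complex_f acc = { 0.0f, 0.0f };
        for (int k = 0; k < 3; k++) {
            acc.real += m->e[i][k].real * v->c[k].real
                      - m->e[i][k].imag * v->c[k].imag;
            acc.imag += m->e[i][k].real * v->c[k].imag
                      + m->e[i][k].imag * v->c[k].real;
        }
        out->c[i] = acc;
    }
}

/* Hypothetical selector for the access patterns described in the text. */
enum pattern { IN_CACHE, SEQUENTIAL, STRIDED, MAPPED };

/* Runs iters calls with the chosen pattern. The caller must size a, b and c
 * so that every offset leaves room for a full matrix or vector. n is the
 * array length used by STRIDED; map1 is only read by MAPPED. */
static void run_pattern(enum pattern p, float *a, float *b, float *c,
                        size_t iters, size_t stride, size_t n,
                        const size_t *map1, int prefetch)
{
    for (size_t j = 0; j < iters; j++) {
        size_t ia = 0, iv = 0;            /* IN_CACHE: same operands */
        if (p == SEQUENTIAL) {
            ia = j * 18;
            iv = j * 6;
        } else if (p == STRIDED) {
            assert(n > 0);
            ia = iv = (j * stride) % n;
            if (prefetch) {
                /* Hint for the next iteration's operands; locality 0
                 * requests a non-temporal prefetch. */
                size_t next = ((j + 1) * stride) % n;
                __builtin_prefetch(a + next, 0, 0);
                __builtin_prefetch(b + next, 0, 0);
            }
        } else if (p == MAPPED) {
            ia = iv = map1[j];            /* a + map1[j] == &a[map1[j]] */
        }
        mult_su3_mat_vec((su3_matrix *)(a + ia), (su3_vector *)(b + iv),
                         (su3_vector *)(c + iv));
    }
}
```

A timing harness would wrap each run_pattern call in a timer and repeat it several times. The sketch leaves that out so that only the addressing differs between the patterns.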
A sample extracted from the same Xeon run:
Function          Rate (MFlop/s)     Mean time            Min time  Max time
--------          --------------     ---------            --------  --------
MILC MatHWVec
  In Cache:        781.6 +/- 0.2     0.0375 +/- 0.00001   0.0375    0.0375
  Sequential:      708.8 +/- 0.3     0.0414 +/- 0.00002   0.0414    0.0414
  Strided:         237.0 +/- 0.0     0.1237 +/- 0.00002   0.1237    0.1238
  PF+Strided:      248.1 +/- 0.1     0.1182 +/- 0.00003   0.1182    0.1183
  Mapped:          227.4 +/- 0.0     0.1290 +/- 0.00002   0.1290    0.1290
  PF+Mapped:       226.9 +/- 0.0     0.1293 +/- 0.00003   0.1292    0.1293
SSE MatHWVec
  In Cache:       2943.4 +/- 1.5     0.0100 +/- 0.00001   0.0100    0.0100
  Sequential:     1207.9 +/- 1.0     0.0243 +/- 0.00002   0.0243    0.0243
  Strided:         285.9 +/- 9.9     0.1026 +/- 0.00354   0.0984    0.1054
  PF+Strided:      292.4 +/- 0.1     0.1003 +/- 0.00002   0.1003    0.1004
  Mapped:          342.2 +/- 0.1     0.0857 +/- 0.00002   0.0857    0.0858
  PF+Mapped:       333.6 +/- 0.1     0.0879 +/- 0.00002   0.0879    0.0880
A sample extracted from the same Xeon run:
Function          Rate (MFlop/s)     Mean time            Min time  Max time
--------          --------------     ---------            --------  --------
MILC MatMat
  In Cache:        838.5 +/- 0.1     0.0525 +/- 0.00001   0.0525    0.0525
  Sequential:      757.1 +/- 0.3     0.0581 +/- 0.00002   0.0581    0.0581
  Strided:         311.8 +/- 0.0     0.1411 +/- 0.00002   0.1411    0.1411
  PF+Strided:      318.1 +/- 2.7     0.1383 +/- 0.00118   0.1358    0.1389
  Mapped:          289.7 +/- 0.2     0.1519 +/- 0.00011   0.1518    0.1522
  PF+Mapped:       291.2 +/- 1.5     0.1511 +/- 0.00076   0.1507    0.1528
SSE MatMat
  In Cache:       2562.1 +/- 1.0     0.0172 +/- 0.00001   0.0172    0.0172
  Sequential:     1382.8 +/- 1.2     0.0318 +/- 0.00003   0.0318    0.0319
  Strided:         328.9 +/- 0.1     0.1338 +/- 0.00004   0.1337    0.1338
  PF+Strided:      332.2 +/- 0.1     0.1324 +/- 0.00003   0.1324    0.1325
  Mapped:          409.3 +/- 0.1     0.1075 +/- 0.00003   0.1074    0.1076
  PF+Mapped:       402.1 +/- 0.0     0.1094 +/- 0.00001   0.1094    0.1094
Build with make all. Note that the SSE instructions will cause "illegal instruction" errors on unpatched 2.2.x kernels (all 2.4.x kernels support SSE) and on early Athlon processors. See this discussion for details and a link to a patch for 2.2.x kernels.
A new version of QCDSTREAM code is available here. This version includes measurements of fully vectorized code, as well as measurements of SSE-assisted reads and writes.