Here is a sample extracted from a 1.7 GHz Pentium 4 Xeon (Foster) run:
Function Rate (MB/s) Mean time Min time Max time
-------- ----------- --------- -------- --------
Float Copy: 1234.0 +/- 0.9 0.0260 +/- 0.00002 0.0259 0.0260
Double Copy: 1253.9 +/- 1.0 0.0256 +/- 0.00002 0.0255 0.0256
SSE Copy: 2121.2 +/- 4.9 0.0151 +/- 0.00004 0.0151 0.0152
Unlike STREAM, which computes rates using the minimum time values, QCDSTREAM uses the mean time and estimates the uncertainty using the standard deviation.
MILC MatVec and SSE MatVec.
The MILC version uses C-language code, and the SSE version uses inline GCC assembler.
Here is a sample extracted from the same Xeon run:
Function Rate (MFlop/s) Mean time Min time Max time
-------- -------------- --------- -------- --------
MILC MatVec
In Cache: 780.4 +/- 0.2 0.0188 +/- 0.00001 0.0188 0.0188
Sequential: 668.7 +/- 0.6 0.0219 +/- 0.00002 0.0219 0.0220
Strided: 157.3 +/- 0.0 0.0932 +/- 0.00003 0.0932 0.0933
PF+Strided: 163.7 +/- 0.0 0.0896 +/- 0.00002 0.0896 0.0897
Mapped: 154.4 +/- 0.0 0.0950 +/- 0.00002 0.0950 0.0950
PF+Mapped: 151.7 +/- 0.0 0.0967 +/- 0.00003 0.0967 0.0967
SSE MatVec
In Cache: 1790.5 +/- 1.3 0.0082 +/- 0.00001 0.0082 0.0082
Sequential: 920.0 +/- 1.5 0.0159 +/- 0.00003 0.0159 0.0160
Strided: 167.3 +/- 0.0 0.0877 +/- 0.00001 0.0876 0.0877
PF+Strided: 168.1 +/- 0.0 0.0873 +/- 0.00003 0.0872 0.0873
Mapped: 190.2 +/- 0.0 0.0771 +/- 0.00002 0.0771 0.0771
PF+Mapped: 183.9 +/- 0.0 0.0797 +/- 0.00001 0.0797 0.0798
Within each stanza, a number of measurements are listed. QCDSTREAM allocates memory for
three float arrays, float *a, *b, *c. These pointers are cast to su3_matrix * and
su3_vector * to form the arguments to the matrix-vector routine. The measurement
types are as follows:
mult_su3_mat_vec((su3_matrix *)(a), (su3_vector *)(b), (su3_vector *)(c));
mult_su3_mat_vec((su3_matrix *)(a+j*18), (su3_vector *)(b+j*6), (su3_vector *)(c+j*6));
j is the loop index.
STRIDE values:
mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));
N is the array length.
prefetch_matrix(a+((j+1)*STRIDE)%N);
prefetch_vector(b+((j+1)*STRIDE)%N);
mult_su3_mat_vec((su3_matrix *)(a+(j*STRIDE)%N), (su3_vector *)(b+(j*STRIDE)%N), (su3_vector *)(c+(j*STRIDE)%N));
prefetchnta instruction.
map1[j]:
mult_su3_mat_vec((su3_matrix *)(a+map1[j]), (su3_vector *)(b+map1[j]), (su3_vector *)(c+map1[j]));
(a+map1[j]) is equivalent to &(a[map1[j]])
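The addressing patterns above can be tried out with a self-contained sketch. Everything here is a stand-in: the su3_matrix and su3_vector layouts (18 and 6 floats, matching the j*18 and j*6 offsets in the text), the plain C mult_su3_mat_vec, and the run_sequential, run_strided and run_mapped helpers are assumptions for illustration and are not copied from MILC or QCDSTREAM. The pairing of each loop with the Sequential, Strided and Mapped rows of the tables is inferred from the order of the snippets.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; names follow the text, layouts are assumptions. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;  /* 18 floats */
typedef struct { complex_f c[3]; } su3_vector;     /* 6 floats */

/* Plain C stand-in kernel: c = a * b for a 3x3 complex matrix. */
static void mult_su3_mat_vec(const su3_matrix *a, const su3_vector *b,
                             su3_vector *c)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int k = 0; k < 3; k++) {
            re += a->e[i][k].real * b->c[k].real
                - a->e[i][k].imag * b->c[k].imag;
            im += a->e[i][k].real * b->c[k].imag
                + a->e[i][k].imag * b->c[k].real;
        }
        c->c[i].real = re;
        c->c[i].imag = im;
    }
}

/* Sequential: operand j starts j whole matrices (or vectors) past the base. */
void run_sequential(float *a, float *b, float *c, size_t count)
{
    for (size_t j = 0; j < count; j++)
        mult_su3_mat_vec((su3_matrix *)(a + j * 18),
                         (su3_vector *)(b + j * 6),
                         (su3_vector *)(c + j * 6));
}

/* Strided: the same float offset (j*stride) % n is applied to all three
 * arrays, as in the text. The caller must leave room for one whole operand
 * past the largest offset (18 floats in a, 6 floats in b and c). */
void run_strided(float *a, float *b, float *c,
                 size_t stride, size_t n, size_t count)
{
    assert(n > 0);
    for (size_t j = 0; j < count; j++) {
        size_t off = (j * stride) % n;
        mult_su3_mat_vec((su3_matrix *)(a + off),
                         (su3_vector *)(b + off),
                         (su3_vector *)(c + off));
    }
}

/* Mapped: float offsets are read from an index array. */
void run_mapped(float *a, float *b, float *c,
                const size_t *map1, size_t count)
{
    for (size_t j = 0; j < count; j++)
        mult_su3_mat_vec((su3_matrix *)(a + map1[j]),
                         (su3_vector *)(b + map1[j]),
                         (su3_vector *)(c + map1[j]));
}
```

In every variant the arithmetic per call is identical; only the addresses handed to the kernel change, which is why the rates in the tables differ so sharply between the in-cache, sequential, and strided or mapped rows.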
A sample extracted from the same Xeon run:
Function Rate (MFlop/s) Mean time Min time Max time
-------- -------------- --------- -------- --------
MILC MatHWVec
In Cache: 781.6 +/- 0.2 0.0375 +/- 0.00001 0.0375 0.0375
Sequential: 708.8 +/- 0.3 0.0414 +/- 0.00002 0.0414 0.0414
Strided: 237.0 +/- 0.0 0.1237 +/- 0.00002 0.1237 0.1238
PF+Strided: 248.1 +/- 0.1 0.1182 +/- 0.00003 0.1182 0.1183
Mapped: 227.4 +/- 0.0 0.1290 +/- 0.00002 0.1290 0.1290
PF+Mapped: 226.9 +/- 0.0 0.1293 +/- 0.00003 0.1292 0.1293
SSE MatHWVec
In Cache: 2943.4 +/- 1.5 0.0100 +/- 0.00001 0.0100 0.0100
Sequential: 1207.9 +/- 1.0 0.0243 +/- 0.00002 0.0243 0.0243
Strided: 285.9 +/- 9.9 0.1026 +/- 0.00354 0.0984 0.1054
PF+Strided: 292.4 +/- 0.1 0.1003 +/- 0.00002 0.1003 0.1004
Mapped: 342.2 +/- 0.1 0.0857 +/- 0.00002 0.0857 0.0858
PF+Mapped: 333.6 +/- 0.1 0.0879 +/- 0.00002 0.0879 0.0880
A sample extracted from the same Xeon run:
Function Rate (MFlop/s) Mean time Min time Max time
-------- -------------- --------- -------- --------
MILC MatMat
In Cache: 838.5 +/- 0.1 0.0525 +/- 0.00001 0.0525 0.0525
Sequential: 757.1 +/- 0.3 0.0581 +/- 0.00002 0.0581 0.0581
Strided: 311.8 +/- 0.0 0.1411 +/- 0.00002 0.1411 0.1411
PF+Strided: 318.1 +/- 2.7 0.1383 +/- 0.00118 0.1358 0.1389
Mapped: 289.7 +/- 0.2 0.1519 +/- 0.00011 0.1518 0.1522
PF+Mapped: 291.2 +/- 1.5 0.1511 +/- 0.00076 0.1507 0.1528
SSE MatMat
In Cache: 2562.1 +/- 1.0 0.0172 +/- 0.00001 0.0172 0.0172
Sequential: 1382.8 +/- 1.2 0.0318 +/- 0.00003 0.0318 0.0319
Strided: 328.9 +/- 0.1 0.1338 +/- 0.00004 0.1337 0.1338
PF+Strided: 332.2 +/- 0.1 0.1324 +/- 0.00003 0.1324 0.1325
Mapped: 409.3 +/- 0.1 0.1075 +/- 0.00003 0.1074 0.1076
PF+Mapped: 402.1 +/- 0.0 0.1094 +/- 0.00001 0.1094 0.1094
make all. Note that the SSE instructions
will cause "illegal instruction" errors on non-patched 2.2.x kernels (2.4.x
kernels all support SSE), and on early Athlon processors. See
this discussion for
details and a link to a patch for 2.2.x kernels.
A new version of QCDSTREAM code is available here. This version includes measurements of fully vectorized code, as well as measurements of SSE-assisted reads and writes.