The various SU3 matrix and vector operations performed by MILC are promising
candidates for vectorization. We'll use the matrix-vector multiplication
routine mult_su3_mat_vec (see the m_matvec.c source module) as an example.
The source code for the SSE version of mult_su3_mat_vec is available
here. The strategy of the SSE code is to load the b vector into xmm0-2,
preshuffled so as to line up real and imaginary parts for multiplication with
each row of the matrix. To understand the code, we'll start with the original
C version, then look at the SSE prototype and a typical call.
C Version
Here's the core C code from the MILC library:
void mult_su3_mat_vec( su3_matrix *a, su3_vector *b, su3_vector *c ){
    int i;
    register float t,ar,ai,br,bi,cr,ci;
    for(i=0;i<3;i++){
        /* c->c[i] = sum over j of a->e[i][j]*b->c[j]; each complex
           product (ar+ai*I)*(br+bi*I) is expanded into real arithmetic:
           real part ar*br - ai*bi, imaginary part ar*bi + ai*br */
        ar=a->e[i][0].real; ai=a->e[i][0].imag;
        br=b->c[0].real;    bi=b->c[0].imag;
        cr=ar*br; t=ai*bi; cr -= t;
        ci=ar*bi; t=ai*br; ci += t;

        ar=a->e[i][1].real; ai=a->e[i][1].imag;
        br=b->c[1].real;    bi=b->c[1].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        ar=a->e[i][2].real; ai=a->e[i][2].imag;
        br=b->c[2].real;    bi=b->c[2].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        c->c[i].real=cr;
        c->c[i].imag=ci;
    }
}
The defined types are:
typedef struct {    /* standard complex number declaration for single- */
    float real;     /* precision complex numbers                       */
    float imag;
} complex;

typedef struct { complex c[3]; } su3_vector;
typedef struct { complex e[3][3]; } su3_matrix;
SSE Version
For writing SSE code, we've used the NASM assembler, available from
http://sourceforge.net/projects/nasm. NASM has a very clean
syntax, with operand ordering matching Intel's documentation.
void sse_mult_su3_mat_vec(su3_matrix *, su3_vector *, su3_vector *);
su3_matrix a;
su3_vector b, c;
sse_mult_su3_mat_vec( &a, &b, &c);
global sse_mult_su3_mat_vec
sse_mult_su3_mat_vec:
        push ebp
        mov  ebp,esp
        push eax
        push ebx
        push ecx
        mov  eax,[ebp+8]    ; su3_matrix *a
        mov  ebx,[ebp+12]   ; su3_vector *b
        mov  ecx,[ebp+16]   ; su3_vector *c
Note the declaration of the global entry point, which allows the object module
from this SSE code to be linked with the caller's C code. The addresses of the
three arguments are loaded into registers eax, ebx, and ecx after first saving
the contents of those registers on the stack.
        ; bring in real and imaginary b vector
        movups xmm0,[ebx]       ; c1i,c1r,c0i,c0r
        movaps xmm1,xmm0
        shufps xmm1,xmm1,0xB1   ; c1r,c1i,c0r,c0i
        movhps xmm2,[ebx+16]    ; c2i,c2r,x,x
        shufps xmm2,xmm2,0xEB   ; c2i,c2r,c2r,c2i
        ; xmm0: c1i, c1r, c0i, c0r
        ; xmm1: c1r, c1i, c0r, c0i
        ; xmm2: c2i, c2r, c2r, c2i
The comments here label the real and imaginary parts of a generic set of
complex operands, c0, c1, and c2. The movups and movaps operations move data
from memory to the xmm registers, and between the registers. movaps can be
used whenever the operands are 16-byte aligned, or when moving between
registers; movups is slower, but does not require alignment.

shufps is used to shuffle the 32-bit subfields of the SSE operands. Refer to
the Intel Instruction Set Reference Manual for details on the bit field taken
by shufps to specify the movement of subfields. In this case, from the
comments we see that shufps was used twice to arrange the three complex pairs
of the SU3 vector into three xmm registers, with both real-imaginary and
imaginary-real orderings.
Next, the code brings in the first row of the matrix into xmm3-4, with shuffling used as shown by the comment:
        ; bring in first row of a matrix
        movups xmm3,[eax]       ; c1i,c1r,c0i,c0r
        movups xmm4,[eax+8]     ; c2i,c2r,c1i,c1r
        shufps xmm4,xmm4,0xEE   ; c2i,c2r,c2i,c2r
        ; xmm3: a1i, a1r, a0i, a0r
        ; xmm4: a2i, a2r, a2i, a2r
Three four-wide multiplies are performed with the mulps
instruction to form the various real.real, imag.imag, real.imag, and imag.real
combinations of the two vectors:
        movaps xmm5,xmm3
        mulps  xmm3,xmm0
        mulps  xmm5,xmm1
        mulps  xmm4,xmm2
        ; xmm3: c1i.i, c1r.r, c0i.i, c0r.r
        ; xmm4: c2i.i, c2r.r, c2i.r, c2r.i
        ; xmm5: c1i.r, c1r.i, c0i.r, c0r.i
Next, shuffles are used to prepare the multiplied pairs for vectorized summing:
        movaps xmm6,xmm5
        shufps xmm6,xmm3,0x4E
        shufps xmm5,xmm3,0xE4
        ; xmm4: c2i.i, c2r.r, c2i.r, c2r.i
        ; xmm5: c1i.i, c1r.r, c0i.r, c0r.i
        ; xmm6: c0i.i, c0r.r, c1i.r, c1r.i
Finally, a pair of four-wide adds forms the unsummed real and imaginary pieces of the result of multiplying the first row of the matrix by the vector:
        addps xmm4,xmm5
        addps xmm4,xmm6
        ; xmm4: i-sum, r-sum, i.r-sum, r.i-sum  [0]

Next, analogous code multiplies the second row of the matrix by the vector,
storing the unsummed real and imaginary pieces into xmm7. By use of shufps,
the unsummed pieces in xmm4 and xmm7 can be rearranged to optimize the
required addition:
        shufps xmm5,xmm7,0x22
        shufps xmm4,xmm7,0x77
        ; xmm4: i.r-sum[1], i-sum[1], i.r-sum[0], i-sum[0]
        ; xmm5: r.i-sum[1], r-sum[1], r.i-sum[0], r-sum[0]
        xorps  xmm4,[negate]
        addps  xmm5,xmm4
        movups [ecx],xmm5
Here, the negate
bit mask is used to multiply the imag.imag
pieces by -1
before summing with the real.real pieces:
        align 16
negate: dd 0x80000000
        dd 0x00000000
        dd 0x80000000
        dd 0x00000000

The final result of multiplying the vector by the first two rows, a pair of
complex numbers, is stored with a single movups operation.
What remains is the multiplication of the third row of the matrix with the
vector. This is identical to the prior row by vector multiplies, with the
exception that a movhps
operation is used at the end to move the
single complex number to the third position of the result SU3 vector (see the
code).
At the end of the routine, the original values of the registers used to access
the operands are reloaded from the stack, and a ret
is used to
return to the calling program:
        pop ecx
        pop ebx
        pop eax
        mov esp,ebp
        pop ebp
        ret
The test_mul.c code can be used to test the correctness of the SSE routine,
and to time the speeds of the MILC C and SSE versions. It initializes the
a and b operands with random numbers, then in a loop calls the MILC and SSE
versions iter times, as specified by the user. The rdtscll macro from
/usr/include/asm/msr.h is used to access the Pentium cycle counter.
The output of test_mul.c shows the matrix and vector operands, and the
results from the C and SSE codes, as well as the number of cycles used by
each routine. Running in the real-time queue (which requires running as
root) is necessary for accurate timings.
A sample run, with one million iterations:
qcdhome:~/version5/libraries$ su -c './test_mul 1000000'
Password:
 0.4+1.6i  1.7+0.9i  1.6-1.6i     -0.5-1.4i     4.2-1.4i   4.2-1.4i
 1.7+0.1i  0.3+0.2i -2.0-1.1i      2.0-2.0i     2.7-0.4i   2.7-0.4i
-1.4+0.7i -2.0+0.1i -2.0-0.9i     -1.3-0.4i     0.2+7.7i   0.2+7.7i
Time per iteration:
  MILC: 133
  SSE:  103
                              Pentium III      Pentium 4       Athlon MP
Routine                       MILC    SSE     MILC    SSE     MILC    SSE
mult_su3_mat_vec               131    102      162     96      100    101
mult_adj_su3_mat_vec           128     98      116    101       99    100
mult_su3_mat_vec_sum_4dir      599    369      603    368      506    380
mult_adj_su3_mat_vec_4dir      500    304      608    362      411    332
mult_adj_su3_mat_hwvec         297    178      276    185      199    182
mult_su3_mat_hwvec             297    184      276    176      199    189
mult_su3_nn                    421    257      399    251      405    270
mult_su3_na                    400    255      386    253      414    270
scalar_mult_add_su3_matrix     142     64      152     74      139     57
su3_projector                  222    107      234    102      258     99
To convert the results in this table to MFlop ratings, multiply the number of
floating point operations by the CPU clock speed and divide by the listed
cycle count. For example, an SU3 matrix-matrix multiply requires 108 multiplies
and 90 adds. On a 1.4 GHz Pentium 4 processor, the MFlop rating for the
mult_su3_nn routine is:

    (108 + 90 Flop) * (1400 Mcycles/sec) / (251 cycles) = 1104 MFlop/sec
Using GCC Instead of NASM
One disadvantage of using NASM for SSE routines is that writing inline
functions is not possible. Recent versions of the GNU binutils
(2.11.2) and gcc (2.95.3) packages allow use of inlined SSE code.
For detailed examples, see Martin Luescher's
code and
SSE-HOWTO document. Look at a sample of his
macros in sse.h to get the flavor of inlining
assembler in GCC.
New: See the inline version discussion
for information about a NASM-to-inline-gcc converter.
Kernel Requirements
The operating system needs to be aware of the SSE/SSE2 register state, since
when switching processes this state is part of a given program's context. In
the stable Linux kernels, only the 2.4.x series supports SSE/SSE2 by default.
For earlier kernels, you'll have to apply a patch. For 2.2.18, we've used
Andrea Arcangeli's PIII-7 patch, available from ftp.kernel.org.