Inline SSE MILC Math Routines

NASM to Inline GCC Translator, nasm2c.pl
Performance of the inline routines
Obtaining and using the inline routines
Optimizing MILC Math Routines with SSE
Catalog of MILC Math Routines

Code written in NASM assembler tends to be easy to read, and the programmer can readily add documentation. Object files generated using nasm may be linked with C object modules generated by many compilers, including gcc, pgcc (Portland Group), and icc (Intel C++ compiler). However, all routines must be implemented as subroutines, so every use pays the overhead of a function call.

NASM to Inline GCC Translator

We've implemented a nasm-to-inline-gcc translator, nasm2c.pl, which generates gcc assembler macros from the NASM source code. For this translator to succeed, the following conventions must be used in the NASM source:

All push, mov, pop, add, and ret operations must be associated solely with the stack handling operations used to reference arguments. Lines with these codes are deleted from the inline macros.
Any reference to memory (i.e., any use of the NASM [...] construct) must carry, in the comment field on the same line, a construct delimited by <...> that gives the C-language expression for the referenced address, written in terms of the macro arguments. For example, if the inline macro definition is
```
     #define _inline_sse_mult_su3_nn(aa,bb,cc)
```
where (aa, bb, cc) are of type (su3_matrix *), a possible source line containing a reference construct would be:
```
     movss  xmm3,[eax]       ; <(aa)->e[0][0].real>
```
Here, in the NASM version, [eax] dereferences the first argument to mult_su3_nn(a,b,c), which is passed on the stack. In the translated inline version, no stack is used to pass arguments, so the code references a->e[0][0].real directly.
In the current version, all labels are ignored, and branches will not work.
By default, three macro arguments aa, bb, cc are assumed. To override this, pass the argument list when invoking nasm2c.pl, e.g.:
```
   ./nasm2c.pl sse_routine.nas "aa,bb0,bb1,bb2,bb3,cc"
```
Note that there should be corresponding <...> reference constructs in your NASM code.
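
The reference-construct convention can be illustrated with a small C sketch that pulls the C expression out of a NASM source line. This is not code from nasm2c.pl (which is a Perl script); the function name `extract_c_ref` and its interface are hypothetical, and the only thing assumed about the input is the `; <...>` comment form described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text between '<' and the final '>' of a NASM comment into out.
 * Returns 1 on success, 0 if the line carries no reference construct.
 * Hypothetical helper for illustration only. */
int extract_c_ref(const char *line, char *out, size_t outsz)
{
    const char *comment, *open, *close;
    size_t len;

    assert(line != NULL && out != NULL);

    comment = strchr(line, ';');          /* NASM comments start at ';' */
    if (comment == NULL) return 0;
    open = strchr(comment, '<');
    if (open == NULL) return 0;
    /* Take the LAST '>' because the C expression may itself contain "->". */
    close = strrchr(open, '>');
    if (close == NULL || close <= open + 1) return 0;

    len = (size_t)(close - open - 1);
    if (len + 1 > outsz) return 0;        /* not enough room for the result */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 1;
}
```

Applied to the movss line shown earlier, this yields the string `(aa)->e[0][0].real`, which is the expression that ends up as an "m" operand in the generated macro. Lines with no comment or no `<...>` construct return 0.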

As an example, here is the NASM version of add_su3_vector:

```
;
; sse_add_su3_vector( su3_vector *a, su3_vector *b, su3_vector *c)
;
;

global sse_add_su3_vector
sse_add_su3_vector:
        push            ebp
        mov             ebp,esp
        push            eax
        push            ebx
        push            ecx
        mov             eax,[ebp+8]                     ; su3_vector *a
        mov             ebx,[ebp+12]                    ; su3_vector *b
        mov             ecx,[ebp+16]                    ; su3_vector *c

        movups          xmm0,[eax]                      ;                       <(aa)->c[0]>
        movlps          xmm1,[eax+16]                   ;                       <(aa)->c[2]>
        shufps          xmm1,xmm1,0x44
        movups          xmm2,[ebx]                      ;                       <(bb)->c[0]>
        movlps          xmm3,[ebx+16]                   ;                       <(bb)->c[2]>
        shufps          xmm3,xmm3,0x44
        addps           xmm0,xmm2
        addps           xmm1,xmm3

        movups          [ecx],xmm0                      ;                       <(cc)->c[0]>
        movlps          [ecx+16],xmm1                   ;                       <(cc)->c[2]>

here:   pop     ecx
        pop     ebx
        pop     eax
        mov     esp,ebp
        pop     ebp
        ret
```

The translated version, generated by

```
   ./nasm2c.pl sse_addvec.nas > sse_addvec.h
```

looks like the following:

```
#define _inline_sse_add_su3_vector(aa,bb,cc) \
{ \
__asm__ __volatile__ ("movups %0, %%xmm0 \n\t" \
                      "movlps %1, %%xmm1 \n\t" \
                      "shufps $0x44, %%xmm1, %%xmm1 \n\t" \
                      "movups %2, %%xmm2 \n\t" \
                      "movlps %3, %%xmm3 \n\t" \
                      "shufps $0x44, %%xmm3, %%xmm3 \n\t" \
                      "addps %%xmm2, %%xmm0 \n\t" \
                      "addps %%xmm3, %%xmm1 \n\t" \
                      : \
                      : \
                      "m" ((aa)->c[0]), \
                      "m" ((aa)->c[2]), \
                      "m" ((bb)->c[0]), \
                      "m" ((bb)->c[2])); \
__asm__ __volatile__ ("movups %%xmm0, %0 \n\t" \
                      "movlps %%xmm1, %1 \n\t" \
                      : \
                      "=m" ((cc)->c[0]), \
                      "=m" ((cc)->c[2])); \
}
```
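
To make clear what the assembly computes, here is a plain C sketch of the same arithmetic, which could also serve as a reference when checking the macro's output. The struct layouts are assumptions inferred from the reference constructs and the 16-byte offset to c[2] (three complex entries of two 4-byte floats each); the names `complex_f`, `imag`, and `add_su3_vector_ref` are hypothetical and not taken from the MILC headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layouts: 2 floats per complex entry, 3 entries per vector. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f c[3]; } su3_vector;

/* c = a + b, component by component: six float additions in total,
 * matching the FP Ops count of 6 listed for add_su3_vector. */
void add_su3_vector_ref(const su3_vector *a, const su3_vector *b, su3_vector *c)
{
    int i;

    assert(a != NULL && b != NULL && c != NULL);
    for (i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}
```

In the SSE version, the movups instructions handle c[0] and c[1] together (four floats), and the movlps instructions handle c[2] (two floats). The macro itself is invoked on pointers in the same way, for example `_inline_sse_add_su3_vector(&a, &b, &c);`.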

Performance of the inline routines

The table below shows the routines currently with inline SSE implementations, along with their cycle timings on Pentium III, Pentium 4, and Athlon MP chips. The subroutines in red are used heavily in the improved staggered D-slash routine. The "MILC" column gives timings using C codes, and the "SSE" column gives timings using inline SSE codes.

*Pentium III (Coppermine), Pentium IV, and Athlon MP Inline Timings*
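The MILC and SSE columns are cycle counts.

| Routine | FP Ops | Pentium III MILC | Pentium III SSE | Pentium III MFlops/GHz | Pentium IV MILC | Pentium IV SSE | Pentium IV MFlops/GHz | Athlon MP MILC | Athlon MP SSE | Athlon MP MFlops/GHz |
|---|---|---|---|---|---|---|---|---|---|---|
| mult_su3_mat_vec | 66 | 148 | 82 | 805 | 124 | 57 | 1158 | 113 | 92 | 717 |
| mult_adj_su3_mat_vec | 66 | 145 | 82 | 805 | 121 | 57 | 1158 | 110 | 92 | 717 |
| mult_su3_mat_vec_sum_4dir | 282 | 629 | 348 | 810 | 598 | 249 | 1133 | 533 | 372 | 758 |
| mult_adj_su3_mat_vec_4dir | 264 | 502 | 279 | 946 | 530 | 320 | 825 | 371 | 317 | 833 |
| mult_adj_su3_mat_vec_4vec | 264 | 509 | 324 | 815 | 534 | 228 | 1158 | 383 | 371 | 712 |
| mult_adj_su3_mat_hwvec | 132 | 300 | 106 | 1245 | 268 | 73 | 1808 | 192 | 120 | 1100 |
| mult_su3_mat_hwvec | 132 | 300 | 106 | 1245 | 268 | 73 | 1808 | 194 | 120 | 1100 |
| mult_su3_nn | 198 | 440 | 188 | 1053 | 368 | 139 | 1424 | 362 | 209 | 947 |
| mult_su3_na | 198 | 432 | 195 | 1015 | 422 | 135 | 1467 | 357 | 217 | 912 |
| mult_su3_an | 198 | 449 | 188 | 1053 | 414 | 130 | 1523 | 300 | 208 | 952 |
| scalar_mult_add_su3_matrix | 36 | 176 | 48 | 750 | 308 | 62 | 581 | 231 | 46 | 783 |
| scalar_mult_add_su3_vector | 12 | 35 | 15 | 800 | 35 | 18 | 667 | 24 | 16 | 750 |
| add_su3_vector | 6 | 27 | 13 | 462 | 26 | 19 | 316 | 21 | 13 | 462 |
| sub_four_su3_vecs | 24 | 73 | 34 | 706 | 113 | 47 | 511 | 57 | 33 | 727 |
| su3_projector | 54 | 182 | 82 | 659 | 204 | 77 | 701 | 167 | 83 | 651 |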

In the table above, we've included columns labelled FP Ops and MFlops/GHz. The former gives the number of floating point operations necessary to perform the calculation. Multiply the values in the MFlops/GHz column by the clock speed in GHz of the corresponding processor to obtain the MFlop/sec performance with the inlined SSE codes. For example, mult_su3_mat_hwvec on a 2 GHz Pentium 4 processor will perform at 2 × 1808 = 3616 MFlop/sec.
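
The MFlops/GHz entries in the table are consistent with 1000 × FP Ops / SSE cycles, that is, floating point operations per clock cycle scaled to MFlops per GHz. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not part of the distributed timing programs.

```c
#include <assert.h>

/* MFlops per GHz: floating point operations per cycle, scaled by 1000. */
double mflops_per_ghz(int fp_ops, int cycles)
{
    assert(cycles > 0);
    return 1000.0 * fp_ops / cycles;
}

/* Sustained MFlop/sec at a given clock speed in GHz. */
double mflops_per_sec(int fp_ops, int cycles, double clock_ghz)
{
    return mflops_per_ghz(fp_ops, cycles) * clock_ghz;
}
```

For mult_su3_mat_hwvec on the Pentium 4 (132 operations in 73 cycles), `mflops_per_ghz(132, 73)` is about 1808, which matches the table entry, and `mflops_per_sec(132, 73, 2.0)` is about 3616.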

Obtaining and using the inline routines

A tar file containing the NASM source code for these routines, nasm2c.pl, a makefile, regression testing and timing programs, and header files is available here. The header file inline_sse.h should be included in any source file in which you want to use these inline routines. You'll either need to edit all invocations of the corresponding MILC math routines to add _inline_sse_ prefixes, or use #define SSE_SUBS to turn on macros in inline_sse.h which will transparently substitute the inline versions.

See also the README_SSE file, and the inline_sse.h header file.
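
The SSE_SUBS switch can be pictured with the sketch below. It only illustrates the substitution mechanism: the actual contents of inline_sse.h are not reproduced here, the struct layouts and the name `complex_f` are assumptions, the inline macro is replaced by a plain C stand-in so the sketch builds on any machine, and the `c_calls` counter exists only to show which path a call takes.

```c
#include <assert.h>
#include <stddef.h>

#define SSE_SUBS   /* must be defined before the substitution macros are seen */

typedef struct { float real, imag; } complex_f;   /* assumed layout */
typedef struct { complex_f c[3]; } su3_vector;    /* assumed layout */

int c_calls = 0;   /* counts calls to the out-of-line C routine */

/* Stand-in for the ordinary C library routine. */
void add_su3_vector(su3_vector *a, su3_vector *b, su3_vector *c)
{
    int i;

    assert(a != NULL && b != NULL && c != NULL);
    c_calls++;
    for (i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}

/* Plain C stand-in for the inline SSE macro (the real one is gcc inline asm). */
#define _inline_sse_add_su3_vector(aa,bb,cc) \
{ \
    int i_; \
    for (i_ = 0; i_ < 3; i_++) { \
        (cc)->c[i_].real = (aa)->c[i_].real + (bb)->c[i_].real; \
        (cc)->c[i_].imag = (aa)->c[i_].imag + (bb)->c[i_].imag; \
    } \
}

/* Hypothetical form of the substitution that SSE_SUBS switches on. */
#ifdef SSE_SUBS
#define add_su3_vector(a,b,c) _inline_sse_add_su3_vector(a,b,c)
#endif
```

With SSE_SUBS defined, an unmodified call such as `add_su3_vector(&a, &b, &c);` expands to the inline macro and never reaches the C routine. Without it, the same call goes to the C routine as before, so existing sources need no edits either way.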

Don Holmgren

Last Modified: 28th Jan 2002