Inline SSE MILC Math Routines

Code written in NASM assembler tends to be easy to read, and the programmer can readily add documentation. Object files generated by NASM may be linked with C object modules generated by many compilers, including gcc, pgcc (Portland Group), and icc (Intel C++ compiler). However, every NASM routine must be implemented as a subroutine, so each use pays the overhead of a function call.

NASM to Inline GCC Translator

We've implemented a NASM-to-inline-gcc translator, nasm2c.pl, which generates gcc inline assembler macros from the NASM source code. For this translator to succeed, the following conventions must be used in the NASM source code: As an example, here is the NASM version of add_su3_vector:
;
; sse_add_su3_vector( su3_vector *a, su3_vector *b, su3_vector *c)
; 
;

global sse_add_su3_vector
sse_add_su3_vector:
        push            ebp
        mov             ebp,esp
        push            eax
        push            ebx
        push            ecx
        mov             eax,[ebp+8]                     ; su3_vector *a
        mov             ebx,[ebp+12]                    ; su3_vector *b
        mov             ecx,[ebp+16]                    ; su3_vector *c

        movups          xmm0,[eax]                      ;                       <(aa)->c[0]>
        movlps          xmm1,[eax+16]                   ;                       <(aa)->c[2]>
        shufps          xmm1,xmm1,0x44
        movups          xmm2,[ebx]                      ;                       <(bb)->c[0]>
        movlps          xmm3,[ebx+16]                   ;                       <(bb)->c[2]>
        shufps          xmm3,xmm3,0x44
        addps           xmm0,xmm2
        addps           xmm1,xmm3

        movups          [ecx],xmm0                      ;                       <(cc)->c[0]>
        movlps          [ecx+16],xmm1                   ;                       <(cc)->c[2]>

here:   pop     ecx
        pop     ebx
        pop     eax
        mov     esp,ebp
        pop     ebp
        ret
The translated version, generated by
   ./nasm2c.pl sse_addvec.nas > sse_addvec.h
looks like the following:
#define _inline_sse_add_su3_vector(aa,bb,cc) \
{ \
__asm__ __volatile__ ("movups %0, %%xmm0 \n\t" \
                      "movlps %1, %%xmm1 \n\t" \
                      "shufps $0x44, %%xmm1, %%xmm1 \n\t" \
                      "movups %2, %%xmm2 \n\t" \
                      "movlps %3, %%xmm3 \n\t" \
                      "shufps $0x44, %%xmm3, %%xmm3 \n\t" \
                      "addps %%xmm2, %%xmm0 \n\t" \
                      "addps %%xmm3, %%xmm1 \n\t" \
                      : \
                      : \
                      "m" ((aa)->c[0]), \
                      "m" ((aa)->c[2]), \
                      "m" ((bb)->c[0]), \
                      "m" ((bb)->c[2])); \
__asm__ __volatile__ ("movups %%xmm0, %0 \n\t" \
                      "movlps %%xmm1, %1 \n\t" \
                      : \
                      "=m" ((cc)->c[0]), \
                      "=m" ((cc)->c[2])); \
}
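
For reference, both listings perform a component-wise sum of two three-component complex vectors, six single-precision additions in all. The sketch below is a plain C equivalent, not the MILC source: the struct definitions are assumptions inferred from the offsets in the assembly (c[2] sits at byte 16, so each complex entry is two 4-byte floats), and the names fcomplex and add_su3_vector_ref are hypothetical.

```c
#include <assert.h>

/* Assumed layout: 3 complex floats, 24 bytes in total, matching the
 * [eax] and [eax+16] loads in the NASM listing. */
typedef struct { float real, imag; } fcomplex;
typedef struct { fcomplex c[3]; } su3_vector;

/* Hypothetical reference routine: c = a + b, component by component. */
void add_su3_vector_ref(const su3_vector *a, const su3_vector *b,
                        su3_vector *c)
{
    /* Guard the layout assumption the assembly relies on. */
    assert(sizeof(su3_vector) == 24);

    for (int i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}
```

A routine like this is also a convenient oracle when regression-testing the inline version: run both on the same inputs and compare the six output floats.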

Performance of the inline routines

The table below shows the routines currently with inline SSE implementations, along with their cycle timings on Pentium III, Pentium 4, and Athlon MP chips. The subroutines in red are used heavily in the improved staggered D-slash routine. The "MILC" column gives timings using C codes, and the "SSE" column gives timings using inline SSE codes.

Pentium III (Coppermine), Pentium 4, and Athlon MP Inline Timings (MILC and SSE columns in cycles)

                                    Pentium III            Pentium 4              Athlon MP
Routine                     FP Ops  MILC  SSE  MFlops/GHz  MILC  SSE  MFlops/GHz  MILC  SSE  MFlops/GHz
mult_su3_mat_vec                66   148   82         805   124   57        1158   113   92         717
mult_adj_su3_mat_vec            66   145   82         805   121   57        1158   110   92         717
mult_su3_mat_vec_sum_4dir      282   629  348         810   598  249        1133   533  372         758
mult_adj_su3_mat_vec_4dir      264   502  279         946   530  320         825   371  317         833
mult_adj_su3_mat_vec_4vec      264   509  324         815   534  228        1158   383  371         712
mult_adj_su3_mat_hwvec         132   300  106        1245   268   73        1808   192  120        1100
mult_su3_mat_hwvec             132   300  106        1245   268   73        1808   194  120        1100
mult_su3_nn                    198   440  188        1053   368  139        1424   362  209         947
mult_su3_na                    198   432  195        1015   422  135        1467   357  217         912
mult_su3_an                    198   449  188        1053   414  130        1523   300  208         952
scalar_mult_add_su3_matrix      36   176   48         750   308   62         581   231   46         783
scalar_mult_add_su3_vector      12    35   15         800    35   18         667    24   16         750
add_su3_vector                   6    27   13         462    26   19         316    21   13         462
sub_four_su3_vecs               24    73   34         706   113   47         511    57   33         727
su3_projector                   54   182   82         659   204   77         701   167   83         651

In the table above, we've included columns labelled FP Ops and MFlops/GHz. The former gives the number of floating point operations necessary to perform the calculation. The latter is the FP Ops count divided by the SSE cycle count, scaled by 1000 (for mult_su3_mat_vec on the Pentium III, 66 / 82 x 1000 = 805). Multiply the values in the MFlops/GHz column by the clock speed in GHz of the corresponding processor to obtain the MFlop/sec performance with the inlined SSE codes. For example, mult_su3_mat_hwvec on a 2 GHz Pentium 4 processor will perform at 2 x 1808 = 3616 MFlop/sec.
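
The arithmetic behind these two columns can be sketched in C. The function names mflops_per_ghz and mflops_per_sec are illustrative and not part of the MILC or inline SSE sources; the factor of 1000 is inferred from the tabulated values, which agree with 1000 x FP Ops / SSE cycles after rounding.

```c
#include <assert.h>

/* Hypothetical helper: MFlops delivered per GHz of clock, given the
 * floating point operation count of a routine and its cycle count.
 * A 1 GHz clock supplies 1000 cycles per microsecond, hence the 1000. */
double mflops_per_ghz(int fp_ops, int cycles)
{
    assert(cycles > 0);
    return 1000.0 * fp_ops / cycles;
}

/* Hypothetical helper: sustained MFlop/sec at a given clock speed. */
double mflops_per_sec(int fp_ops, int cycles, double clock_ghz)
{
    return mflops_per_ghz(fp_ops, cycles) * clock_ghz;
}
```

Calling mflops_per_ghz(132, 73) reproduces the Pentium 4 entry for mult_su3_mat_hwvec (about 1808), and mflops_per_sec(132, 73, 2.0) gives roughly 3616 MFlop/sec, in line with the worked example.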

Obtaining and using the inline routines

A tar file containing the NASM source code for these routines, nasm2c.pl, a makefile, regression testing and timing programs, and header files is available here. The header file inline_sse.h should be included in any source file in which you want to use these inline routines. You'll either need to edit all invocations of the corresponding MILC math routines to add _inline_sse_ prefixes, or use #define SSE_SUBS to turn on macros in inline_sse.h which will transparently substitute the inline versions.
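
The SSE_SUBS mechanism can be pictured as a set of function-like macros that redirect the ordinary routine names to their inline counterparts. The sketch below is an assumption about how such a substitution might look, not the actual contents of inline_sse.h. The types are simplified stand-ins, check_substitution is a hypothetical test function, and the inline macro here is a plain C placeholder for the generated assembly so that the example builds on any platform.

```c
#include <assert.h>

typedef struct { float real, imag; } fcomplex;   /* stand-in type */
typedef struct { fcomplex c[3]; } su3_vector;    /* stand-in type */

/* Ordinary subroutine version (stand-in for the MILC C routine). */
void add_su3_vector(su3_vector *a, su3_vector *b, su3_vector *c)
{
    for (int i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}

/* Placeholder for the generated macro; the real one expands to SSE asm. */
#define _inline_sse_add_su3_vector(aa, bb, cc) \
{ \
    int i_; \
    for (i_ = 0; i_ < 3; i_++) { \
        (cc)->c[i_].real = (aa)->c[i_].real + (bb)->c[i_].real; \
        (cc)->c[i_].imag = (aa)->c[i_].imag + (bb)->c[i_].imag; \
    } \
}

/* Assumed usage: defined by the user ahead of the header's macros. */
#define SSE_SUBS

#ifdef SSE_SUBS
/* Call sites written as add_su3_vector(a, b, c) now expand inline. */
#define add_su3_vector(a, b, c) _inline_sse_add_su3_vector(a, b, c)
#endif

/* Hypothetical check: the call site is the same with or without SSE_SUBS. */
int check_substitution(void)
{
    su3_vector a = {{{1, 2}, {3, 4}, {5, 6}}};
    su3_vector b = {{{10, 20}, {30, 40}, {50, 60}}};
    su3_vector c;

    add_su3_vector(&a, &b, &c);   /* expands to the inline macro here */
    assert(c.c[0].real == 11.0f && c.c[2].imag == 66.0f);
    return 1;
}
```

With this pattern, removing the SSE_SUBS definition leaves every call site untouched and falls back to the subroutine version, which is what makes the substitution transparent to existing code.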

See also the README_SSE file, and the inline_sse.h header file.


Don Holmgren
Last Modified: 28th Jan 2002