nasm2c.pl
NASM to Inline GCC Translator
We've implemented a NASM-to-inline-gcc translator, nasm2c.pl, which generates gcc assembler macros from
NASM source code. For this translator to succeed, the following
conventions must be used in the NASM source code:
push, mov, pop, add, and ret operations must be associated solely with the stack handling operations used to reference arguments. Lines with these codes are deleted from the inline macros.
[...]
NASM constructs) must include in the comment field on the same line a construct, set off by <...>, which gives a C-language reference to the address, expressed in terms of the macro argument. For example, if the inline macro definition is

    #define _inline_sse_mult_su3_nn(aa,bb,cc)

where (aa, bb, cc) are of type (su3_matrix *), a possible source line containing a reference construct would be:

    movss xmm3,[eax]    ; <(aa)->e[0][0].real>

Here, in the NASM version, [eax] would dereference the first argument, passed on the stack, to mult_su3_nn(a,b,c). In the translated inline version, where no stack is used to transfer arguments, the code directly references a->e[0][0].real.
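To make the convention concrete, here is a minimal C sketch that pulls the reference construct out of the comment field of one NASM line. It is only an illustration of the convention: nasm2c.pl is a Perl script and its actual parsing is not shown here, and the function name extract_reference and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of nasm2c.pl: copy the text between '<' and
 * the closing '>' in the comment field of one NASM source line into out.
 * Returns 1 on success, 0 if the line has no comment, no <...> construct,
 * an empty construct, or a construct that does not fit in out. */
int extract_reference(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    /* NASM comments start at the first ';' on the line. */
    const char *comment = strchr(line, ';');
    if (comment == NULL)
        return 0;

    const char *open = strchr(comment, '<');
    if (open == NULL)
        return 0;

    /* Use the LAST '>' so that the '>' inside "->" is not mistaken
     * for the terminator of the construct. */
    const char *close = strrchr(open, '>');
    if (close == NULL)
        return 0;

    size_t len = (size_t)(close - open) - 1;
    if (len == 0 || len >= out_size)
        return 0;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 1;
}
```

Applied to the example line, this sketch yields the string (aa)->e[0][0].real, which is the kind of operand that ends up in the generated macro. A line whose construct is missing its closing '>' but contains "->" would be truncated at the arrow, so a real tool would want stricter checks.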
By default, the macro arguments aa, bb, cc are assumed. Override this in the invocation of nasm2c.pl, e.g.:

    ./nasm2c.pl sse_routine.nas "aa,bb0,bb1,bb2,bb3,cc"

Note that there should be corresponding <...> reference constructs in your NASM code.
add_su3_vector:

    ;
    ; sse_add_su3_vector( su3_vector *a, su3_vector *b, su3_vector *c)
    ;
    ;
    global sse_add_su3_vector
    sse_add_su3_vector:
            push    ebp
            mov     ebp,esp
            push    eax
            push    ebx
            push    ecx
            mov     eax,[ebp+8]         ; su3_vector *a
            mov     ebx,[ebp+12]        ; su3_vector *b
            mov     ecx,[ebp+16]        ; su3_vector *c
            movups  xmm0,[eax]          ; <(aa)->c[0]>
            movlps  xmm1,[eax+16]       ; <(aa)->c[2]>
            shufps  xmm1,xmm1,0x44
            movups  xmm2,[ebx]          ; <(bb)->c[0]>
            movlps  xmm3,[ebx+16]       ; <(bb)->c[2]>
            shufps  xmm3,xmm3,0x44
            addps   xmm0,xmm2
            addps   xmm1,xmm3
            movups  [ecx],xmm0          ; <(cc)->c[0]>
            movlps  [ecx+16],xmm1       ; <(cc)->c[2]>
    here:
            pop     ecx
            pop     ebx
            pop     eax
            mov     esp,ebp
            pop     ebp
            ret

The translated version, generated by

    ./nasm2c.pl sse_addvec.nas > sse_addvec.h

looks like the following:

    #define _inline_sse_add_su3_vector(aa,bb,cc) \
    { \
    __asm__ __volatile__ ("movups %0, %%xmm0 \n\t" \
                          "movlps %1, %%xmm1 \n\t" \
                          "shufps $0x44, %%xmm1, %%xmm1 \n\t" \
                          "movups %2, %%xmm2 \n\t" \
                          "movlps %3, %%xmm3 \n\t" \
                          "shufps $0x44, %%xmm3, %%xmm3 \n\t" \
                          "addps %%xmm2, %%xmm0 \n\t" \
                          "addps %%xmm3, %%xmm1 \n\t" \
                          : \
                          : \
                          "m" ((aa)->c[0]), \
                          "m" ((aa)->c[2]), \
                          "m" ((bb)->c[0]), \
                          "m" ((bb)->c[2])); \
    __asm__ __volatile__ ("movups %%xmm0, %0 \n\t" \
                          "movlps %%xmm1, %1 \n\t" \
                          : \
                          "=m" ((cc)->c[0]), \
                          "=m" ((cc)->c[2])); \
    }
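For comparison, a portable C sketch of what this routine computes may be useful when checking the inline version against a reference. The type definitions below are stand-ins inferred from the references used above ((aa)->c[0], .real); the real MILC headers may differ, and the names complex_f and add_su3_vector_ref are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; field names follow the
 * reference constructs shown above. */
typedef struct { float real; float imag; } complex_f;
typedef struct { complex_f c[3]; } su3_vector;

/* Portable reference: c = a + b, one complex component at a time.
 * Three complex additions amount to six floating point additions. */
void add_su3_vector_ref(const su3_vector *a, const su3_vector *b,
                        su3_vector *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}
```

With this layout, the movups in the SSE code covers c[0] and c[1] in one 16-byte load and the movlps covers c[2], so the two versions touch the same six floats.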
Performance of the inline routines
The table below shows the routines that currently have inline SSE implementations, along with their cycle timings on Pentium III, Pentium 4, and Athlon MP chips. The subroutines in red are used heavily in the improved staggered D-slash routine. The "MILC" column gives timings using C code, and the "SSE" column gives timings using inline SSE code.

                                     Pentium III             Pentium 4               Athlon MP
Routine                     FP Ops   MILC  SSE  MFlops/GHz   MILC  SSE  MFlops/GHz   MILC  SSE  MFlops/GHz
mult_su3_mat_vec                66    148   82         805    124   57        1158    113   92         717
mult_adj_su3_mat_vec            66    145   82         805    121   57        1158    110   92         717
mult_su3_mat_vec_sum_4dir      282    629  348         810    598  249        1133    533  372         758
mult_adj_su3_mat_vec_4dir      264    502  279         946    530  320         825    371  317         833
mult_adj_su3_mat_vec_4vec      264    509  324         815    534  228        1158    383  371         712
mult_adj_su3_mat_hwvec         132    300  106        1245    268   73        1808    192  120        1100
mult_su3_mat_hwvec             132    300  106        1245    268   73        1808    194  120        1100
mult_su3_nn                    198    440  188        1053    368  139        1424    362  209         947
mult_su3_na                    198    432  195        1015    422  135        1467    357  217         912
mult_su3_an                    198    449  188        1053    414  130        1523    300  208         952
scalar_mult_add_su3_matrix      36    176   48         750    308   62         581    231   46         783
scalar_mult_add_su3_vector      12     35   15         800     35   18         667     24   16         750
add_su3_vector                   6     27   13         462     26   19         316     21   13         462
sub_four_su3_vecs               24     73   34         706    113   47         511     57   33         727
su3_projector                   54    182   82         659    204   77         701    167   83         651

In the table above, we've included columns labelled FP Ops and MFlops/GHz. The former gives the number of floating point operations necessary to perform the calculation. Multiply the values in the MFlops/GHz column by the clock speed in GHz of the corresponding processor to obtain the MFlop/sec performance with the inlined SSE code. For example, mult_su3_mat_hwvec on a 2 GHz Pentium 4 processor will perform at 2 x 1808 = 3616 MFlop/sec.
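The arithmetic can be written out as a short C sketch. The tabulated MFlops/GHz values are consistent with FP Ops divided by the SSE cycle count, times 1000 (for example 132 / 73 x 1000 is about 1808); that relation is inferred from the numbers rather than stated in the text, and the function names are hypothetical.

```c
#include <assert.h>

/* MFlops delivered per GHz of clock: at 1 GHz a routine that needs
 * `cycles` cycles runs 1000/cycles times per microsecond, and each run
 * performs `fp_ops` floating point operations. */
double mflops_per_ghz(int fp_ops, int cycles)
{
    assert(fp_ops >= 0 && cycles > 0);
    return 1000.0 * fp_ops / cycles;
}

/* Scale the per-GHz figure by the actual clock speed in GHz. */
double mflops_per_sec(double mflops_ghz, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return mflops_ghz * clock_ghz;
}
```

Calling mflops_per_sec(1808.0, 2.0) reproduces the 3616 MFlop/sec figure quoted for mult_su3_mat_hwvec on a 2 GHz Pentium 4.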
Obtaining and using the inline routines
A tar file containing the NASM source code for these routines, nasm2c.pl, a makefile, regression testing and timing programs, and header files is available here. The header file inline_sse.h should be included in any source file in which you want to use these inline routines. You'll either need to edit all invocations of the corresponding MILC math routines to add _inline_sse_ prefixes, or use #define SSE_SUBS to turn on macros in inline_sse.h which will transparently substitute the inline versions.
See also the README_SSE file, and the inline_sse.h header file.
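The substitution mechanism can be sketched in C as follows. This is an assumption about how such a header might be organized, not a copy of inline_sse.h: the types, the body of the stand-in macro (portable C instead of inline assembly), and the last_impl marker are all hypothetical and exist only so the sketch compiles anywhere.

```c
#include <assert.h>

#define SSE_SUBS 1   /* assumption: normally set by the build, e.g. -DSSE_SUBS */

/* Stand-in types for illustration only. */
typedef struct { float real; float imag; } complex_f;
typedef struct { complex_f c[3]; } su3_vector;

/* Illustration only: records which implementation ran last
 * (1 = C routine, 2 = inline macro). */
int last_impl = 0;

/* Stand-in for the C library routine. */
void add_su3_vector(su3_vector *a, su3_vector *b, su3_vector *c)
{
    assert(a && b && c);
    last_impl = 1;
    for (int i = 0; i < 3; i++) {
        c->c[i].real = a->c[i].real + b->c[i].real;
        c->c[i].imag = a->c[i].imag + b->c[i].imag;
    }
}

/* Stand-in for the generated macro; the real one holds inline assembly. */
#define _inline_sse_add_su3_vector(aa, bb, cc)                          \
    do {                                                                \
        last_impl = 2;                                                  \
        for (int i_ = 0; i_ < 3; i_++) {                                \
            (cc)->c[i_].real = (aa)->c[i_].real + (bb)->c[i_].real;     \
            (cc)->c[i_].imag = (aa)->c[i_].imag + (bb)->c[i_].imag;     \
        }                                                               \
    } while (0)

#ifdef SSE_SUBS
/* Defined after the function above, so only later call sites are
 * rewritten; existing calls need no source changes. */
#define add_su3_vector(a, b, c) _inline_sse_add_su3_vector(a, b, c)
#endif
```

With SSE_SUBS defined, a call written as add_su3_vector(&a, &b, &c) expands to the inline macro; without it, the same call reaches the C routine. Writing (add_su3_vector)(&a, &b, &c) bypasses the macro, which can be handy in regression tests that compare the two versions.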