x86: synth filter float: implement SSE2 version
Timings for Arrandale:
C SSE
win32: 2108 334
win64: 1152 322
Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.
Unrolling for ARCH_X86_64 is a 20 cycles gain.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Showing
Please
register
or
sign in
to comment