- 05 Jul, 2017 3 commits
-
-
James Almer authored
This reverts commit 24bb7db4. noise has to be sign extended after all, not zero extended, in tests other than checkasm. Fixes most AAC tests broken by the now-reverted commit.
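The difference between the two extensions discussed in this revert can be illustrated in plain C. This is only a sketch: the function names are hypothetical and not taken from the FFmpeg source, and the actual change was made in x86 assembly.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of FFmpeg: widen a 32-bit value to 64 bits. */

/* Sign extension: the upper 32 bits are copies of the sign bit
 * (what movsxd does on x86-64). */
int64_t widen_signed(int32_t v)
{
    return (int64_t)v;
}

/* Zero extension: the upper 32 bits are cleared
 * (what a 32-bit register write does implicitly on x86-64). */
uint64_t widen_unsigned(int32_t v)
{
    return (uint64_t)(uint32_t)v;
}
```

For non-negative values both helpers agree. For a negative value such as -1 they differ (-1 versus 0xFFFFFFFF), so using the wrong extension on a 64-bit index would address a different location.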
-
James Almer authored
noise needs to be zero extended and it can be done implicitly as a side effect in a subsequent instruction. Signed-off-by: James Almer <jamrial@gmail.com>
-
James Almer authored
Tested-by: Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: James Almer <jamrial@gmail.com>
-
- 30 Jun, 2017 1 commit
-
-
James Almer authored
Tested-by: Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: James Almer <jamrial@gmail.com>
-
- 26 Jun, 2017 1 commit
-
-
Matthieu Bouron authored
Suggested-by: James Almer <jamrial@gmail.com>
-
- 08 Jun, 2016 1 commit
-
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 29 Sep, 2015 1 commit
-
-
Michael Niedermayer authored
Fixes crash. Fixes: flicker-1.scout3d21443372922.28.m4a Found-by: Dale Curtis <dalecurtis@google.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
- 11 Aug, 2015 1 commit
-
-
Henrik Gramner authored
The .text section is already 16-byte aligned by default on all supported platforms so `SECTION_TEXT` isn't any different from `SECTION .text`. Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 07 Aug, 2015 1 commit
-
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 04 Aug, 2015 1 commit
-
-
Henrik Gramner authored
The .text section is already 16-byte aligned by default on all supported platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.
-
- 25 Jan, 2015 2 commits
-
-
Christophe Gisquet authored
Before: 2843 decicycles in ff_sbr_autocorrelate_sse3, 262086 runs, 58 skips. After: 2693 decicycles in ff_sbr_autocorrelate_sse3, 262117 runs, 27 skips. Signed-off-by: James Almer <jamrial@gmail.com>
-
James Almer authored
2 to 2.5 times faster. Signed-off-by: James Almer <jamrial@gmail.com>
-
- 06 Aug, 2014 1 commit
-
-
Christophe Gisquet authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 15 May, 2014 1 commit
-
-
Christophe Gisquet authored
From 133 (unrolled av_intfloat32 C) to 59 cycles on Arrandale/Win64. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 13 Mar, 2014 1 commit
-
-
Diego Biurrun authored
This helps grepping for functions, among other things.
-
- 30 Aug, 2013 1 commit
-
-
Thilo Borgmann authored
-
- 10 May, 2013 1 commit
-
-
Christophe Gisquet authored
From 253 to 51 cycles on Arrandale and Win64. 44 cycles on SandyBridge. Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 08 May, 2013 1 commit
-
-
Christophe Gisquet authored
MSVC complains about the 32-bit addressing, while mingw/gcc does not. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 03 May, 2013 1 commit
-
-
Christophe Gisquet authored
Sandybridge: 47 cycles. Having a loop counter is a 7 cycle gain. Unrolling is another 7 cycle gain. Working in reverse scan is another 6 cycles. Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
- 24 Apr, 2013 1 commit
-
-
Michael Niedermayer authored
This should fix building with MSVC until someone can change the code so it works with MSVC. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 19 Apr, 2013 1 commit
-
-
Christophe Gisquet authored
233 to 105 cycles on Arrandale and Win64. Replacing the multiplication by s_m[m] by a pand and a pxor with appropriate vectors is slower. Unrolling is a 15 cycles win. A SSE version was 4 cycles slower. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 10 Apr, 2013 1 commit
-
-
Christophe Gisquet authored
From 253 to 51 cycles on Arrandale and Win64. 44 cycles on SandyBridge. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 08 Apr, 2013 1 commit
-
-
Christophe Gisquet authored
From 312 to 89/68 (sse/sse2) cycles on Arrandale and Win64. Sandybridge: 68/47 cycles. Having a loop counter is a 7 cycle gain. Unrolling is another 7 cycle gain. Working in reverse scan is another 6 cycles. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 05 Apr, 2013 2 commits
-
-
Christophe Gisquet authored
Timing on Arrandale (C / SSE): Win32: 57 / 44; Win64: 47 / 38. Unrolling and not storing mask both save some cycles. Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
Christophe Gisquet authored
Timing on Arrandale (C / SSE): Win32: 57 / 44; Win64: 47 / 38. Unrolling and not storing mask both save some cycles. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 06 Jan, 2013 2 commits
-
-
Christophe Gisquet authored
255 to 174 cycles on Arrandale / Win64. Unrolling yields no gain. Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
Christophe Gisquet authored
698 to 174 cycles on Arrandale. Unrolling is a 6 cycles gain. Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
- 08 Dec, 2012 1 commit
-
-
Michael Niedermayer authored
Core i7 (Sandy Bridge): 135 to 107 cycles. Core i5 (Arrandale): 162 to 142 cycles (thanks to Christophe Gisquet for testing). Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com> Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 07 Dec, 2012 1 commit
-
-
Christophe Gisquet authored
Start and end indices are multiples of 2, which guarantees aligned access. This also allows generating 4 floats per loop iteration, keeping the alignment throughout. Timing: 32 bits: 326c -> 172c; 64 bits: 323c -> 156c. Signed-off-by: Diego Biurrun <diego@biurrun.de>
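The alignment argument can be checked with a little arithmetic. The sketch below assumes, which the commit message does not state, that each index addresses a pair of 32-bit floats (8 bytes) in a buffer that is itself 16-byte aligned; the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumption: one element is a pair of floats, i.e. 8 bytes when
 * sizeof(float) is 4. */
#define ELEM_BYTES (2 * sizeof(float))

/* Byte offset of element `index` from the start of the buffer. */
size_t elem_offset(size_t index)
{
    return index * ELEM_BYTES;
}

/* Returns 1 if the element starts on a 16-byte boundary, assuming the
 * buffer itself is 16-byte aligned; returns 0 otherwise. */
int elem_is_aligned16(size_t index)
{
    return elem_offset(index) % 16 == 0;
}
```

Under these assumptions every even index lands on a 16-byte boundary, and one step of two elements covers exactly 4 floats, which matches the "4 floats per loop" remark.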
-
- 30 Oct, 2012 2 commits
-
-
Diego Biurrun authored
This is more consistent with the way we handle C #includes and it simplifies the build system.
-
Diego Biurrun authored
This is necessary to allow refactoring some x86util macros with cpuflags.
-
- 04 Apr, 2012 1 commit
-
-
Christophe GISQUET authored
All the more required since the users are pure SSE functions. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
-
- 23 Mar, 2012 1 commit
-
-
Ronald S. Bultje authored
Prevents a signflip in the counter, and a subsequent crash because of overreads/overwrites. Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind CC: libav-stable@libav.org
-
- 21 Mar, 2012 1 commit
-
-
Reimar Döffinger authored
This is even potentially faster in this use-case. Should fix AAC SBR decoding on machines with SSE but not SSE2, fixing track issue #1041. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
-
- 07 Mar, 2012 1 commit
-
-
Reimar Döffinger authored
Since the values are floats, using the float operations makes sense, improves performance on some CPUs and makes the code SSE compatible instead of needing SSE2. Based on suggestion by Jason. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
-
- 06 Mar, 2012 1 commit
-
-
Reimar Döffinger authored
movq from SSE register _to_ memory is an SSE2 instruction. Use the SSE movlps instruction instead, which does the same thing. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
-
- 23 Feb, 2012 2 commits
-
-
Christophe GISQUET authored
Unrolling the main loop to process, instead of 4 elements, 8 elements: minor gain of 2 cycles (not worth the extra object size); 2 elements: loss of 8 cycles. Assigning STEP to a register is a loss. Output address (Y) is almost always unaligned. Timings: C (32/64 bits): 117/109 cycles; SSE: 57 cycles. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
-
Christophe GISQUET authored
The 32-bit targets have been compiled with -mfpmath=sse for proper reference. sbr_sum_square timings: C/32 bits: 82c (unrolled) / 102c; C/64 bits: 69c (unrolled) / 82c; SSE/32 bits: 42c; SSE/64 bits: 31c. Use of SSE4.1 dpps to perform the final sum is slower. Not unrolling to perform 8 operations in a loop yields 10 more cycles. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
-