Commits · c950beb68dee016e0e0a1b729d40abf700d32d1a · Linshizhi / ffmpeg.wasm-core

05 Jul, 2017 3 commits

Revert "x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main" · 9d5e81d3

James Almer authored 7 years ago

This reverts commit 24bb7db4.

noise has to be sign extended after all, not zero extended, in tests
other than checkasm.
Fixes most aac tests broken by the now reverted commit.

9d5e81d3
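
The distinction this revert hinges on can be illustrated in C. This is a minimal sketch, not the assembly touched by the commit: the helper names `widen_sign` and `widen_zero` are hypothetical, and the casts only mimic what a sign-extending move and a plain 32-bit move would leave in a 64-bit register.

```c
#include <stdint.h>

/* Hypothetical helper: widen a 32-bit value with sign extension,
 * so the upper 32 bits are copies of the sign bit. */
int64_t widen_sign(int32_t v)
{
    return (int64_t)v;
}

/* Hypothetical helper: widen a 32-bit value with zero extension,
 * so the upper 32 bits are always cleared. */
uint64_t widen_zero(int32_t v)
{
    return (uint64_t)(uint32_t)v;
}
```

For non-negative inputs both helpers return the same number. For `-1` the first returns `-1` and the second returns `4294967295`, so the choice only matters when the value being widened can be negative.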

x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main · 24bb7db4

James Almer authored 7 years ago

noise needs to be zero extended and it can be done implicitly as a side effect
in a subsequent instruction.
Signed-off-by: James Almer <jamrial@gmail.com>

24bb7db4

x86/sbrdsp: zero extend m_max in apply_noise_main · bcbe9e44

James Almer authored 7 years ago

Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>

bcbe9e44

30 Jun, 2017 1 commit
- x86/sbrdsp: sign extend start and end gprs in ff_sbr_hf_gen_sse · ac8ad8d0
  James Almer authored 7 years ago
```
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
```
  ac8ad8d0
26 Jun, 2017 1 commit
- lavc/x86: clear r2 higher bits in ff_sbr_sum_square · db5bf64b
  Matthieu Bouron authored 7 years ago
```
Suggested-by: James Almer <jamrial@gmail.com>
```
  db5bf64b
08 Jun, 2016 1 commit
- x86/aacdec: use HADDPS macro · 82dbfcca
  James Almer authored 8 years ago
```
Signed-off-by: James Almer <jamrial@gmail.com>
```
  82dbfcca
29 Sep, 2015 1 commit

avcodec/x86/sbrdsp: Fix using uninitialized upper 32bit of noise · 1b82b934

Michael Niedermayer authored 9 years ago

Fixes crash
Fixes: flicker-1.scout3d21443372922.28.m4a
Found-by: Dale Curtis <dalecurtis@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

1b82b934

11 Aug, 2015 1 commit

x86inc: Drop SECTION_TEXT macro · ab43beef

Henrik Gramner authored 9 years ago

The .text section is already 16-byte aligned by default on all supported
platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.
Signed-off-by: Anton Khirnov <anton@khirnov.net>

ab43beef

07 Aug, 2015 1 commit
- x86/sbrdsp: remove an unnecessary mova in sbr_autocorrelate · 9c0407e8
  James Almer authored 9 years ago
```
Signed-off-by: James Almer <jamrial@gmail.com>
```
  9c0407e8
04 Aug, 2015 1 commit

x86inc: Drop SECTION_TEXT macro · f0b7882c

Henrik Gramner authored 9 years ago

The .text section is already 16-byte aligned by default on all supported
platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.

f0b7882c

25 Jan, 2015 2 commits

x86/sbrdsp: Use different mem moves · 7aeafacf

Christophe Gisquet authored 9 years ago

Before
2843 decicycles in ff_sbr_autocorrelate_sse3, 262086 runs, 58 skips

After
2693 decicycles in ff_sbr_autocorrelate_sse3, 262117 runs, 27 skips
Signed-off-by: James Almer <jamrial@gmail.com>

7aeafacf

x86/sbrdsp: add ff_sbr_autocorrelate_{sse,sse3} · 449b21bf
James Almer authored 9 years ago
```
2 to 2.5 times faster.
Signed-off-by: James Almer <jamrial@gmail.com>
```
449b21bf

06 Aug, 2014 1 commit
- x86: sbrdsp/fft: reuse ps_neg constant · 75837e9a
  Christophe Gisquet authored 10 years ago
```
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
  75837e9a
15 May, 2014 1 commit

x86: sbrdsp: implement SSE qmf_deint_neg · d1310c59

Christophe Gisquet authored 12 years ago

From 133 (unrolled av_intfloat32 C) to 59 cycles on Arrandale/Win64.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

d1310c59

13 Mar, 2014 1 commit
- x86: Make function prototype comments in assembly code consistent · 55519926
  Diego Biurrun authored 10 years ago
```
This helps grepping for functions, among other things.
```
  55519926
30 Aug, 2013 1 commit
- Reinstate proper FFmpeg license for all files. · d814a839
  Thilo Borgmann authored 11 years ago
  
  d814a839
10 May, 2013 1 commit

x86: sbrdsp: implement SSE2 qmf_pre_shuffle · 2c299d41

Christophe Gisquet authored 12 years ago

From 253 to 51 cycles on Arrandale and Win64.
44 cycles on SandyBridge.
Signed-off-by: Anton Khirnov <anton@khirnov.net>

2c299d41

08 May, 2013 1 commit

x86: sbrdsp: force PIC addressing for Win64 · fc37cd43

Christophe Gisquet authored 11 years ago

MSVC complains about the 32bits addressing, while mingw/gcc does not.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

fc37cd43

03 May, 2013 1 commit

x86: sbrdsp: Implement SSE2 qmf_deint_bfly · 5a97469a

Christophe Gisquet authored 11 years ago

Sandybridge: 47 cycles

Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>

5a97469a

24 Apr, 2013 1 commit

avcodec/x86/sbrdsp_init: disable using the noise code in x86_64 MSVC, Try #2 · fc690333

Michael Niedermayer authored 11 years ago

This should fix building with MSVC until someone can change the
code so it works with MSVC.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

fc690333

19 Apr, 2013 1 commit

x86: sbrdsp: implement SSE2 hf_apply_noise · 76c72773

Christophe Gisquet authored 11 years ago

233 to 105 cycles on Arrandale and Win64.
Replacing the multiplication by s_m[m] by a pand and a pxor with
appropriate vectors is slower. Unrolling is a 15 cycles win.
A SSE version was 4 cycles slower.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

76c72773

10 Apr, 2013 1 commit

x86: sbrdsp: implement SSE2 qmf_pre_shuffle · 2383068c

Christophe Gisquet authored 11 years ago

From 253 to 51 cycles on Arrandale and Win64.
44 cycles on SandyBridge.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

2383068c

08 Apr, 2013 1 commit

x86: sbrdsp: implement SSE qmf_deint_bfly · e2946e5c

Christophe Gisquet authored 11 years ago

From 312 to 89/68 (sse/sse2) cycles on Arrandale and Win64.
Sandybridge: 68/47 cycles.

Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

e2946e5c

05 Apr, 2013 2 commits

x86: sbrdsp: Implement SSE neg_odd_64 · f4b0d12f

Christophe Gisquet authored 12 years ago

Timing on Arrandale:
        C   SSE
Win32:  57   44
Win64:  47   38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>

f4b0d12f

x86: sbrdsp: implement SSE neg_odd_64 · 37a97083

Christophe Gisquet authored 11 years ago

Timing on Arrandale:
        C   SSE
Win32:  57   44
Win64:  47   38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

37a97083

06 Jan, 2013 2 commits

x86: sbrdsp: Implement SSE qmf_post_shuffle · 4f506466

Christophe Gisquet authored 12 years ago

255 to 174 cycles on Arrandale / Win64. Unrolling yields no gain.
Signed-off-by: Diego Biurrun <diego@biurrun.de>

4f506466

x86: sbrdsp: Implement SSE sum64x5 · 44a0036d

Christophe Gisquet authored 12 years ago

698 to 174 cycles on Arrandale. Unrolling is a 6 cycles gain.
Signed-off-by: Diego Biurrun <diego@biurrun.de>

44a0036d

08 Dec, 2012 1 commit

sbr_hf_gen_sse: Optimize code a bit more. · 0110108a

Michael Niedermayer authored 12 years ago

Core I7 (Sandy Bridge) 135 to 107 cycles
Core i5 (Arrandale) 162 to 142 (Thanks to Christophe Gisquet for testing)
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

0110108a

07 Dec, 2012 1 commit

SBR DSP x86: implement SSE sbr_hf_gen · 2aef3d66

Christophe Gisquet authored 12 years ago

Start and end indices are multiples of 2, which guarantees aligned access.
This also allows generating 4 floats per loop iteration while keeping the
alignment throughout.

Timing:
- 32 bits: 326c -> 172c
- 64 bits: 323c -> 156c
Signed-off-by: Diego Biurrun <diego@biurrun.de>

2aef3d66
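
The alignment argument in this message can be checked with a little arithmetic. The sketch below assumes each sample is a complex value stored as two 4-byte floats and that the array base is already 16-byte aligned; the names `complex_sample`, `sample_offset_bytes` and `offset_is_16_aligned` are hypothetical and do not come from the commit.

```c
#include <stddef.h>

/* Assumption: one sample is a complex value stored as float[2]. */
typedef float complex_sample[2];

/* Byte offset of sample `index` from the start of the array. */
size_t sample_offset_bytes(size_t index)
{
    return index * sizeof(complex_sample);
}

/* Returns 1 when that offset keeps a 16-byte-aligned base aligned. */
int offset_is_16_aligned(size_t index)
{
    return sample_offset_bytes(index) % 16 == 0;
}
```

Under these assumptions every even index lands on a 16-byte boundary, and two consecutive samples make up the four floats that fit in one SSE register.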

30 Oct, 2012 2 commits
- x86: yasm: Use complete source path for macro helper %includes · 04581c8c
  Diego Biurrun authored 12 years ago
```
This is more consistent with the way we handle C #includes and
it simplifies the build system.
```
  04581c8c
- x86: include x86inc.asm in x86util.asm · 6860b408
  Diego Biurrun authored 12 years ago
```
This is necessary to allow refactoring some x86util macros with cpuflags.
```
  6860b408
04 Apr, 2012 1 commit
- dsputil x86: use SSE float instruction instead of SSE2 integer equivalent · 6b81da2f
  Christophe GISQUET authored 12 years ago
```
All the more required since the users are pure SSE functions.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
```
  6b81da2f
23 Mar, 2012 1 commit

aacsbr: handle m_max values smaller than 4. · 71ea2681

Ronald S. Bultje authored 12 years ago

Prevents a signflip in the counter, and a subsequent crash because of
overreads/overwrites.

Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org

71ea2681
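
The failure mode described here can be sketched in C. This is an illustration of the general pattern only, not the code changed by the commit: `scale_by_four` is a hypothetical helper that processes four elements per iteration and must cope with counts below four.

```c
/* Hypothetical helper: multiply n floats by gain, four per iteration,
 * then finish any remainder one element at a time. */
void scale_by_four(float *dst, const float *src, float gain, int n)
{
    int i = 0;

    /* Runs only while a full group of four remains, so n < 4 skips it
     * instead of driving a counter such as (n - 4) below zero. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0] * gain;
        dst[i + 1] = src[i + 1] * gain;
        dst[i + 2] = src[i + 2] * gain;
        dst[i + 3] = src[i + 3] * gain;
    }

    /* Scalar tail covers the last n % 4 elements, which is all of
     * them when n < 4. */
    for (; i < n; i++)
        dst[i] = src[i] * gain;
}
```

A loop written as "subtract four, then test the counter" would instead start from a negative count for `n < 4` and touch memory outside the buffers, which is the kind of overread and overwrite the message refers to.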

21 Mar, 2012 1 commit

Replace SSE2 instruction by SSE equivalent. · 89411ae6

Reimar Döffinger authored 12 years ago

This is even potentially faster in this use-case.
Should fix AAC SBR decoding on machines with SSE but not
SSE2, fixing track issue #1041.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>

89411ae6

07 Mar, 2012 1 commit

sbrdsp.asm: convert all instructions to float/SSE ones. · 6eda85e1

Reimar Döffinger authored 12 years ago

Since the values are floats, using the float operations
makes sense, improves performance on some CPUs and
makes the code SSE compatible instead of needing SSE2.

Based on suggestion by Jason.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>

6eda85e1

06 Mar, 2012 1 commit

SBR DSP: fix SSE code to not use SSE2 instructions. · b5161908

Reimar Döffinger authored 12 years ago

movq from SSE register _to_ memory is an SSE2 instruction.
Use the SSE movlps function instead that does the same thing.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>

b5161908

23 Feb, 2012 2 commits

SBR DSP x86: implement SSE sbr_hf_g_filt · 2784d187

Christophe GISQUET authored 12 years ago

Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.

Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.

Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>

2784d187

SBR DSP x86: implement SSE sbr_sum_square_sse · 34454c76

Christophe GISQUET authored 12 years ago

The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C  /32bits: 82c (unrolled)/102c
               C  /64bits: 69c (unrolled)/82c
               SSE/32bits: 42c
               SSE/64bits: 31c

Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>

34454c76