Commits · 0219dc6c072586a14c641d108ef3e7da70fecae7 · Linshizhi / ffmpeg.wasm-core

26 Jul, 2016 1 commit

vp9: add 32x32 idct AVX2 implementation. · 726501a3

About 1.8x speedup compared to AVX version for full IDCT. Other
sub-IDCT scenarios also see speedups. Full --bench output for
idct_32x32_add_${bpp}_${subidct}_${opt} (50k cycles):

nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
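
The headline figure can be checked directly against the listing above. The following is a minimal Python sketch, not part of the commit: the helper names `parse_bench` and `speedup` are hypothetical, and only the full-IDCT (`_32_`) AVX and AVX2 rows are copied from the output.

```python
def parse_bench(text):
    """Parse 'name: value' lines into a dict of floats."""
    results = {}
    for line in text.strip().splitlines():
        name, _, value = line.rpartition(":")
        results[name.strip()] = float(value)
    return results


def speedup(results, prefix, old, new):
    """Return how many times faster `new` is than `old` for one function prefix."""
    return results[f"{prefix}_{old}"] / results[f"{prefix}_{new}"]


# Two rows copied from the --bench output above.
BENCH = """
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
"""

results = parse_bench(BENCH)
full_idct = speedup(results, "vp9_inv_dct_dct_32x32_add_8_32", "avx", "avx2")
print(f"full 32x32 IDCT, AVX -> AVX2: {full_idct:.2f}x")
```

Feeding the remaining rows through the same helpers gives the per-sub-IDCT ratios.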

11 Jul, 2016 1 commit

vp9: add 16x16 idct avx2 (8-bit). · f0a2b624

Ronald S. Bultje authored 8 years ago

checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:

nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4

13 Oct, 2015 4 commits
- vp9: refactor itx coefficients and share between 8 and 10/12bpp. · 408bb855
  Ronald S. Bultje authored 9 years ago
- vp9: add x86 simd (sse2/ssse3) for iadst4 10bpp functions. · f76423d0
  Ronald S. Bultje authored 9 years ago
- vp9: add 10bpp simd (mmxext/ssse3) for idct_idct_4x4. · 6b579cf5
  Ronald S. Bultje authored 9 years ago
- vp9: add 10/12bpp mmxext-optimized iwht_iwht_4x4 function. · 1c3be325
  Ronald S. Bultje authored 9 years ago
12 Sep, 2015 2 commits
- x86: port PSIGNW to cpuflags · d5f8a642
  James Almer authored 9 years ago
```
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
```
- vp9: save one (PSIGNW) instruction in iadst16_1d sse2/ssse3. · 4b66274a
  Ronald S. Bultje authored 9 years ago
10 Sep, 2015 1 commit

vp9: fix overflow in 8x8 topleft 32x32 idct ssse3 version. · fd8b90f5

Ronald S. Bultje authored 9 years ago

Also disable the mmx/iwht optimization when the bitexact flag is set.
With synthetically coded coefficients (i.e. those that lead to a
residual well outside the [-255,255] range), our optimizations will
overflow. It doesn't make sense to fix the overflows, since they can
only occur on synthetic input, not on real fwht-generated input. Thus,
add a bitexact flag that disables this optimization.

06 Sep, 2015 1 commit
- vp9: fix integer overflows in sse2 version of iadst4. · f12093ff
  Ronald S. Bultje authored 9 years ago
05 Sep, 2015 1 commit
- vp9: fix rounding error in idct_8x8_ssse3. · 086c9b78
  Ronald S. Bultje authored 9 years ago
14 May, 2015 2 commits
- vp9: disable more pmulhrsw optimizations in idct16/32. · d32d0593
  Ronald S. Bultje authored 9 years ago
```
For idct16, only when called from an adst16x16 variant, so impact is
minor. For idct32, for all, so relatively major impact.
```
- vp9: disable all pmulhrsw in 8/16 iadst x86 optimizations. · 96d30c34
  Ronald S. Bultje authored 9 years ago
```
They all overflow in various samples that are considered valid input.
```
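
To illustrate the kind of overflow these two commits guard against, here is a minimal Python emulation of one signed 16-bit lane of `pmulhrsw`. It is a sketch under stated assumptions: the helper names are hypothetical, the constant 11585 is the one named elsewhere in this log ("pre-load of 11585x2"), and the idea that the problem comes from a wrapping 16-bit add ahead of the multiply is an illustrative assumption, not something the commit messages spell out. The input values are made up rather than taken from a failing sample.

```python
def wrap16(x):
    """Wrap an integer to signed 16 bits, as a packed 16-bit add would."""
    return ((x + 0x8000) & 0xFFFF) - 0x8000


def pmulhrsw(a, b):
    """Emulate one signed 16-bit lane of pmulhrsw: (a * b + 0x4000) >> 15."""
    return wrap16((a * b + 0x4000) >> 15)


def reference(x, coef):
    """Rounded 14-bit fixed-point multiply carried out in wide arithmetic."""
    return (x * coef + (1 << 13)) >> 14


COEF = 11585  # constant named in "pre-load of 11585x2"

# In range: pmulhrsw with the doubled coefficient matches the wide computation.
small = 100 + 200
assert pmulhrsw(wrap16(small), 2 * COEF) == reference(small, COEF)

# Out of range (made-up values): the 16-bit sum wraps before the multiply,
# so the two results no longer agree.
big = 20000 + 20000
print(pmulhrsw(wrap16(big), 2 * COEF), "vs", reference(big, COEF))
```

The single-instruction form is only safe while every intermediate value fits in 16 bits, which would explain why it is disabled rather than patched where valid input can exceed that range.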
24 Apr, 2015 1 commit
- vp9: remove another optimization branch in iadst16 which causes overflows. · 3de13d52
  Ronald S. Bultje authored 9 years ago
```
See sample vp90-2-14-resize-fp-tiles-16-8.webm from the vp9 test vector
set to reproduce the issue.
```
22 Apr, 2015 1 commit

vp9: remove one optimization branch in iadst16 which causes overflows. · d02d04a1

Ronald S. Bultje authored 9 years ago

See sample vp90-2-14-resize-fp-tiles-16-8-4-2-1.webm from the vp9 test
vector set which reproduces the issue. This probably costs a few cycles,
but I don't think there's an easy way to work around that.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

08 Feb, 2015 1 commit
- x86/vp9dsp: fix clobbering of xmm6 on IDCT sse2 functions · 92d903af
  James Almer authored 10 years ago
```
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
```
16 Dec, 2014 1 commit
- vp9/x86: save one register on 32bit idct32x32. · 0a7964dc
  Ronald S. Bultje authored 10 years ago
```
Fixes build on win32.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
14 Dec, 2014 1 commit
- vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly · fd77fbb3
  Ronald S. Bultje authored 10 years ago
```
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
06 Aug, 2014 1 commit
- x86: vpx/h264/hevc/mpeg2: share constants · 4e128ab0
  Christophe Gisquet authored 10 years ago
```
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
25 Jan, 2014 4 commits

vp9/x86: use explicit register for relative stack references. · c9e6325e

Ronald S. Bultje authored 11 years ago

Before this patch, we explicitly modified rsp, which isn't universally
acceptable, since the space under the stack pointer might be modified
by things like signal handlers. Therefore, use an explicit register to
hold the stack pointer relative to the bottom of the stack (i.e. rsp).
This will also clear out valgrind errors about the use of uninitialized
data that started occurring after the idct16x16/ssse3 optimizations
were first merged.

vp9/x86: iwht4x4 (lossless) mmx. · 97474d52

Ronald S. Bultje authored 11 years ago

vp9/x86: 4x4 iadst SIMD (ssse3) variants. · d43efa68

Ronald S. Bultje authored 11 years ago

Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct:    66 -> 67 cycles (noise measurement)
idct_iadst:  199 -> 79 cycles
iadst_idct:  165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles

vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. · baf47020

Ronald S. Bultje authored 11 years ago

Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct:   133 -> 135 cycles (noise measurement)
idct_iadst:  900 -> 241 cycles
iadst_idct:  864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles

16 Jan, 2014 1 commit

vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx). · 8173d1ff

Ronald S. Bultje authored 11 years ago

Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct:  4672 -> 1175 cycles
idct_iadst:  4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.

15 Jan, 2014 1 commit

vp9/x86: add AVX for itxfm and lpf. · 8b4190da

Clément Bœsch authored 11 years ago

4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips

3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips

23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips

4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips

967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips

12 Jan, 2014 3 commits
- vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X. · e11ceea6
  Clément Bœsch authored 11 years ago
- vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X. · c9aa0b8f
  Clément Bœsch authored 11 years ago
- vp9/x86: merge IDCT coef macros. · 7c55ee61
  Clément Bœsch authored 11 years ago
08 Jan, 2014 4 commits

vp9/x86: make STORE_2X2 macro local. · c6fe984f

Ronald S. Bultje authored 11 years ago

Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct. · 04a187fb

Ronald S. Bultje authored 11 years ago

Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.

vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct. · 37b001d1

Ronald S. Bultje authored 11 years ago

Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
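
The "% faster" figures in this message are consistent with a speedup ratio (old / new minus one) rather than the share of time saved. A minimal Python check, using only the numbers quoted above; the function names are hypothetical:

```python
def percent_faster(old, new):
    """Speedup as a percentage: how much more work fits in the same time."""
    return (old / new - 1.0) * 100.0


def percent_time_saved(old, new):
    """Share of the original time that was removed, for comparison."""
    return (1.0 - new / old) * 100.0


# (old, new) pairs quoted in the commit message.
measurements = {
    "all IDCTs, intra (cycles)": (3327, 2473),
    "all IDCTs, inter (cycles)": (2312, 1448),
    "ped1080p.webm decode (s)": (8.086, 7.974),
}

for label, (old, new) in measurements.items():
    print(f"{label}: {percent_faster(old, new):.1f}% faster, "
          f"{percent_time_saved(old, new):.1f}% less time")
```

For small changes such as the total decode time, the two definitions nearly coincide; for the IDCT cycle counts they differ noticeably.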

vp9/x86: idct_32x32_add_ssse3. · e84d14df

Ronald S. Bultje authored 11 years ago

Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about 7x faster (all
tests done on ped1080p.webm).

26 Dec, 2013 1 commit

vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38). · 0d9375fc

Ronald S. Bultje authored 11 years ago

Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
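
The selection logic implied by the commit title can be sketched as follows. This is a hedged Python illustration, not the assembly itself: the function name and variant labels are hypothetical, the `eob <= 38` threshold comes from the commit title, and treating `eob == 1` as the DC-only case is an assumption.

```python
def pick_idct16x16_variant(eob):
    """Pick a 16x16 inverse-transform variant from the end-of-block index."""
    if eob < 1:
        raise ValueError("eob must be at least 1 for a coded block")
    if eob == 1:
        # Assumption: a single coefficient means only DC is present.
        return "dc_only"
    if eob <= 38:
        # Threshold from the commit title: coefficients confined to the
        # top-left 8x8 subblock, so the rest of the transform can be skipped.
        return "sub8x8"
    return "full"


for eob in (1, 10, 38, 39, 256):
    print(eob, pick_idct16x16_variant(eob))
```

The gain reported above comes from blocks taking the middle branch, which do roughly the work of a smaller transform instead of the full 16x16 one.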

14 Dec, 2013 1 commit

vp9/x86: idct_add_16x16_ssse3. · 8d4c616f

Ronald S. Bultje authored 11 years ago

Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.

07 Dec, 2013 4 commits
- vp9: implement top/left half (4x4) sub-8x8-IDCT. · 92436e8a
  Ronald S. Bultje authored 11 years ago
```
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
```
- vp9: split pre-load of 11585x2 out of 1d idct macro. · b2045c44
  Ronald S. Bultje authored 11 years ago
```
This allows us to load it only once, instead of twice, in this function.
```
- vp9: minor refactorings in idct ssse3 assembly. · f9a0d4c6
  Ronald S. Bultje authored 11 years ago
```
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
```
- vp9: split x86 assembly in two files. · 8729964b
  Ronald S. Bultje authored 11 years ago
```
(And in future, loopfilter or intra pred could be put in their own
respective files also.)
```
21 Nov, 2013 1 commit
- avcodec/x86/vp9dsp: merge a few SWAP together. · 616da595
  Clément Bœsch authored 11 years ago
15 Nov, 2013 1 commit

lavc: VP9 decoder · 72ca830f

Ronald S. Bultje authored 11 years ago

Originally written by Ronald S. Bultje <rsbultje@gmail.com> and
Clément Bœsch <u@pkh.me>

Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
