Commits · a7ed01082f25aefdb9996716737b9892dafdf9de · Linshizhi / ffmpeg.wasm-core

06 Aug, 2014 1 commit
- x86: vpx/h264/hevc/mpeg2: share constants · 4e128ab0
  Christophe Gisquet authored 10 years ago
```
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
25 Jan, 2014 4 commits

vp9/x86: use explicit register for relative stack references. · c9e6325e

Ronald S. Bultje authored 10 years ago

Before this patch, we explicitly modified rsp, which is not universally
safe, since the space under the stack pointer may be overwritten by
things like signal handlers. Therefore, use an explicit register to hold
the stack pointer relative to the bottom of the stack (i.e. rsp). This
also clears the valgrind errors about the use of uninitialized data that
started occurring after the idct16x16/ssse3 optimizations were first
merged.

vp9/x86: iwht4x4 (lossless) mmx. · 97474d52
Ronald S. Bultje authored 10 years ago

vp9/x86: 4x4 iadst SIMD (ssse3) variants. · d43efa68

Ronald S. Bultje authored 10 years ago

Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct:    66 -> 67 cycles (within measurement noise)
idct_iadst:  199 -> 79 cycles
iadst_idct:  165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
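
The before/after figures above can be turned into speedup ratios with a few lines of Python. This is only a sketch for reading the numbers: the `speedups` helper and the `REPORT` constant are names made up for illustration and are not part of FFmpeg.

```python
import re

# Cycle counts copied from the measurements above (before -> after).
REPORT = """\
idct_idct:    66 -> 67 cycles
idct_iadst:  199 -> 79 cycles
iadst_idct:  165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
"""

# Matches lines such as "idct_iadst:  199 -> 79 cycles".
LINE = re.compile(r"^(\w+):\s+(\d+)\s*->\s*(\d+) cycles")


def speedups(report):
    """Return {function: before/after ratio} for each 'a -> b cycles' line."""
    result = {}
    for line in report.splitlines():
        match = LINE.match(line)
        if match:
            name = match.group(1)
            before, after = int(match.group(2)), int(match.group(3))
            result[name] = before / after
    return result


for name, ratio in speedups(REPORT).items():
    print(f"{name:12s} {ratio:.2f}x")
```

A ratio close to 1.0, as for `idct_idct`, means the function was effectively unchanged by the commit.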

vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. · baf47020

Ronald S. Bultje authored 10 years ago

Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct:   133 -> 135 cycles (within measurement noise)
idct_iadst:  900 -> 241 cycles
iadst_idct:  864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles

16 Jan, 2014 1 commit

vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx). · 8173d1ff

Ronald S. Bultje authored 10 years ago

Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct:  4672 -> 1175 cycles
idct_iadst:  4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.

15 Jan, 2014 1 commit

vp9/x86: add AVX for itxfm and lpf. · 8b4190da

Clément Bœsch authored 10 years ago

4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips

3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips

23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips

4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips

967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
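
To compare each ssse3/avx pair above, one option is to group the timer lines by function name and compute the relative cycle reduction. The sketch below assumes the line format shown in this commit message; `avx_gains` and `REPORT` are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict

# Timer lines copied from the commit message above.
REPORT = """\
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
"""

# Captures the decicycle count, the function base name, and the ISA suffix.
LINE = re.compile(r"^(\d+) decicycles in (\w+)_(ssse3|avx), \d+ runs, \d+ skips$")


def avx_gains(report):
    """Return {function: percent fewer decicycles with avx than with ssse3}."""
    timings = defaultdict(dict)
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            timings[match.group(2)][match.group(3)] = int(match.group(1))
    return {
        name: 100.0 * (t["ssse3"] - t["avx"]) / t["ssse3"]
        for name, t in timings.items()
        if "ssse3" in t and "avx" in t
    }


for name, gain in avx_gains(REPORT).items():
    print(f"{name}: {gain:.1f}% fewer decicycles with avx")
```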

12 Jan, 2014 3 commits
- vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X. · e11ceea6
  Clément Bœsch authored 10 years ago
  
- vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X. · c9aa0b8f
  Clément Bœsch authored 10 years ago
  
- vp9/x86: merge IDCT coef macros. · 7c55ee61
  Clément Bœsch authored 10 years ago
  
08 Jan, 2014 4 commits

vp9/x86: make STORE_2X2 macro local. · c6fe984f

Ronald S. Bultje authored 10 years ago

Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>

vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct. · 04a187fb

Ronald S. Bultje authored 11 years ago

Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.

vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct. · 37b001d1

Ronald S. Bultje authored 11 years ago

Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
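
The "% faster" figures above appear to be throughput ratios (before divided by after, minus one), which is a different number from the share of time saved. The following sketch shows both readings for the quoted cycle counts; the two function names are made up for illustration.

```python
def percent_faster(before, after):
    """Throughput gain: how much more work fits into the same time."""
    return 100.0 * (before / after - 1.0)


def percent_time_saved(before, after):
    """Share of the original runtime that no longer has to be spent."""
    return 100.0 * (before - after) / before


# Figures quoted in the commit message above.
CASES = {
    "all IDCTs (intra), cycles": (3327, 2473),
    "all IDCTs (inter), cycles": (2312, 1448),
    "total decode, seconds": (8.086, 7.974),
}

for label, (before, after) in CASES.items():
    print(
        f"{label}: {percent_faster(before, after):.1f}% faster, "
        f"{percent_time_saved(before, after):.1f}% less time"
    )
```

For small changes such as the total decode time the two numbers nearly coincide; for the large per-function gains they differ noticeably.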

vp9/x86: idct_32x32_add_ssse3. · e84d14df

Ronald S. Bultje authored 11 years ago

Sub-IDCTs will follow later. Decoding ped1080p.webm goes from 9.295s to
8.191s (13.5% faster). For the DC-only form, the IDCT goes from 4372
(intra) or 4337 (inter) to 403 (intra) or 329 (inter) cycles. For the
no-DC form, it goes from 23755 (intra) or 23723 (inter) to 3497 (intra)
or 3607 (inter) cycles. Averaged over all 32x32s together, it goes from
23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter) cycles,
i.e. about 7x faster (all tests done on ped1080p.webm).

26 Dec, 2013 1 commit

vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38). · 0d9375fc

Ronald S. Bultje authored 11 years ago

Sub8x8 speed (without the dc-only case) goes from ~750 cycles (inter) or
~735 cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 cycles (intra). All measurements
were done using ped1080p.webm.
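
The idea behind a sub-IDCT is to pick a cheaper transform when the end-of-block index (eob) shows that only the low-frequency coefficients are non-zero. A minimal sketch of such a dispatch follows. The `eob <= 38` threshold comes from the commit title; the function name, the returned labels, and the choice of `eob == 1` for the DC-only case are assumptions made for illustration, not the actual assembly.

```python
def pick_idct16x16(eob):
    """Pick a 16x16 inverse-DCT variant from the end-of-block index.

    The eob <= 38 threshold is taken from the commit title. Treating
    eob == 1 as the DC-only case is an assumption made for this sketch.
    """
    if eob == 1:
        # Only the DC coefficient is present: the output is a flat offset.
        return "dc_only"
    if eob <= 38:
        # All non-zero coefficients lie in the top-left 8x8 subblock.
        return "sub8x8"
    # Coefficients may be anywhere: run the full 16x16 transform.
    return "full"


for eob in (1, 12, 38, 39, 256):
    print(eob, pick_idct16x16(eob))
```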

14 Dec, 2013 1 commit

vp9/x86: idct_add_16x16_ssse3. · 8d4c616f

Ronald S. Bultje authored 11 years ago

Currently only the DC-only and full 16x16 forms are implemented. Other
subforms will follow in the near future. Total decoding time of
ped1080p.webm goes from 9.7 to 9.3 seconds. DC-only goes from 957 to 131
cycles, and the full IDCT goes from ~4050 to ~745 cycles.

07 Dec, 2013 4 commits
- vp9: implement top/left half (4x4) sub-8x8-IDCT. · 92436e8a
  Ronald S. Bultje authored 11 years ago
```
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
```
- vp9: split pre-load of 11585x2 out of 1d idct macro. · b2045c44
  Ronald S. Bultje authored 11 years ago
```
This allows us to load it only once, instead of twice, in this function.
```
- vp9: minor refactorings in idct ssse3 assembly. · f9a0d4c6
  Ronald S. Bultje authored 11 years ago
```
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
```
- vp9: split x86 assembly in two files. · 8729964b
  Ronald S. Bultje authored 11 years ago
```
(And in future, loopfilter or intra pred could be put in their own
respective files also.)
```
21 Nov, 2013 1 commit
- avcodec/x86/vp9dsp: merge a few SWAP together. · 616da595
  Clément Bœsch authored 11 years ago
  
15 Nov, 2013 1 commit

lavc: VP9 decoder · 72ca830f

Ronald S. Bultje authored 11 years ago

Originally written by Ronald S. Bultje <rsbultje@gmail.com> and
Clément Bœsch <u@pkh.me>

Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>

05 Nov, 2013 1 commit

avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3(). · 87434cf3

Clément Bœsch authored 11 years ago

1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips

529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips

(~3.9x faster)

7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips

1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips

(~7.1x faster)

Overall decode time before commit:
  16.48s user 0.03s system 99% cpu 16.526 total
  16.54s user 0.01s system 99% cpu 16.566 total
  16.46s user 0.03s system 99% cpu 16.511 total

Overall decode time after commit:
  16.34s user 0.02s system 99% cpu 16.378 total
  16.28s user 0.02s system 99% cpu 16.315 total
  16.32s user 0.03s system 99% cpu 16.366 total

Tested on i7 920 with 40s 1080p footage.
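
The per-function gains are large while the overall decode time barely moves, which suggests these two IDCTs account for only a small share of total runtime. Averaging the three wall-clock totals quoted above makes the overall effect concrete. This is a sketch using only the numbers in this commit message; the helper name is hypothetical.

```python
from statistics import mean

# "total" wall-clock seconds from the three runs before and after the commit.
BEFORE = [16.526, 16.566, 16.511]
AFTER = [16.378, 16.315, 16.366]


def mean_time_saved_percent(before_runs, after_runs):
    """Percent reduction in mean wall-clock time between two sets of runs."""
    before, after = mean(before_runs), mean(after_runs)
    return 100.0 * (before - after) / before


print(f"mean before: {mean(BEFORE):.3f}s")
print(f"mean after:  {mean(AFTER):.3f}s")
print(f"saved:       {mean_time_saved_percent(BEFORE, AFTER):.2f}%")
```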

08 Oct, 2013 1 commit
- avcodec/x86/vp9dsp: Fix compilation with nasm. · ba9c557b
  Ronald S. Bultje authored 11 years ago
```
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
```
03 Oct, 2013 2 commits
- Full-pixel MC functions. · f1548c00
  Ronald S. Bultje authored 11 years ago
```
Decoding time of ped1080p.webm goes from 11.3sec to 11.1sec.
```
- VP9 MC (ssse3) optimizations. · c07ac8d4
  Ronald S. Bultje authored 11 years ago
```
Decoding time of ped1080p.webm goes from 20.7sec to 11.3sec.
```