- 26 Jul, 2016 1 commit
-
-
Ronald S. Bultje authored
About 1.8x speedup compared to AVX version for full IDCT. Other sub-IDCT scenarios also see speedups. Full --bench output for idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
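As a rough cross-check of the "about 1.8x" figure, the AVX and AVX2 timings from the bench output above can be divided directly. This is only a sketch: the table name `BENCH_32X32` and the helper `speedup` are made up for illustration, and the `nop` overhead is deliberately not subtracted.

```python
# Timings copied from the --bench output above (8 bpp, 32x32 IDCT).
# Keys are the sub-IDCT sizes; values are (avx, avx2) timings.
# BENCH_32X32 and speedup() are illustrative names, not part of FFmpeg.
BENCH_32X32 = {
    1: (137.1, 73.2),
    2: (958.5, 704.2),
    4: (1000.7, 717.1),
    8: (983.0, 729.4),
    16: (1276.7, 719.5),
    32: (2539.6, 1395.0),
}


def speedup(old: float, new: float) -> float:
    """Return how many times as fast `new` is compared to `old`."""
    return old / new


speedups = {sub: speedup(avx, avx2) for sub, (avx, avx2) in BENCH_32X32.items()}

for sub, ratio in speedups.items():
    print(f"sub-IDCT {sub:>2}: AVX2 is {ratio:.2f}x as fast as AVX")
```

The entry for sub-IDCT 32 is the full IDCT, which is the case the 1.8x claim refers to.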
-
- 11 Jul, 2016 1 commit
-
-
Ronald S. Bultje authored
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows that it's about 1.65x as fast as the AVX version for the full IDCT, and similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
-
- 13 Oct, 2015 4 commits
-
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- 12 Sep, 2015 2 commits
-
-
James Almer authored
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
-
Ronald S. Bultje authored
-
- 10 Sep, 2015 1 commit
-
-
Ronald S. Bultje authored
Also disable the mmx/iwht optimization when the bitexact flag is set. With synthetically coded coefficients (i.e. those that lead to a residual well outside the [-255,255] range), our optimizations will overflow. It doesn't make sense to fix the overflows, since they can only occur on synthetic input, not on real fwht-generated input. Thus, add a bitexact flag that disables this optimization.
-
- 06 Sep, 2015 1 commit
-
-
Ronald S. Bultje authored
-
- 05 Sep, 2015 1 commit
-
-
Ronald S. Bultje authored
-
- 14 May, 2015 2 commits
-
-
Ronald S. Bultje authored
For idct16, only when called from an adst16x16 variant, so impact is minor. For idct32, for all, so relatively major impact.
-
Ronald S. Bultje authored
They all overflow in various samples that are considered valid input.
-
- 24 Apr, 2015 1 commit
-
-
Ronald S. Bultje authored
See sample vp90-2-14-resize-fp-tiles-16-8.webm from the vp9 test vector set to reproduce the issue.
-
- 22 Apr, 2015 1 commit
-
-
Ronald S. Bultje authored
See sample vp90-2-14-resize-fp-tiles-16-8-4-2-1.webm from the vp9 test vector set, which reproduces the issue. This probably costs a few cycles, but I don't think there's an easy way to work around that.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 08 Feb, 2015 1 commit
-
-
James Almer authored
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 16 Dec, 2014 1 commit
-
-
Ronald S. Bultje authored
Fixes build on win32.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 14 Dec, 2014 1 commit
-
-
Ronald S. Bultje authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 06 Aug, 2014 1 commit
-
-
Christophe Gisquet authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 25 Jan, 2014 4 commits
-
-
Ronald S. Bultje authored
Before this patch, we explicitly modify rsp, which isn't necessarily universally acceptable, since the space under the stack pointer might be modified in things like signal handlers. Therefore, use an explicit register to hold the stack pointer relative to the bottom of the stack (i.e. rsp). This will also clear out valgrind errors about the use of uninitialized data that started occurring after the idct16x16/ssse3 optimizations were first merged.
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct: 66 -> 67 cycles (noise measurement)
idct_iadst: 199 -> 79 cycles
iadst_idct: 165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
-
Ronald S. Bultje authored
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct: 133 -> 135 cycles (noise measurement)
idct_iadst: 900 -> 241 cycles
iadst_idct: 864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
-
- 16 Jan, 2014 1 commit
-
-
Ronald S. Bultje authored
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct: 4672 -> 1175 cycles
idct_iadst: 4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
-
- 15 Jan, 2014 1 commit
-
-
Clément Bœsch authored
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
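Timer lines of this shape can be compared mechanically. The following is a minimal sketch, assuming a decicycle is a tenth of a cycle; `TIMER_LINE`, `parse_timers` and `SAMPLE` are hypothetical names introduced here, and the sample text is just two of the lines above.

```python
import re

# Matches lines like "23025 decicycles in NAME, 2096871 runs, 281 skips".
# TIMER_LINE and parse_timers() are illustrative, not an FFmpeg API.
TIMER_LINE = re.compile(
    r"(?P<decicycles>\d+) decicycles in (?P<name>\w+), "
    r"(?P<runs>\d+) runs, (?P<skips>\d+) skips"
)


def parse_timers(text: str) -> dict[str, float]:
    """Map function name to average cycles (assumes decicycles / 10)."""
    return {m["name"]: int(m["decicycles"]) / 10 for m in TIMER_LINE.finditer(text)}


SAMPLE = (
    "23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips "
    "19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips"
)

cycles = parse_timers(SAMPLE)
ratio = (
    cycles["ff_vp9_idct_idct_32x32_add_ssse3"]
    / cycles["ff_vp9_idct_idct_32x32_add_avx"]
)
print(f"AVX is {ratio:.2f}x as fast as SSSE3 for the 32x32 IDCT")
```

Because the pattern uses `finditer`, it works whether the measurements arrive one per line or run together on a single line.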
-
- 12 Jan, 2014 3 commits
-
-
Clément Bœsch authored
-
Clément Bœsch authored
-
Clément Bœsch authored
-
- 08 Jan, 2014 4 commits
-
-
Ronald S. Bultje authored
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309) redefining multi-line macro `STORE_2X2'
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
Ronald S. Bultje authored
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or from 1425 to 1306 cycles (inter). Overall runtime is not significantly affected.
-
Ronald S. Bultje authored
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e. ~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
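The percentages quoted here are consistent with reading "N% faster" as old/new - 1. A small sketch of that arithmetic, where the helper name `percent_faster` and the dictionary labels are made up for illustration and the definition itself is an assumption about how the figures were derived:

```python
def percent_faster(old: float, new: float) -> float:
    """Return the speed gain in percent, defined here as (old / new - 1) * 100."""
    return (old / new - 1.0) * 100.0


# (before, after) pairs taken from the commit message above.
measurements = {
    "all IDCTs, intra (cycles)": (3327, 2473),
    "all IDCTs, inter (cycles)": (2312, 1448),
    "ped1080p.webm decode (sec)": (8.086, 7.974),
}

for label, (old, new) in measurements.items():
    print(f"{label}: {percent_faster(old, new):.1f}% faster")
```

Under a time-saved definition, (old - new) / old, the same measurements would give smaller percentages, so it matters which convention a commit message uses.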
-
Ronald S. Bultje authored
Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s (13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter) to 403 (intra) or 329 (inter) cycles for the DC-only form, and from 23755 (intra) or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC form. Averaged over all 32x32s together, it goes from 23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter) cycles, i.e. about 7x faster (all tests done on ped1080p.webm).
-
- 26 Dec, 2013 1 commit
-
-
Ronald S. Bultje authored
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735 cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles (intra) to ~415 cycles (inter) or ~545 cycles (intra). All measurements done using ped1080p.webm.
-
- 14 Dec, 2013 1 commit
-
-
Ronald S. Bultje authored
Currently only dc-only and full 16x16. Other subforms will follow in the near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3 seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes from ~4050 to ~745 cycles.
-
- 07 Dec, 2013 4 commits
-
-
Ronald S. Bultje authored
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from 668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
-
Ronald S. Bultje authored
This allows us to load it only once, instead of twice, in this function.
-
Ronald S. Bultje authored
Make register usage in macros explicit; change mulsub_2w_4x to use 2 instead of 3 temp registers.
-
Ronald S. Bultje authored
(And in future, loopfilter or intra pred could be put in their own respective files also.)
-
- 21 Nov, 2013 1 commit
-
-
Clément Bœsch authored
-
- 15 Nov, 2013 1 commit
-
-
Ronald S. Bultje authored
Originally written by Ronald S. Bultje <rsbultje@gmail.com> and Clément Bœsch <u@pkh.me>
Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-