- 21 Jun, 2017 1 commit
-
-
Diego Biurrun authored
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4)
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 28 Mar, 2017 1 commit
-
-
Ronald S. Bultje authored
The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.
-
- 27 Mar, 2017 1 commit
-
-
Clément Bœsch authored
This follows the Libav layout to ease merges.
-
- 01 Mar, 2017 1 commit
-
-
Diego Biurrun authored
None of them are specific to the YASM assembler.
-
- 15 Nov, 2016 1 commit
-
-
Ronald S. Bultje authored
Also a small cosmetic change to the avx2 idct16 version to make it explicit that one of the arguments to the write-out macros is unused for >=avx2 (it uses pmovzxbw instead of punpcklbw).
-
- 05 Nov, 2016 1 commit
-
-
Diego Biurrun authored
libavcodec/x86/rv40dsp_init.c:97:2: warning: ISO C does not allow extra ‘;’ outside of a function [-Wpedantic]
libavcodec/x86/vp9dsp_init.c:94:40: warning: ISO C does not allow extra ‘;’ outside of a function [-Wpedantic]
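Warnings of this kind typically appear when a macro expands to a complete function definition and its invocation is followed by a semicolon. The following is a minimal sketch of that pattern and its fix. The macro `DEFINE_ADD_FN` and the functions it generates are hypothetical illustrations and are not taken from the files named in the warnings.

```c
/* Hypothetical macro that expands to a full function definition. */
#define DEFINE_ADD_FN(name, k)  \
    static int name(int x)      \
    {                           \
        return x + (k);         \
    }

/* Writing "DEFINE_ADD_FN(add_one, 1);" would leave a stray ';' at file
 * scope, which is what -Wpedantic reports. Invoking the macro without
 * the trailing semicolon avoids the warning. */
DEFINE_ADD_FN(add_one, 1)
DEFINE_ADD_FN(add_two, 2)
```

Calling `add_one(1)` on the generated function returns 2; the only change needed to silence the warning is at the invocation site, not in the macro body.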
-
- 03 Nov, 2016 1 commit
-
-
Martin Storsjö authored
This makes it match the pattern already used for VP8 MC functions. This also makes the signature match ffmpeg's version of these functions, easing porting of code in both directions.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 04 Oct, 2016 13 commits
-
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Clément Bœsch authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Clément Bœsch authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
James Almer authored
Similar gains as the SSSE3 version once again. Additional improvements by Clément Bœsch <u@pkh.me>.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Clément Bœsch authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
James Almer authored
Similar gains in performance as the SSSE3 version.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Clément Bœsch authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 03 Aug, 2016 5 commits
-
-
Ronald S. Bultje authored
Also a slight change to the SSSE3 code, which prevents a theoretical overflow in the sharp filter.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
James Almer authored
Roughly 25% faster MC than SSSE3 for block sizes 32 and 64.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Clément Bœsch authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
James Almer authored
pavgb is an SSE integer instruction, so the mmxext flag is enough.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
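For readers unfamiliar with the instruction, pavgb computes a rounded average of packed unsigned bytes. Its MMX-register form arrived with the SSE integer extensions, which is why a CPU reporting MMXEXT support is sufficient. A scalar sketch of what one lane does follows. The function names `avg_u8` and `avg_row` are hypothetical and only model the arithmetic; they are not the assembly touched by this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one pavgb lane: rounded average of two unsigned bytes. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: average n source bytes into dst, one at a time.
 * A SIMD version would process 8 or more bytes per pavgb instruction. */
static void avg_row(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = avg_u8(dst[i], src[i]);
}
```

The `+ 1` before the shift is what makes the average round up on ties, so averaging 0 and 1 gives 1 rather than 0.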
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 26 Jul, 2016 3 commits
-
-
Ronald S. Bultje authored
Each takes about 0.1% of runtime in my profiles, and they had no SIMD so far (we only had SIMD for the npx=16 double-block versions).
-
Ronald S. Bultje authored
Each takes about 0.5% of runtime in my profiles, and they had no SIMD so far (we only had SIMD for the npx=16 double-block versions).
-
Ronald S. Bultje authored
About 1.8x speedup compared to AVX version for full IDCT. Other sub-IDCT scenarios also see speedups.
Full --bench output for idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
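The quoted speedup can be checked from the cycle counts above by dividing the slower figure by the faster one. A small sketch, assuming the raw counts are compared directly (the nop overhead is not subtracted); the helper name `speedup` is hypothetical:

```c
/* How many times faster the second measurement is than the first,
 * given two cycle counts for the same function. */
static double speedup(double slow_cycles, double fast_cycles)
{
    return slow_cycles / fast_cycles;
}
```

For the full 32x32 IDCT, `speedup(2539.6, 1395.0)` is roughly 1.82 (AVX versus AVX2), consistent with the "about 1.8x" stated in the message, and `speedup(14363.6, 1395.0)` is roughly 10.3 (C versus AVX2).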
-
- 11 Jul, 2016 1 commit
-
-
Ronald S. Bultje authored
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows that it's about 1.65x as fast as the AVX version for the full IDCT, and similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
-
- 28 May, 2016 1 commit
-
-
Diego Biurrun authored
-
- 14 Feb, 2016 1 commit
-
-
James Almer authored
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 24 Oct, 2015 1 commit
-
-
Ganesh Ajjanagadde authored
This fixes extra semicolons that clang 3.7 on GNU/Linux warns about. These were triggered when building with -Wpedantic, which essentially checks for strict ISO compliance in numerous ways.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
-
- 13 Oct, 2015 1 commit
-
-
Ronald S. Bultje authored
-
- 17 Sep, 2015 3 commits
-
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- 10 Sep, 2015 1 commit
-
-
Ronald S. Bultje authored
Also disable the mmx/iwht optimization when the bitexact flag is set. With synthetically coded coefficients (i.e. those that lead to a residual well outside the [-255,255] range), our optimizations will overflow. It doesn't make sense to fix the overflows, since they can only occur on synthetic input, not on real fwht-generated input. Thus, add a bitexact flag that disables this optimization.
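The gating described here can be pictured as an init routine that installs the optimized function pointer only when the CPU supports it and bit-exact output was not requested. The sketch below is an illustration under stated assumptions: every name in it (`HypDSPContext`, `hyp_dsp_init`, `HYP_CPU_FLAG_MMX`, `iwht_ref`, `iwht_fast`) is hypothetical, and the transform bodies are placeholders rather than real inverse WHT code.

```c
#include <stdint.h>

typedef void (*itxfm_fn)(int16_t *block);

/* Placeholder bodies standing in for a reference and an optimized
 * transform; they only mark which version ran. */
static void iwht_ref(int16_t *block)  { block[0] = 0; }
static void iwht_fast(int16_t *block) { block[0] = 1; }

#define HYP_CPU_FLAG_MMX 0x1 /* hypothetical CPU capability bit */

typedef struct HypDSPContext {
    itxfm_fn iwht;
} HypDSPContext;

/* Start from the reference version, then upgrade to the optimized one
 * only if the CPU supports it and bit-exact output is not required. */
static void hyp_dsp_init(HypDSPContext *dsp, int cpu_flags, int bitexact)
{
    dsp->iwht = iwht_ref;
    if ((cpu_flags & HYP_CPU_FLAG_MMX) && !bitexact)
        dsp->iwht = iwht_fast;
}
```

With this shape, callers that need identical results on synthetic input pass a non-zero `bitexact` and get the reference path, while ordinary decoding keeps the faster one.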
-
- 31 May, 2015 1 commit
-
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 07 May, 2015 1 commit
-
-
Michael Niedermayer authored
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 06 May, 2015 1 commit
-
-
Ronald S. Bultje authored
-