1. 21 Jun, 2017 1 commit
  2. 28 Mar, 2017 1 commit
  3. 27 Mar, 2017 1 commit
  4. 01 Mar, 2017 1 commit
  5. 15 Nov, 2016 1 commit
    • vp9: add avx2 iadst16 implementations. · 83a139e3
      Ronald S. Bultje authored
      Also a small cosmetic change to the avx2 idct16 version to make it
      explicit that one of the arguments to the write-out macros is unused
      for >=avx2 (it uses pmovzxbw instead of punpcklbw).
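For context, both instructions can widen unsigned bytes to 16-bit words. `punpcklbw` needs a second register holding zeros to do so, while `pmovzxbw` does not, which is presumably why that macro argument becomes unused. Below is a minimal Python model of the 128-bit forms; the function names mirror the instruction mnemonics but are illustrative only and are not FFmpeg code.

```python
def punpcklbw(dst, src):
    """Model of punpcklbw on 128-bit registers given as 16-byte lists:
    interleave the low 8 bytes of dst and src (dst byte first)."""
    out = []
    for d, s in zip(dst[:8], src[:8]):
        out += [d, s]
    return out

def pmovzxbw(src):
    """Model of pmovzxbw: zero-extend the low 8 bytes to eight 16-bit words."""
    return [b & 0xFF for b in src[:8]]

def bytes_to_words(reg):
    """Read a 16-byte little-endian register as eight 16-bit words."""
    return [reg[i] | (reg[i + 1] << 8) for i in range(0, 16, 2)]

pixels = [0, 1, 127, 128, 200, 254, 255, 42] + [0] * 8
zero = [0] * 16  # the extra register argument that punpcklbw needs

# Interleaving with zeros gives the same words as a direct zero-extension.
assert bytes_to_words(punpcklbw(pixels, zero)) == pmovzxbw(pixels)
```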
  6. 05 Nov, 2016 1 commit
    • x86: Drop stray semicolons after function definitions · 3cba09e5
      Diego Biurrun authored
      libavcodec/x86/rv40dsp_init.c:97:2: warning: ISO C does not allow extra ‘;’ outside of a function [-Wpedantic]
      libavcodec/x86/vp9dsp_init.c:94:40: warning: ISO C does not allow extra ‘;’ outside of a function [-Wpedantic]
  7. 03 Nov, 2016 1 commit
  8. 04 Oct, 2016 13 commits
  9. 03 Aug, 2016 5 commits
  10. 26 Jul, 2016 3 commits
    • vp9: add mmxext versions of the single-block (w=8,npx=8) h/v loopfilters. · a4edaa02
      Ronald S. Bultje authored
      Each takes about 0.1% of runtime in my profiles, and they didn't have
      any SIMD so far (we only had SIMD for the npx=16 double-block versions).
    • vp9: add mmxext versions of the single-block (w=4,npx=8) h/v loopfilters. · 7ca422bb
      Ronald S. Bultje authored
      Each takes about 0.5% of runtime in my profiles, and they didn't have
      any SIMD so far (we only had SIMD for the npx=16 double-block versions).
    • vp9: add 32x32 idct AVX2 implementation. · 726501a3
      Ronald S. Bultje authored
      About 1.8x speedup compared to AVX version for full IDCT. Other
      sub-IDCT scenarios also see speedups. Full --bench output for
      idct_32x32_add_${bpp}_${subidct}_${opt} (50k cycles):
      
      nop: 16.5
      vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
      vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
      vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
      vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
      vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
      vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
      vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
      vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
      vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
      vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
      vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
      vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
      vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
      vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
      vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
      vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
      vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
      vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
      vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
      vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
      vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
      vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
      vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
      vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
      vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
      vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
      vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
      vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
      vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
      vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
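The "about 1.8x" figure can be checked against the full-IDCT rows of the listing above. The following is a small sketch; `parse_bench` and `speedup` are hypothetical helper names, and it assumes the reported values are directly comparable timings where lower is faster.

```python
def parse_bench(text):
    """Parse 'name: value' lines from --bench output into a dict of floats."""
    results = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            results[name.strip()] = float(value)
    return results

def speedup(results, prefix, baseline, candidate):
    """How many times faster `candidate` is than `baseline` (ratio of timings)."""
    return results[f"{prefix}_{baseline}"] / results[f"{prefix}_{candidate}"]

# Full 32x32 IDCT rows (sub-IDCT = 32), copied from the listing above.
bench = parse_bench("""
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
""")

full = "vp9_inv_dct_dct_32x32_add_8_32"
print(round(speedup(bench, full, "avx", "avx2"), 2))
```

With these rows the AVX-to-AVX2 ratio comes out at roughly 1.82, consistent with the commit message.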
  11. 11 Jul, 2016 1 commit
    • vp9: add 16x16 idct avx2 (8-bit). · f0a2b624
      Ronald S. Bultje authored
      checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
      that it's about 1.65x as fast as the AVX version for the full IDCT, and
      similar speedups for the sub-IDCTs:
      
      nop: 24.6
      vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
      vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
      vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
      vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
      vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
      vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
      vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
      vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
      vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
      vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
      vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
      vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
      vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
      vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
      vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
      vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
      vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
      vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
      vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
      vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
      vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
      vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
      vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
      vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
      vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
  12. 28 May, 2016 1 commit
  13. 14 Feb, 2016 1 commit
  14. 24 Oct, 2015 1 commit
  15. 13 Oct, 2015 1 commit
  16. 17 Sep, 2015 3 commits
  17. 10 Sep, 2015 1 commit
    • vp9: fix overflow in 8x8 topleft 32x32 idct ssse3 version. · fd8b90f5
      Ronald S. Bultje authored
      Also disable the mmx/iwht optimization when the bitexact flag is set.
      With synthetically coded coefficients (i.e. those that lead to a
      residual well outside the [-255,255] range), our optimizations will
      overflow. It doesn't make sense to fix the overflows, since they can
      only occur on synthetic input, not on real fwht-generated input. Thus,
      add a bitexact flag that disables this optimization.
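To illustrate the kind of overflow described here, the sketch below assumes the fast path keeps intermediates in signed 16-bit lanes. That width is an assumption for illustration; the commit message does not state it. All function names are hypothetical, and the butterfly is a generic add/subtract step rather than the actual iwht code.

```python
def wrap_int16(x):
    """Wrap an integer to signed 16-bit, as a non-saturating SIMD word add would."""
    return ((x + 0x8000) & 0xFFFF) - 0x8000

def butterfly_exact(a, b):
    """Reference butterfly with unbounded integers."""
    return a + b, a - b

def butterfly_int16(a, b):
    """Same butterfly with results wrapped to 16 bits (hypothetical fast path)."""
    return wrap_int16(a + b), wrap_int16(a - b)

def pick_butterfly(bitexact):
    """Hypothetical dispatch: fall back to the exact path when bitexact is set."""
    return butterfly_exact if bitexact else butterfly_int16

# In-range input: both paths agree.
assert butterfly_int16(1000, -2000) == butterfly_exact(1000, -2000)
# Synthetic, out-of-range input: the 16-bit path wraps around.
assert butterfly_int16(30000, 30000) == (-5536, 0)
# With the flag set, the exact path is chosen and the result is correct.
assert pick_butterfly(True)(30000, 30000) == (60000, 0)
```

The design choice mirrored here is that the fast path is left untouched for real input, and the flag only selects the slower exact path when bit-exact output is required.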
  18. 31 May, 2015 1 commit
  19. 07 May, 2015 1 commit
  20. 06 May, 2015 1 commit