1. 26 Jul, 2016 1 commit
    • vp9: add 32x32 idct AVX2 implementation. · 726501a3
      Ronald S. Bultje authored
      About 1.8x speedup compared to AVX version for full IDCT. Other
      sub-IDCT scenarios also see speedups. Full --bench output for
      idct_32x32_add_${bpp}_${subidct}_${opt} (50k cycles):
      
      nop: 16.5
      vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
      vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
      vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
      vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
      vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
      vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
      vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
      vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
      vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
      vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
      vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
      vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
      vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
      vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
      vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
      vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
      vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
      vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
      vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
      vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
      vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
      vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
      vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
      vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
      vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
      vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
      vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
      vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
      vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
      vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
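The "about 1.8x" figure can be checked against the listing above. The following is a minimal sketch, not part of the commit: it copies the AVX and AVX2 cycle counts from the output, ignores the `nop` baseline, and uses hypothetical helper names (`speedup`, `avx2_speedups`) chosen for illustration.

```python
# Cycle counts copied from the --bench output above (8-bit 32x32 IDCT).
# Keys are the sub-IDCT sizes; values are (avx, avx2) cycle counts.
BENCH_32X32 = {
    1: (137.1, 73.2),
    2: (958.5, 704.2),
    4: (1000.7, 717.1),
    8: (983.0, 729.4),
    16: (1276.7, 719.5),
    32: (2539.6, 1395.0),
}


def speedup(old_cycles, new_cycles):
    """Return how many times faster the new version is (old / new)."""
    return old_cycles / new_cycles


def avx2_speedups(bench):
    """Map each sub-IDCT size to its AVX -> AVX2 speedup ratio."""
    return {size: speedup(avx, avx2) for size, (avx, avx2) in bench.items()}


if __name__ == "__main__":
    for size, ratio in avx2_speedups(BENCH_32X32).items():
        print(f"sub-IDCT {size:>2}: {ratio:.2f}x")
```

The ratio for the full IDCT (size 32) is the one the commit message quotes; the smaller sub-IDCTs gain less in this listing.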
  2. 11 Jul, 2016 1 commit
    • vp9: add 16x16 idct avx2 (8-bit). · f0a2b624
      Ronald S. Bultje authored
      checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
      that it's about 1.65x as fast as the AVX version for the full IDCT, and
      similar speedups for the sub-IDCTs:
      
      nop: 24.6
      vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
      vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
      vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
      vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
      vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
      vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
      vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
      vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
      vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
      vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
      vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
      vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
      vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
      vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
      vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
      vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
      vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
      vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
      vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
      vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
      vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
      vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
      vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
      vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
      vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
  3. 13 Oct, 2015 4 commits
  4. 12 Sep, 2015 2 commits
  5. 10 Sep, 2015 1 commit
    • vp9: fix overflow in 8x8 topleft 32x32 idct ssse3 version. · fd8b90f5
      Ronald S. Bultje authored
      Also disable the mmx/iwht optimization when the bitexact flag is set.
      With synthetically coded coefficients (i.e. those that lead to a
      residual well outside the [-255,255] range), our optimizations will
      overflow. It doesn't make sense to fix the overflows, since they can
      only occur on synthetic input, not on real fwht-generated input. Thus,
      add a bitexact flag that disables this optimization.
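To illustrate the kind of overflow described above, here is a small sketch that assumes the optimized code keeps intermediates in signed 16-bit lanes with wraparound arithmetic. That lane width is an assumption made for illustration, not something the commit message states, and `wrap_int16` and `lane_add` are hypothetical helpers, not FFmpeg functions.

```python
def wrap_int16(value):
    """Wrap an integer to signed 16 bits, as a non-saturating SIMD add would."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def lane_add(a, b):
    """Add two values inside an assumed 16-bit lane (hypothetical helper)."""
    return wrap_int16(a + b)


if __name__ == "__main__":
    # Values of the size a real residual produces stay exact.
    print(lane_add(255, 255))      # same as 255 + 255
    # Synthetic, out-of-range values no longer match the exact sum.
    print(lane_add(30000, 10000))  # differs from 30000 + 10000
```

With inputs in the range real encoders produce, the 16-bit result equals the exact sum, which is why the commit only disables the optimization when bit-exact output is requested.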
  6. 06 Sep, 2015 1 commit
  7. 05 Sep, 2015 1 commit
  8. 14 May, 2015 2 commits
  9. 24 Apr, 2015 1 commit
  10. 22 Apr, 2015 1 commit
  11. 08 Feb, 2015 1 commit
  12. 16 Dec, 2014 1 commit
  13. 14 Dec, 2014 1 commit
  14. 06 Aug, 2014 1 commit
  15. 25 Jan, 2014 4 commits
    • vp9/x86: use explicit register for relative stack references. · c9e6325e
      Ronald S. Bultje authored
      Before this patch, we explicitly modify rsp, which isn't necessarily
      universally acceptable, since the space under the stack pointer might
      be modified in things like signal handlers. Therefore, use an explicit
      register to hold the stack pointer relative to the bottom of the stack
      (i.e. rsp). This will also clear out valgrind errors about the use of
      uninitialized data that started occurring after the idct16x16/ssse3
      optimizations were first merged.
    • vp9/x86: iwht4x4 (lossless) mmx. · 97474d52
      Ronald S. Bultje authored
    • vp9/x86: 4x4 iadst SIMD (ssse3) variants. · d43efa68
      Ronald S. Bultje authored
      Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
      idct_idct:    66 -> 67 cycles (noise measurement)
      idct_iadst:  199 -> 79 cycles
      iadst_idct:  165 -> 70 cycles
      iadst_iadst: 183 -> 82 cycles
    • vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. · baf47020
      Ronald S. Bultje authored
      Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
      idct_idct:   133 -> 135 cycles (noise measurement)
      idct_iadst:  900 -> 241 cycles
      iadst_idct:  864 -> 215 cycles
      iadst_iadst: 973 -> 310 cycles
  16. 16 Jan, 2014 1 commit
  17. 15 Jan, 2014 1 commit
    • vp9/x86: add AVX for itxfm and lpf. · 8b4190da
      Clément Bœsch authored
      4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
      3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
      
      3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
      2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
      
      23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
      19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
      
      4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
      3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
      
      967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
      887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
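The timer lines above report decicycles (tenths of a cycle). As a rough sketch, the SSSE3 to AVX gain can be extracted from such lines as follows. The regular expression and the helper names `parse_timers` and `avx_gain` are assumptions made for this example, and only four of the lines are reproduced.

```python
import re

# Matches lines such as:
# "4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips"
TIMER_LINE = re.compile(r"(\d+) decicycles in (\w+), (\d+) runs, (\d+) skips")

SAMPLE = """\
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
"""


def parse_timers(text):
    """Return {function_name: cycles}, converting decicycles to cycles."""
    return {m.group(2): int(m.group(1)) / 10 for m in TIMER_LINE.finditer(text)}


def avx_gain(timers, base):
    """Speedup of `<base>_avx` over `<base>_ssse3`, computed as old / new."""
    return timers[base + "_ssse3"] / timers[base + "_avx"]


if __name__ == "__main__":
    timers = parse_timers(SAMPLE)
    for base in ("ff_vp9_loop_filter_h_16_16", "ff_vp9_idct_idct_32x32_add"):
        print(f"{base}: {avx_gain(timers, base):.2f}x")
```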
  18. 12 Jan, 2014 3 commits
  19. 08 Jan, 2014 4 commits
    • vp9/x86: make STORE_2X2 macro local. · c6fe984f
      Ronald S. Bultje authored
      Prevents this assembler warning:
      libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
      redefining multi-line macro `STORE_2X2'
      Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
    • vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct. · 04a187fb
      Ronald S. Bultje authored
      Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
      from 1425 to 1306 cycles (inter). Overall runtime is not significantly
      affected.
    • vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct. · 37b001d1
      Ronald S. Bultje authored
      Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
      ~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
      decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
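The "% faster" figures above appear to follow the convention old / new - 1 (a throughput gain), not the fraction of time saved. A short sketch of both conventions, using the numbers from the message and hypothetical helper names:

```python
def percent_faster(old, new):
    """Speed gain in percent: (old / new - 1) * 100."""
    return (old / new - 1.0) * 100.0


def percent_time_saved(old, new):
    """Reduction in runtime in percent: (1 - new / old) * 100."""
    return (1.0 - new / old) * 100.0


if __name__ == "__main__":
    cases = {
        "intra cycles": (3327, 2473),
        "inter cycles": (2312, 1448),
        "decode time (s)": (8.086, 7.974),
    }
    for label, (old, new) in cases.items():
        print(f"{label}: {percent_faster(old, new):.1f}% faster, "
              f"{percent_time_saved(old, new):.1f}% less time")
```

Under the first convention the three cases round to the quoted ~35%, ~60% and 1.4%; the second convention gives noticeably smaller numbers for the same measurements.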
    • vp9/x86: idct_32x32_add_ssse3. · e84d14df
      Ronald S. Bultje authored
      Sub-IDCTs will follow later. Decoding ped1080p.webm goes from 9.295s
      to 8.191s (13.5% faster). The IDCT itself goes from 4372 (intra) or
      4337 (inter) to 403 (intra) or 329 (inter) cycles for the DC-only form,
      and from 23755 (intra) or 23723 (inter) to 3497 (intra) or 3607 (inter)
      cycles for the no-DC form. Averaged over all 32x32s together, that is
      from 23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter)
      cycles, i.e. about 7x faster (all tests done on ped1080p.webm).
  20. 26 Dec, 2013 1 commit
    • vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38). · 0d9375fc
      Ronald S. Bultje authored
      Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
      cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
      overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
      (intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
      using ped1080p.webm.
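The title's `eob <= 38` condition suggests a dispatch on the end-of-block position. A minimal sketch of such a dispatch follows; the `eob == 1` test for the DC-only case and the returned labels are assumptions for illustration, and the real selection happens in the assembly, not in a function like this.

```python
def pick_idct16x16_variant(eob):
    """Choose a 16x16 inverse-transform path from the end-of-block index.

    eob counts the coefficients up to and including the last non-zero one
    in scan order. The threshold 38 comes from the commit title; treating
    eob == 1 as the DC-only case is an assumption of this sketch.
    """
    if eob < 1:
        raise ValueError("eob must be at least 1 for a coded block")
    if eob == 1:
        return "dc_only"
    if eob <= 38:
        return "sub8x8"  # only the top-left 8x8 coefficients can be non-zero
    return "full"


if __name__ == "__main__":
    for eob in (1, 2, 38, 39, 256):
        print(eob, pick_idct16x16_variant(eob))
```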
  21. 14 Dec, 2013 1 commit
    • vp9/x86: idct_add_16x16_ssse3. · 8d4c616f
      Ronald S. Bultje authored
      Currently only dc-only and full 16x16. Other subforms will follow in the
      near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
      seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
      from ~4050 to ~745 cycles.
  22. 07 Dec, 2013 4 commits
  23. 21 Nov, 2013 1 commit
  24. 15 Nov, 2013 1 commit