1. 06 Aug, 2014 1 commit
  2. 25 Jan, 2014 4 commits
      vp9/x86: use explicit register for relative stack references. · c9e6325e
      Ronald S. Bultje authored
      Before this patch, we explicitly modify rsp, which isn't necessarily
      universally acceptable, since the space under the stack pointer might
      be modified in things like signal handlers. Therefore, use an explicit
      register to hold the stack pointer relative to the bottom of the stack
      (i.e. rsp). This will also clear out valgrind errors about the use of
      uninitialized data that started occurring after the idct16x16/ssse3
      optimizations were first merged.
      vp9/x86: iwht4x4 (lossless) mmx. · 97474d52
      Ronald S. Bultje authored
      vp9/x86: 4x4 iadst SIMD (ssse3) variants. · d43efa68
      Ronald S. Bultje authored
      Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
      idct_idct:    66 -> 67 cycles (noise measurement)
      idct_iadst:  199 -> 79 cycles
      iadst_idct:  165 -> 70 cycles
      iadst_iadst: 183 -> 82 cycles
      vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. · baf47020
      Ronald S. Bultje authored
      Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
      idct_idct:   133 -> 135 cycles (noise measurement)
      idct_iadst:  900 -> 241 cycles
      iadst_idct:  864 -> 215 cycles
      iadst_iadst: 973 -> 310 cycles
  3. 16 Jan, 2014 1 commit
  4. 15 Jan, 2014 1 commit
      vp9/x86: add AVX for itxfm and lpf. · 8b4190da
      Clément Bœsch authored
      4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
      3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
      
      3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
      2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
      
      23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
      19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
      
      4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
      3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
      
      967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
      887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
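The timer lines quoted in 8b4190da all have the shape `<decicycles> decicycles in <function>, <runs> runs, <skips> skips`, so the SSSE3-to-AVX ratios can be derived mechanically. The following is a minimal Python sketch, not part of the commit; the helper names `parse_timer_lines` and `avx_speedups` are hypothetical, and the input is copied from the message above.

```python
import re

# One line of timer output as quoted in the commit message:
# "<decicycles> decicycles in <function>, <runs> runs, <skips> skips"
TIMER_LINE = re.compile(r"(\d+) decicycles in (\w+), (\d+) runs, (\d+) skips")

def parse_timer_lines(text):
    """Return {function_name: decicycles} for every matching line in text."""
    return {m.group(2): int(m.group(1)) for m in TIMER_LINE.finditer(text)}

def avx_speedups(results):
    """Map each function stem to ssse3_decicycles / avx_decicycles."""
    ratios = {}
    for name, cycles in results.items():
        if name.endswith("_ssse3"):
            stem = name[: -len("_ssse3")]
            avx_cycles = results.get(stem + "_avx")
            if avx_cycles:
                ratios[stem] = cycles / avx_cycles
    return ratios

# Measurements copied verbatim from commit 8b4190da.
LOG = """
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
"""

ratios = avx_speedups(parse_timer_lines(LOG))
for stem, ratio in ratios.items():
    print(f"{stem}: {ratio:.2f}x")
```

On these numbers, the AVX gain over SSSE3 is in the range of roughly 1.1x to 1.2x per function.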
  5. 12 Jan, 2014 3 commits
  6. 08 Jan, 2014 4 commits
      vp9/x86: make STORE_2X2 macro local. · c6fe984f
      Ronald S. Bultje authored
      Prevents this assembler warning:
      libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
      redefining multi-line macro `STORE_2X2'
      Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
      vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct. · 04a187fb
      Ronald S. Bultje authored
      Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
      from 1425 to 1306 cycles (inter). Overall runtime is not significantly
      affected.
      vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct. · 37b001d1
      Ronald S. Bultje authored
      Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
      ~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
      decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
      vp9/x86: idct_32x32_add_ssse3. · e84d14df
      Ronald S. Bultje authored
      Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s
      (13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
      to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
      or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
      form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
      or 2392 (inter) for all 32x32s together, i.e. ~7x faster (all
      tests done on ped1080p.webm).
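The "% faster" figures quoted in 37b001d1 and e84d14df appear consistent with the convention old/new - 1, rather than the fraction of time saved, (old - new)/old. The following Python sketch checks that reading against the numbers in those two messages; the function names are hypothetical, and the convention itself is an inference from the figures, not something the commits state.

```python
def percent_faster(old, new):
    """Speedup expressed as old/new - 1, in percent."""
    return (old / new - 1.0) * 100.0

def percent_time_saved(old, new):
    """Fraction of the original time that was removed, in percent."""
    return (old - new) / old * 100.0

# (label, before, after) taken from commits 37b001d1 and e84d14df.
MEASUREMENTS = [
    ("all IDCTs, intra (cycles)", 3327, 2473),
    ("all IDCTs, inter (cycles)", 2312, 1448),
    ("decode time, 37b001d1 (s)", 8.086, 7.974),
    ("decode time, e84d14df (s)", 9.295, 8.191),
]

for label, old, new in MEASUREMENTS:
    print(f"{label}: {percent_faster(old, new):.1f}% faster, "
          f"{percent_time_saved(old, new):.1f}% time saved")
```

The two conventions diverge noticeably for large gains (the intra IDCT case) and nearly coincide for small ones (the overall decode time), which is worth keeping in mind when comparing figures across commits.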
  7. 26 Dec, 2013 1 commit
      vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38). · 0d9375fc
      Ronald S. Bultje authored
      Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
      cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
      overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
      (intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
      using ped1080p.webm.
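The title of 0d9375fc implies that the 16x16 inverse transform is chosen by the end-of-block position (eob): when eob <= 38, only the top-left 8x8 subblock is treated as holding nonzero coefficients, so a cheaper sub-IDCT suffices. A minimal Python sketch of that selection follows. The function name and the returned labels are hypothetical, and treating eob == 1 as the DC-only case is an assumption not stated in the commit message; the real dispatch lives in the x86 assembly.

```python
def pick_idct16x16_variant(eob):
    """Pick a 16x16 IDCT variant from the end-of-block position.

    Hypothetical helper for illustration only; thresholds other than
    38 (taken from the commit title) are assumptions.
    """
    if eob == 1:
        # Assumed: a single coefficient means only the DC term is set.
        return "dc_only"
    if eob <= 38:
        # From the commit title: top-left 8x8 subblock only.
        return "sub8x8"
    # Anything else falls back to the full 16x16 transform.
    return "full"

for eob in (1, 10, 38, 39, 256):
    print(eob, pick_idct16x16_variant(eob))
```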
  8. 14 Dec, 2013 1 commit
      vp9/x86: idct_add_16x16_ssse3. · 8d4c616f
      Ronald S. Bultje authored
      Currently only dc-only and full 16x16. Other subforms will follow in the
      near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
      seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
      from ~4050 to ~745 cycles.
  9. 07 Dec, 2013 4 commits
  10. 21 Nov, 2013 1 commit
  11. 15 Nov, 2013 1 commit
  12. 05 Nov, 2013 1 commit
      avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3(). · 87434cf3
      Clément Bœsch authored
      1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
      1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
      1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips
      
      529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
      516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
      474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips
      
      (~3.9x faster)
      
      7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
      7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
      7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips
      
      1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
      1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
      1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips
      
      (~7.1x faster)
      
      Overall decode time before commit:
        16.48s user 0.03s system 99% cpu 16.526 total
        16.54s user 0.01s system 99% cpu 16.566 total
        16.46s user 0.03s system 99% cpu 16.511 total
      
      Overall decode time after commit:
        16.34s user 0.02s system 99% cpu 16.378 total
        16.28s user 0.02s system 99% cpu 16.315 total
        16.32s user 0.03s system 99% cpu 16.366 total
      
      Tested on i7 920 with 40s 1080p footage.
  13. 08 Oct, 2013 1 commit
  14. 03 Oct, 2013 2 commits