1. 04 May, 2016 1 commit
  2. 01 Jul, 2014 1 commit
  3. 13 Mar, 2014 1 commit
  4. 04 Nov, 2013 1 commit
  5. 30 Aug, 2013 1 commit
  6. 13 Nov, 2012 1 commit
  7. 30 Oct, 2012 2 commits
  8. 07 Aug, 2012 1 commit
  9. 15 May, 2012 1 commit
  10. 10 May, 2012 1 commit
      rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC · 110d0cdc
      Christophe Gisquet authored
      Code mostly inspired by vp8's MC, however:
      - its MMX2 horizontal filter is worse because it can't take advantage of
        the coefficient redundancy
      - that same coefficient redundancy allows better code for non-SSSE3 versions
      
      Benchmark (rounded to tens of units):
              V8x8  H8x8  2D8x8  V16x16  H16x16  2D16x16
      C        445   358    985    1785    1559     3280
      MMX*     219   271    478     714     929     1443
      SSE2     131   158    294     425     515      892
      SSSE3    120   122    248     387     390      763
      
      End result is overall around a 15% speedup for SSSE3 version (on 6 sequences);
      all loop filter functions now take around 55% of decoding time, while luma MC
      dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%.
      Signed-off-by: Diego Biurrun <diego@biurrun.de>
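
The "coefficient redundancy" mentioned in the commit message above can be illustrated with a scalar sketch. It assumes a 6-tap filter whose outer taps are always (1, -5, ..., -5, 1), so that only the two centre coefficients and the shift change between subpixel positions. The function name `mc_h_lowpass_row`, its parameters, and the example coefficient values are hypothetical and are not taken from the commit; the actual SIMD code is not reproduced here.

```c
#include <stdint.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical scalar 6-tap horizontal filter for one row.
 * Assumed taps: (1, -5, c1, c2, -5, 1); only c1, c2 and shift vary.
 * src needs 2 readable pixels before and 3 after the filtered span.
 */
static void mc_h_lowpass_row(uint8_t *dst, const uint8_t *src, int width,
                             int c1, int c2, int shift)
{
    const int rnd = 1 << (shift - 1);           /* rounding term */

    for (int x = 0; x < width; x++) {
        /* Symmetric taps are added first, so each pair costs one multiply. */
        int outer = src[x - 2] + src[x + 3];    /* both weighted by +1 */
        int inner = src[x - 1] + src[x + 2];    /* both weighted by -5 */
        int sum   = outer - 5 * inner + c1 * src[x] + c2 * src[x + 1];

        dst[x] = clip_u8((sum + rnd) >> shift);
    }
}
```

Because the outer taps never change, a vectorised version only has to load different constants for `c1` and `c2`; the pairing of symmetric pixels stays the same for every position. With the assumed values `c1 = 52, c2 = 20, shift = 6`, the taps sum to 64, so a flat input passes through unchanged.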
  11. 10 Apr, 2012 2 commits
  12. 03 Feb, 2012 1 commit
  13. 30 Jan, 2012 1 commit
      rv40: x86 SIMD for biweight · e5c9de2a
      Christophe Gisquet authored
      Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
      multiples of 512 (which is often the case when the values round up nicely).
      
      *_TIMER report for the 16x16 and 8x8 cases:
      C:
      9015 decicycles in 16, 524257 runs, 31 skips
      2656 decicycles in 8, 524271 runs, 17 skips
      MMX:
      4156 decicycles in 16, 262090 runs, 54 skips
      1206 decicycles in 8, 262131 runs, 13 skips
      MMX on fast-path:
      2760 decicycles in 16, 524222 runs, 66 skips
      995 decicycles in 8, 524252 runs, 36 skips
      SSE2:
      2163 decicycles in 16, 262131 runs, 13 skips
      832 decicycles in 8, 262137 runs, 7 skips
      SSE2 with fast path:
      1783 decicycles in 16, 524276 runs, 12 skips
      711 decicycles in 8, 524283 runs, 5 skips
      SSSE3:
      2117 decicycles in 16, 262136 runs, 8 skips
      814 decicycles in 8, 262143 runs, 1 skips
      SSSE3 with fast path:
      1315 decicycles in 16, 524285 runs, 3 skips
      578 decicycles in 8, 524286 runs, 2 skips
      
      This means around a 4% speedup for some sequences.
      Signed-off-by: Diego Biurrun <diego@biurrun.de>
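
The fast path for weights that are multiples of 512 can be sketched in scalar C. This sketch assumes a biweight formula in which each weighted pixel is shifted right by 9 bits before the two terms are summed, and assumes `w1 + w2 <= 16384` so the result fits in 8 bits. The function names, the signature, and the dispatch are hypothetical and only illustrate the arithmetic; they are not the SIMD code from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic path: each product is shifted down by 9 bits before summing. */
static void biweight_generic(uint8_t *dst, const uint8_t *src1,
                             const uint8_t *src2, int w1, int w2,
                             int size, ptrdiff_t stride)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)((((w2 * src1[x]) >> 9) +
                                ((w1 * src2[x]) >> 9) + 0x10) >> 5);
        dst  += stride;
        src1 += stride;
        src2 += stride;
    }
}

/*
 * Fast path: if w is a multiple of 512, (w * p) >> 9 == (w >> 9) * p
 * exactly, so the shift can be applied to the weights once up front.
 * Under the assumption above the reduced weights are at most 32, and
 * every product fits comfortably in 16 bits.
 */
static void biweight_fast(uint8_t *dst, const uint8_t *src1,
                          const uint8_t *src2, int w1, int w2,
                          int size, ptrdiff_t stride)
{
    const int k1 = w1 >> 9;
    const int k2 = w2 >> 9;

    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)((k2 * src1[x] + k1 * src2[x] + 0x10) >> 5);
        dst  += stride;
        src1 += stride;
        src2 += stride;
    }
}

/* Hypothetical dispatcher: take the fast path only when it is exact. */
static void biweight(uint8_t *dst, const uint8_t *src1, const uint8_t *src2,
                     int w1, int w2, int size, ptrdiff_t stride)
{
    if (((w1 | w2) & 511) == 0)
        biweight_fast(dst, src1, src2, w1, w2, size, stride);
    else
        biweight_generic(dst, src1, src2, w1, w2, size, stride);
}
```

Both paths give identical output whenever the fast path is taken, so the choice affects speed only. For other weights the per-product shift discards low bits that the fast path would keep, which is why the generic path must stay available.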