1. 30 Jul, 2017 1 commit
    • mdct15: add inverse transform postrotation SIMD · 70eb77b3
      Rostislav Pehlivanov authored
      2.5ms frames:
      Before   (c):  2638 decicycles in postrotate, 2097040 runs,    112 skips
      After (sse3):  1467 decicycles in postrotate, 2097083 runs,     69 skips
      After (avx2):  1244 decicycles in postrotate, 2097085 runs,     67 skips
      
      5ms frames:
      Before   (c):  4987 decicycles in postrotate, 1048371 runs,    205 skips
      After (sse3):  2644 decicycles in postrotate, 1048509 runs,     67 skips
      After (avx2):  2031 decicycles in postrotate, 1048523 runs,     53 skips
      
      10ms frames:
      Before   (c):  9153 decicycles in postrotate,  523575 runs,    713 skips
      After (sse3):  5110 decicycles in postrotate,  523726 runs,    562 skips
      After (avx2):  3738 decicycles in postrotate,  524223 runs,     65 skips
      
      20ms frames:
      Before   (c): 17857 decicycles in postrotate,  261866 runs,    278 skips
      After (sse3): 10041 decicycles in postrotate,  261746 runs,    398 skips
      After (avx2):  7050 decicycles in postrotate,  262116 runs,     28 skips
      
      Improves total decoding performance for real world content by 9% with avx2.
      Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
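The postrotate timings quoted in the commit above can be turned into speedup factors with a few lines of arithmetic. This is only a sketch: the numbers are copied from the commit message, the dictionary layout and names are made up for illustration, and it assumes "decicycles" means tenths of a CPU cycle as reported by FFmpeg's timer macros.

```python
# Postrotate timings in decicycles, copied from the commit message.
# The dictionary layout and the names below are illustrative only.
timings = {
    "2.5ms": {"c": 2638, "sse3": 1467, "avx2": 1244},
    "5ms": {"c": 4987, "sse3": 2644, "avx2": 2031},
    "10ms": {"c": 9153, "sse3": 5110, "avx2": 3738},
    "20ms": {"c": 17857, "sse3": 10041, "avx2": 7050},
}

# Speedup of each SIMD version relative to the C version, per frame size.
speedups = {
    frame: {isa: t["c"] / t[isa] for isa in ("sse3", "avx2")}
    for frame, t in timings.items()
}

for frame, s in speedups.items():
    print(f"{frame:>5}: sse3 {s['sse3']:.2f}x, avx2 {s['avx2']:.2f}x")
```

On these figures the SSE3 version comes out at roughly 1.8x to 1.9x the C version and the AVX2 version at roughly 2.1x to 2.5x, with the gain growing for longer frames.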
  2. 25 Jul, 2017 1 commit
  3. 11 Jul, 2017 1 commit
  4. 23 Jun, 2017 1 commit
  5. 07 Apr, 2017 1 commit
  6. 22 Mar, 2017 1 commit
  7. 14 Feb, 2017 1 commit
  8. 05 Jan, 2017 2 commits
    • imdct15: replace the FFT with a faster PFA FFT algorithm · 2d208aaa
      Rostislav Pehlivanov authored
      This commit replaces the current inefficient non-power-of-two FFT with a
      much faster FFT based on the Prime Factor Algorithm.
      Although it is already much faster than the old algorithm without SIMD,
      the new algorithm also makes use of the already very thoroughly SIMD'd
      power-of-two FFT, which improves performance even more on all platforms
      for which we have SIMD support.
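To illustrate the idea behind a Prime Factor Algorithm FFT, here is a minimal sketch of the textbook Good-Thomas mapping in Python. It is not the code from this commit: the function names are hypothetical, a naive DFT stands in for both the 15-point and the power-of-two sub-transforms, and the length 60 = 15 x 4 is chosen only to keep the check fast.

```python
import cmath
from math import gcd

def dft(x):
    """Naive O(n^2) DFT, standing in for an optimized sub-transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def pfa_fft(x, n1, n2):
    """Good-Thomas prime-factor FFT of length n1*n2, requiring gcd(n1, n2) == 1."""
    n = n1 * n2
    if len(x) != n or gcd(n1, n2) != 1:
        raise ValueError("need len(x) == n1*n2 with coprime n1 and n2")
    # Input reindexing: grid element (i1, i2) is x[(n2*i1 + n1*i2) % n].
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    # Length-n2 transforms along each row (the power-of-two part here).
    grid = [dft(row) for row in grid]
    # Length-n1 transforms along each column; no twiddle factors are needed
    # between the two stages because n1 and n2 are coprime.
    cols = [dft([grid[i1][k2] for i1 in range(n1)]) for k2 in range(n2)]
    # Output reindexing via the Chinese remainder theorem.
    a = pow(n2, -1, n1)  # modular inverse of n2 mod n1 (Python 3.8+)
    b = pow(n1, -1, n2)  # modular inverse of n1 mod n2
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[(n2 * a * k1 + n1 * b * k2) % n] = cols[k2][k1]
    return out

# Compare against the naive DFT on an arbitrary length-60 signal.
signal = [complex(i % 7, -(i % 5)) for i in range(60)]
result = pfa_fft(signal, 15, 4)
reference = dft(signal)
max_error = max(abs(p - q) for p, q in zip(result, reference))
```

The point of the decomposition is that the power-of-two stage can be handed to an existing optimized FFT, leaving only the small odd-length transform to be written by hand.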
      
      Most of the work was done by Peter Barfuss, who passed the code to me to
      integrate into the iMDCT and the current codebase. The code for the
      5-point and 15-point FFTs was derived from the previous implementation,
      although it was optimized and simplified, which will make adding SIMD
      for it easier in the future. The 15-point FFT currently accounts for 6%
      of the overall decoder overhead.
      
      The FFT can now easily be used as a forward transform by simply not
      multiplying the 5-point FFT's imaginary component by -1 (negating the
      complex exponential's angle negates the imaginary part of the
      exponential in the same way) and by multiplying the "theta" angle of the
      main exptab by -1. Hence the multiplication by -1 deliberately left in
      at the end.
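The relationship between forward and inverse transforms described above can be checked numerically. The following is a sketch using a naive DFT with a hypothetical `sign` parameter, not the decoder's code: flipping the sign of the exponential's angle switches direction (up to a 1/N scale), and the same result can be obtained by conjugating the input and output instead.

```python
import cmath

def dft(x, sign):
    """Naive DFT; sign=-1 is the usual forward direction, sign=+1 the inverse (unscaled)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

samples = [complex(1, 2), complex(-3, 0.5), complex(0, -1), complex(4, 4), complex(2, -2)]

# Forward, then the opposite-sign transform scaled by 1/N, recovers the input.
spectrum = dft(samples, -1)
restored = [v / len(samples) for v in dft(spectrum, +1)]

# Negating the angle is equivalent to conjugating the input and the output,
# i.e. flipping the sign of the imaginary components around the transform.
direct = dft(samples, +1)
via_conj = [v.conjugate() for v in dft([s.conjugate() for s in samples], -1)]
```

Here `restored` matches `samples` and `via_conj` matches `direct` to within floating-point error, which is why a single transform routine can serve both directions by controlling those sign flips.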
      
      FATE passes, and performance reports on other platforms/CPUs are
      welcome.
      
      Performance comparisons:
      
      iMDCT, PFA:
      101127 decicycles in speed,   32765 runs,      3 skips
      iMDCT, Old:
      211022 decicycles in speed,   32768 runs,      0 skips
      
      Standalone FFT, 300000 transforms of size 960:
          PFA        Old FFT     kiss_fft    libfftw3f
          3.659695s, 15.726912s, 13.300789s, 1.182222s
      
      Being only 3x slower than libfftw3f is a big achievement by itself.
      
      Something appears to be capping performance on the iMDCT side of
      things, possibly the pre-stage reindexing. However, it is certainly
      fast enough for now.
      Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
    • imdct15: remove the AArch64 assembly · 4fdacf4c
      Rostislav Pehlivanov authored
      Prep work for the next commit, which will add a new FFT algorithm
      that makes the iMDCT over 3x faster than it is currently (standalone,
      the FFT is over 10x faster with some frame sizes).
      
      The new FFT algorithm uses the already thoroughly SIMD'd power-of-two
      FFT, which already has SIMD for AArch64, so users of that platform will
      still see an improvement.
      
      The previous FFT+SIMD was barely 2.5x faster than the C versions on these
      platforms.
      Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
  9. 02 Feb, 2015 1 commit
  10. 12 Jan, 2015 1 commit
  11. 15 May, 2014 2 commits
    • aarch64: opus NEON iMDCT and FFT · d3f5b947
      Janne Grunau authored
      Opus celt decoding 11% faster and the iMDCT over 2.5 times faster on
      Apple's A7.
    • lavc: add a native Opus decoder. · b70d7a4a
      Anton Khirnov authored
      Initial implementation by Andrew D'Addesio <modchipv12@gmail.com> during
      GSoC 2012.
      
      Completion by Anton Khirnov <anton@khirnov.net>, sponsored by the
      Mozilla Corporation.
      
      Further contributions by:
      Christophe Gisquet <christophe.gisquet@gmail.com>
      Janne Grunau <janne-libav@jannau.net>
      Luca Barbato <lu_zero@gentoo.org>