1. 17 Jul, 2014 1 commit
    • armv6: Accelerate ff_imdct_half for general case (mdct_bits != 6) · 5c22e8e4
      Ben Avison authored
      The previous implementation targeted DTS Coherent Acoustics, which only
      requires mdct_bits == 6. This relatively small size lent itself to
      unrolling the loops a small number of times, and encoding offsets
      calculated at assembly time within the load/store instructions of each
      iteration.
      
      In the more general case (codecs such as AAC and AC3) much larger arrays
      are used - mdct_bits == [8, 9, 11]. The old method does not scale for
      these cases, so more integer registers are used with non-unrolled versions
      of the loops (and with some stack spillage). The postrotation filter loop
      is still unrolled by a factor of 2 to permit the double-buffering of some
      VFP registers to facilitate overlap of neighbouring iterations.
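The unroll-by-2 with double-buffering described above can be illustrated in plain C. This is only a sketch of the general technique, not the VFP assembly from this commit: the function name `rotate_unrolled2`, its parameters, and the use of a simple complex rotation as the loop body are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration (not the commit's code): rotate n complex
 * values (re[i], im[i]) in place by twiddle factors (c[i], s[i]).
 * The loop is unrolled by 2 so that two sets of temporaries ("A" and "B")
 * can alternate: while one set is being used for arithmetic, the other is
 * being loaded for the neighbouring iteration.
 */
void rotate_unrolled2(float *re, float *im,
                      const float *c, const float *s, size_t n)
{
    assert(n >= 2 && n % 2 == 0);   /* sketch handles even lengths only */

    /* Preload buffer A for the first element. */
    float ar = re[0], ai = im[0];

    for (size_t i = 0; i < n; i += 2) {
        /* Load buffer B while buffer A is consumed below. */
        float br = re[i + 1], bi = im[i + 1];

        re[i] = ar * c[i] - ai * s[i];
        im[i] = ar * s[i] + ai * c[i];

        /* Reload buffer A for the next pair while buffer B is consumed. */
        if (i + 2 < n) {
            ar = re[i + 2];
            ai = im[i + 2];
        }

        re[i + 1] = br * c[i + 1] - bi * s[i + 1];
        im[i + 1] = br * s[i + 1] + bi * c[i + 1];
    }
}
```

In C the compiler is free to reorder these loads itself; in hand-written assembly the two register sets are what allow a load for one iteration to be issued while the previous iteration's arithmetic is still in flight.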
      
      I benchmarked the result by measuring the number of gperftools samples
      that hit anywhere in the AAC decoder (starting from aac_decode_frame())
      or specifically in ff_imdct_half_c / ff_imdct_half_vfp, for the same
      example AAC stream:
      
                        Before          After
                        Mean   StdDev   Mean   StdDev  Confidence  Change
      aac_decode_frame  2368.1 35.8     2117.2 35.3    100.0%      +11.8%
      ff_imdct_half_*   457.5  22.4     251.2  16.2    100.0%      +82.1%
      Signed-off-by: Martin Storsjö <martin@martin.st>
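The Change column in the benchmark table appears consistent with a speed-up computed as the ratio of the "Before" mean to the "After" mean, minus one. The commit message does not state the formula, so the helper below (the name `speedup_percent` is hypothetical) is an assumption inferred from the published figures.

```c
#include <assert.h>

/*
 * Assumed formula, inferred from the table rather than stated in the
 * commit message: fewer profiler samples after the change means faster
 * code, so the speed-up is before/after - 1, expressed as a percentage.
 */
double speedup_percent(double before_mean, double after_mean)
{
    assert(after_mean > 0.0);
    return (before_mean / after_mean - 1.0) * 100.0;
}
```

Feeding in the rounded means from the table (2368.1 and 2117.2, or 457.5 and 251.2) reproduces the reported changes to within the rounding of the inputs.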
  2. 13 Jul, 2014 1 commit
    • armv6: Accelerate ff_imdct_half for general case (mdct_bits != 6) · 42c1cc35
      Ben Avison authored
      The previous implementation targeted DTS Coherent Acoustics, which only
      requires mdct_bits == 6. This relatively small size lent itself to
      unrolling the loops a small number of times, and encoding offsets
      calculated at assembly time within the load/store instructions of each
      iteration.
      
      In the more general case (codecs such as AAC and AC3) much larger arrays
      are used - mdct_bits == [8, 9, 11]. The old method does not scale for
      these cases, so more integer registers are used with non-unrolled versions
      of the loops (and with some stack spillage). The postrotation filter loop
      is still unrolled by a factor of 2 to permit the double-buffering of some
      VFP registers to facilitate overlap of neighbouring iterations.
      
      I benchmarked the result by measuring the number of gperftools samples
      that hit anywhere in the AAC decoder (starting from aac_decode_frame())
      or specifically in ff_imdct_half_c / ff_imdct_half_vfp, for the same
      example AAC stream:
      
                        Before          After
                        Mean   StdDev   Mean   StdDev  Confidence  Change
      aac_decode_frame  2368.1 35.8     2117.2 35.3    100.0%      +11.8%
      ff_imdct_half_*   457.5  22.4     251.2  16.2    100.0%      +82.1%
      Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
  3. 22 Jul, 2013 4 commits