1. 11 Jul, 2017 1 commit
    • avcodec/rdft: remove sintable · 0780ad9c
      Muhammad Faiz authored
      It is redundant with costable. The first half of sintable is
      identical to the second half of costable. The second half
      of sintable is the negative of the first half of sintable.
      
      The computation is changed to handle the sign of the sin values,
      in both the C code and the ARM assembly code.
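      
      As a rough C sketch (the table length and helper name are illustrative
      assumptions, not the actual FFmpeg code), the relation above lets every
      sin lookup be served from costable alone:
      
      /* Assuming sintable and costable both hold n entries:
       *   sintable[i]       ==  costable[n/2 + i]   for i < n/2
       *   sintable[n/2 + i] == -sintable[i]         for i < n/2
       */
      static inline float sin_from_costable(const float *costable, int n, int i)
      {
          if (i < n / 2)
              return  costable[n / 2 + i];  /* first half: direct reuse       */
          else
              return -costable[i];          /* second half: reuse + sign flip */
      }
      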
      Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
  2. 28 Jun, 2017 2 commits
    • lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon · e4a27e2f
      Clément Bœsch authored
      The code originally pre-multiplied the steps by 2, causing the running sum
      of the h factors to drift away due to the lack of precision. This quickly
      causes an inaccuracy > 0.01.
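      
      A rough scalar model of the issue (illustrative only, reduced to a single
      factor; the real function tracks four h factors per complex sample pair):
      
      /* Reference behaviour: advance the running factor once per sample. */
      static void interp_ref(float *dst, const float *src, int len,
                             float h, float h_step)
      {
          for (int n = 0; n < len; n++) {
              h += h_step;
              dst[n] = h * src[n];
          }
      }
      
      A SIMD version that processes two samples per iteration and advances the
      running factor with a pre-doubled step (h += 2 * h_step) rounds the
      running sum differently in single precision, so over a long buffer the
      two accumulations drift apart.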
      
      I tried various approaches, such as multiplying by 2.0 (instead of
      adding the value to itself), without success.
      
      I'm unable to benchmark the impact of this change; feel free to compare.
      
      This commit fixes the incoming aacpsdsp tests.
      
      Following is an alternative simplified function (matching the incoming
      AArch64 code) that may be used:
      
      function ff_ps_stereo_interpolate_neon, export=1
              vld1.32         {q0}, [r2]
              vld1.32         {q1}, [r3]
              ldr             r12, [sp]
              vmov.f32        q8, q0
              vmov.f32        q9, q1
              vzip.32         q8, q0
              vzip.32         q9, q1
      1:
              vld1.32         {d4}, [r0,:64]
              vld1.32         {d6}, [r1,:64]
              vadd.f32        q8, q8, q9
              vadd.f32        q0, q0, q1
              vmov.f32        d5, d4
              vmov.f32        d7, d6
              vmul.f32        q2, q2, q8
              vmla.f32        q2, q3, q0
              vst1.32         {d4}, [r0,:64]!
              vst1.32         {d5}, [r1,:64]!
              subs            r12, r12, #1
              bgt             1b
              bx              lr
      endfunc
  3. 15 May, 2017 1 commit
    • arm: Avoid using .dn register aliases · d7320ca3
      Martin Storsjö authored
      clang now (in the upcoming 5.0 version) is capable of building our
      arm assembly without relying on gas-preprocessor, although clang/LLVM
      doesn't support .dn register aliases.
      
      The VC1 MC assembly was only built and used if the chosen assembler
      supported the .dn directives though. This was supported as long as
      gas-preprocessor was used.
      
      This means that VC1 decoding got a speed regression on clang 5.0,
      unless the user manually chose to use gas-preprocessor again.
      
      By avoiding using the .dn register aliases, we can build the VC1 MC
      assembly with the latest clang version.
      
      Support for the .dn/.qn directives in clang/LLVM isn't actively planned;
      see https://bugs.llvm.org/show_bug.cgi?id=18199.
      
      This partially reverts 896a5bff.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  4. 04 May, 2017 2 commits
  5. 01 May, 2017 1 commit
  6. 28 Apr, 2017 1 commit
  7. 27 Apr, 2017 1 commit
  8. 25 Apr, 2017 2 commits
  9. 12 Apr, 2017 1 commit
  10. 06 Apr, 2017 1 commit
  11. 28 Mar, 2017 3 commits
  12. 27 Mar, 2017 2 commits
  13. 20 Mar, 2017 1 commit
  14. 19 Mar, 2017 8 commits
    • arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · eabc5abf
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
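      
      Roughly, the dispatch is driven by the end-of-block position (the
      function names and thresholds below are placeholders, not the actual
      symbols):
      
      /* Pick a cheaper pass when eob shows that only the top-left corner of
       * the coefficient block can be nonzero, so the remaining coefficients
       * are neither loaded nor zero-filled for the second pass. */
      static void idct16_add_dispatch(uint8_t *dst, ptrdiff_t stride,
                                      int16_t *block, int eob)
      {
          if (eob <= QUARTER_EOB)       /* nonzero coeffs fit in a 4x4 corner */
              idct16_quarter_add(dst, stride, block);
          else if (eob <= HALF_EOB)     /* nonzero coeffs fit in an 8x8 corner */
              idct16_half_add(dst, stride, block);
          else
              idct16_full_add(dst, stride, block);
      }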
      
      The code size increases from 14516 bytes to 22484 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Make the larger core transforms standalone functions · 0ea60320
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
      17500 to 14516 bytes.
      
      This gives a small slowdown of a couple tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Avoid reloading the idct32 coefficients · 32e273c1
      Martin Storsjö authored
      Keep the idct32 coefficients in narrow form in q6-q7, and idct16
      coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
      q0-q3 in the pass1 function, and squeeze the idct16 coefficients
      into q0-q1 in the pass2 function to avoid reloading them.
      
      The idct16 coefficients are clobbered and reloaded within idct32_odd
      though, since that turns out to be faster than narrowing them and
      swapping them into q6-q7.
      
      Before:                            Cortex       A7        A8        A9      A53
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
      After:
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • Martin Storsjö
      c1619318
    • arm: vp9itxfm16: Use the right lane size · b46d37e9
      Martin Storsjö authored
      This makes the code slightly clearer, but doesn't make any functional
      difference.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9: Fix vertical alignment · 21c89f3a
      Martin Storsjö authored
      Align the second/third operands as they usually are.
      
      Due to the wildly varying sizes of the written out operands
      in aarch64 assembly, the column alignment is usually not as clear
      as in arm assembly.
      
      This is cherrypicked from libav commit
      7995ebfa.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25
      Martin Storsjö authored
      In the half/quarter cases where we don't use the min_eob array, defer
      loading the pointer until we know it will be needed.
      
      This is cherrypicked from libav commit
      3a0d5e20.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Template the quarter/half idct32 function · b7a565fe
      Martin Storsjö authored
      This reduces the number of lines and reduces the duplication.
      
      Also simplify the eob check for the half case.
      
      If we are in the half case, we know we will need to do at least the
      first three slices; we only need to check eob for the fourth one,
      so we can hardcode the value to check against instead of loading it
      from the min_eob array.
      
      Since at most one slice can be skipped in the first pass, we can
      unroll the loop for filling zeros completely, as it was done for
      the quarter case before.
      
      This allows us to skip loading the min_eob pointer when using the
      quarter/half cases.
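      
      As a loose C analogue (the slice layout, names and threshold are
      illustrative):
      
      /* Half case, first pass over four 4-column slices: the first three are
       * always computed, so only the last slice needs an eob check, and its
       * threshold can be an immediate instead of a load from min_eob. */
      for (int slice = 0; slice < 4; slice++) {
          if (slice == 3 && eob < MIN_EOB_LAST_SLICE) {
              zero_fill_slice(tmp, slice);  /* fill only what pass 2 reads */
              break;
          }
          idct32_half_pass1(block + slice * 4, tmp + slice * 4);
      }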
      
      This is cherrypicked from libav commit
      98ee855a.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  15. 16 Mar, 2017 1 commit
  16. 11 Mar, 2017 12 commits
    • arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 3a0d5e20
      Martin Storsjö authored
      In the half/quarter cases where we don't use the min_eob array, defer
      loading the pointer until we know it will be needed.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Template the quarter/half idct32 function · 98ee855a
      Martin Storsjö authored
      This reduces the number of lines and reduces the duplication.
      
      Also simplify the eob check for the half case.
      
      If we are in the half case, we know we will need to do at least the
      first three slices; we only need to check eob for the fourth one,
      so we can hardcode the value to check against instead of loading it
      from the min_eob array.
      
      Since at most one slice can be skipped in the first pass, we can
      unroll the loop for filling zeros completely, as it was done for
      the quarter case before.
      
      This allows us to skip loading the min_eob pointer when using the
      quarter/half cases.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Reorder iadst16 coeffs · b2e20d89
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      There they are in this order to make sure we access them in the
      same order they are declared, which makes it easier to load only
      half of the coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      
      This is cherrypicked from libav commit
      08074c09.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Reorder the idct coefficients for better pairing · 4f693b56
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      
      This is cherrypicked from libav commit
      de06bdfe.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Avoid reloading the idct32 coefficients · 600f4c9b
      Martin Storsjö authored
      The idct32x32 function actually pushed q4-q7 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
      q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
      in the idct16 function), and the lanewise vmul needs a register in
      the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
      while doing idct16.
      
      While keeping these coefficients in registers, we can still skip
      pushing q7.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
      
      This is cherrypicked from libav commit
      402546a1.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9lpf: Implement the mix2_44 function with one single filter pass · a88db8b9
      Martin Storsjö authored
      For this case, with 8 inputs but only changing 4 of them, we can fit
      all 16 input pixels into a q register, and still have enough temporary
      registers for doing the loop filter.
      
      The wd=8 filters would require too many temporary registers for
      processing all 16 pixels at once though.
      
      Before:                          Cortex A7      A8     A9     A53
      vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
      
      This is cherrypicked from libav commit
      575e31e9.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · 3fbbad29
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
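      
      In C terms, a per-lane unsigned saturating add (presumably NEON's
      vqadd.u8; the instruction choice here is an assumption) behaves like:
      
      #include <stdint.h>
      
      static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
      {
          unsigned sum = (unsigned)a + b;
          return sum > 255 ? 255 : sum;  /* clamp instead of wrapping */
      }
      
      Since E never exceeds 193, clamping the sum at 255 cannot flip the
      result of the comparison against E, so the whole check stays in
      8 bits.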
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      
      This is cherrypicked from libav commit
      c582cb85.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9lpf: Interleave the start of flat8in into the calculation above · 83399cf5
      Martin Storsjö authored
      This adds lots of extra .ifs, but speeds it up by a couple of cycles
      by avoiding stalls.
      
      This is cherrypicked from libav commit
      e18c3900.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9lpf: Use orrs instead of orr+cmp · 92ab8374
      Martin Storsjö authored
      This is cherrypicked from libav commit
      435cd7bc.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9lpf: Calculate !hev directly · f0ecbb13
      Martin Storsjö authored
      Previously we first calculated hev, and then negated it.
      
      Since the negation could be scheduled in the middle of another
      calculation, removing it doesn't show a gain in all cases.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
      vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
      After:
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      
      This is cherrypicked from libav commit
      e1f9de86.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · 758302e4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Before:                            Cortex A7      A8      A9     A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
      vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
      vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0
      
      This is cherrypicked from libav commit
      a76bf8cf.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter · bff07715
      Martin Storsjö authored
      Before:                    Cortex A7      A8     A9     A53
      vp9_put_8tap_smooth_4h_neon:   378.1   273.2  340.7   229.5
      After:
      vp9_put_8tap_smooth_4h_neon:   352.1   222.2  290.5   229.5
      
      This is cherrypicked from libav commit
      fea92a4b.
      Signed-off-by: Martin Storsjö <martin@martin.st>