1. 21 Mar, 2019 1 commit
  2. 20 Feb, 2019 1 commit
  3. 19 Feb, 2019 1 commit
  4. 09 Apr, 2018 1 commit
  5. 31 Mar, 2018 2 commits
  6. 30 Mar, 2018 1 commit
  7. 07 Mar, 2018 1 commit
  8. 12 Jan, 2018 1 commit
  9. 09 Dec, 2017 1 commit
    • arm/hevc_idct: fix compilation on Android · 36de24d5
      James Almer authored
      Fix the "Offset out of range" assembler error on armeabi-v7a. Compilation
      failed when building libvlc.aar for ARMv7 Android on an Ubuntu 16.04 host.
      The cause of the error is that the assembler LDR directives in the function
      "ff_hevc_transform_luma_4x4_neon_8" need local storage within a range of
      less than 1k, but no such storage was provided.
      
      Based on a patch by Ihor Bobalo <bob@eleks.com>
      
      Suggested-by: wbs
      Signed-off-by: James Almer <jamrial@gmail.com>
  10. 08 Dec, 2017 1 commit
    • hevc: Add hevc_get_pixel_4/8/12/16/24/32/48/64 · 7993ec19
      Alexandra Hájková authored
      Checkasm timings:
      block size bitdepth  C       NEON
      4           8 bit:    146.7   48.7
                 10 bit:    146.7   52.7
      8           8 bit:    430.3   84.4
                 10 bit:    430.4  119.5
      12          8 bit:    812.8  141.0
                 10 bit:    812.8  195.0
      16          8 bit:   1499.1  268.0
                 10 bit:   1498.9  368.4
      24          8 bit:   4394.2  574.8
                 10 bit:   3696.3  804.8
      32          8 bit:   5108.6  568.9
                 10 bit:   4249.6  918.8
      48          8 bit:  16819.6 2304.9
                 10 bit:  13882.0 3178.5
      64          8 bit:  13490.8 1799.5
                 10 bit:  11018.5 2519.4
      Signed-off-by: Martin Storsjö <martin@martin.st>
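The checkasm table above can be read as speedups by dividing the C column by the NEON column. The following is a small sketch using three of the 8-bit rows quoted above; the names `TIMINGS_8BIT` and `speedup` are illustrative and not part of checkasm.

```python
# Timings copied from the 8-bit rows of the table above: block size -> (C, NEON).
TIMINGS_8BIT = {
    4: (146.7, 48.7),
    16: (1499.1, 268.0),
    64: (13490.8, 1799.5),
}


def speedup(c_time, neon_time):
    """Return how many times faster the NEON version is than the C version."""
    return c_time / neon_time


for size, (c_time, neon_time) in sorted(TIMINGS_8BIT.items()):
    print(f"block size {size}: {speedup(c_time, neon_time):.1f}x")
```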
  11. 24 Oct, 2017 1 commit
  12. 02 Sep, 2017 1 commit
  13. 11 Jul, 2017 1 commit
    • avcodec/rdft: remove sintable · 0780ad9c
      Muhammad Faiz authored
      It is redundant with costable. The first half of sintable is
      identical to the second half of costable. The second half
      of sintable is the negation of the first half of sintable.
      
      The computation is changed to handle the sign of the sin values, in
      both the C code and the ARM assembly code.
      Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
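The redundancy described in the commit above follows from the identity sin(a) = cos(pi/2 - a). The sketch below models it in Python rather than in the actual C or ARM code. The table layout (n/2 entries, with the upper half mirroring the lower half) and the function names are assumptions made for illustration only.

```python
import math


def make_costable(n):
    """Build a cosine table with an assumed layout: n/2 entries holding
    cos(2*pi*i/n) for i <= n/4, mirrored into the upper indices."""
    half = n // 2
    quarter = n // 4
    tab = [0.0] * half
    for i in range(quarter + 1):
        tab[i] = math.cos(2 * math.pi * i / n)
    for i in range(1, quarter):
        tab[half - i] = tab[i]
    return tab


def sin_from_costable(tab, n, i, negate=False):
    """Read sin(2*pi*i/n) for 0 <= i < n/4 from the second half of the
    cosine table; the sign is applied by the caller via `negate`."""
    value = tab[n // 4 + i]
    return -value if negate else value
```

With this layout a separate sine table is not needed: the caller only has to track the sign, which is what the commit message describes changing in the computation.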
  14. 28 Jun, 2017 2 commits
    • lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon · e4a27e2f
      Clément Bœsch authored
      The code originally pre-multiplied the steps by 2, causing the running sum
      of the h factors to drift due to the lack of precision. It quickly
      causes an inaccuracy > 0.01.
      
      I tried various approaches, such as multiplying by 2.0 (instead of adding
      the value to itself), without success.
      
      I'm unable to bench the impact of this change, feel free to compare.
      
      This commit fixes the incoming aacpsdsp tests.
      
      Following is an alternative simplified function (matching the incoming
      AArch64 code) that may be used:
      
      function ff_ps_stereo_interpolate_neon, export=1
              vld1.32         {q0}, [r2]
              vld1.32         {q1}, [r3]
              ldr             r12, [sp]
              vmov.f32        q8, q0
              vmov.f32        q9, q1
              vzip.32         q8, q0
              vzip.32         q9, q1
      1:
              vld1.32         {d4}, [r0,:64]
              vld1.32         {d6}, [r1,:64]
              vadd.f32        q8, q8, q9
              vadd.f32        q0, q0, q1
              vmov.f32        d5, d4
              vmov.f32        d7, d6
              vmul.f32        q2, q2, q8
              vmla.f32        q2, q3, q0
              vst1.32         {d4}, [r0,:64]!
              vst1.32         {d5}, [r1,:64]!
              subs            r12, r12, #1
              bgt             1b
              bx              lr
      endfunc
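The precision issue in the commit above comes down to single-precision rounding: adding a step once per sample and adding a doubled step once per pair of samples need not produce the same running sum. The following is a minimal Python model of that effect, not a transcription of the NEON code. The helper `f32` emulates float32 rounding with the standard `struct` module, and the function names and input values are chosen purely for illustration.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def run_single_steps(h, step, n_samples):
    """Advance h by `step` once per sample, rounding to float32 each time."""
    h = f32(h)
    step = f32(step)
    for _ in range(n_samples):
        h = f32(h + step)
    return h


def run_doubled_steps(h, step, n_samples):
    """Advance h by a pre-doubled step once per pair of samples."""
    h = f32(h)
    step2 = f32(2 * f32(step))
    for _ in range(n_samples // 2):
        h = f32(h + step2)
    return h


single = run_single_steps(1.0, 4e-8, 8)
doubled = run_doubled_steps(1.0, 4e-8, 8)
print(single, doubled, single == doubled)
```

With these inputs each single step is smaller than half a float32 ulp at 1.0 and is rounded away, while the doubled step is not, so the two running sums separate. A reference implementation and an optimized one that accumulate differently can therefore disagree.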
  15. 15 May, 2017 1 commit
    • arm: Avoid using .dn register aliases · d7320ca3
      Martin Storsjö authored
      clang now (in the upcoming 5.0 version) is capable of building our
      arm assembly without relying on gas-preprocessor, although clang/LLVM
      doesn't support .dn register aliases.
      
      The VC1 MC assembly was only built and used if the chosen assembler
      supported the .dn directives though. This was supported as long as
      gas-preprocessor was used.
      
      This means that VC1 decoding got a speed regression on clang 5.0,
      unless the user manually chose using gas-preprocessor again.
      
      By avoiding using the .dn register aliases, we can build the VC1 MC
      assembly with the latest clang version.
      
      Support for the .dn/.qn directives in clang/LLVM isn't actively planned,
      see https://bugs.llvm.org/show_bug.cgi?id=18199.
      
      This partially reverts 896a5bff.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  16. 04 May, 2017 2 commits
  17. 01 May, 2017 1 commit
  18. 28 Apr, 2017 1 commit
  19. 27 Apr, 2017 1 commit
  20. 25 Apr, 2017 2 commits
  21. 12 Apr, 2017 1 commit
  22. 06 Apr, 2017 1 commit
  23. 28 Mar, 2017 3 commits
  24. 27 Mar, 2017 2 commits
  25. 20 Mar, 2017 1 commit
  26. 19 Mar, 2017 8 commits
    • arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · eabc5abf
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 14516 bytes to 22484 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
      Signed-off-by: Martin Storsjö <martin@martin.st>
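The idea behind the half/quarter variants in the commit above is that input coefficients known to be zero contribute nothing to the output, so the work for them can be skipped. The sketch below shows this with a plain floating-point 1-D inverse DCT (DCT-III) in Python. It is not the VP9 integer transform, and the function name and `nonzero` parameter are hypothetical.

```python
import math


def idct_1d(coeffs, nonzero=None):
    """Unnormalized 1-D inverse DCT (DCT-III).

    `nonzero` is the number of leading coefficients that may be non-zero;
    all terms from index `nonzero` onward are skipped entirely.
    """
    n = len(coeffs)
    if nonzero is None:
        nonzero = n
    out = []
    for x in range(n):
        acc = 0.0
        for k in range(nonzero):
            acc += coeffs[k] * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(acc)
    return out


# Only the first quarter of a 16-point input is non-zero.
coeffs = [4.0, -2.0, 1.0, 0.5] + [0.0] * 12
full = idct_1d(coeffs)
quarter = idct_1d(coeffs, nonzero=4)
print(max(abs(a - b) for a, b in zip(full, quarter)))
```

The quarter variant does a quarter of the multiply-accumulate work here and yields the same output whenever the skipped coefficients really are zero, which is why the speedup in the tables above shows up for the smaller subpartitions.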
    • arm: vp9itxfm16: Make the larger core transforms standalone functions · 0ea60320
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
      17500 to 14516 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Avoid reloading the idct32 coefficients · 32e273c1
      Martin Storsjö authored
      Keep the idct32 coefficients in narrow form in q6-q7, and idct16
      coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
      q0-q3 in the pass1 function, and squeeze the idct16 coefficients
      into q0-q1 in the pass2 function to avoid reloading them.
      
      The idct16 coefficients are clobbered and reloaded within idct32_odd
      though, since that turns out to be faster than narrowing them and
      swapping them into q6-q7.
      
      Before:                            Cortex       A7        A8        A9      A53
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
      After:
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • c1619318
      Martin Storsjö authored
    • arm: vp9itxfm16: Use the right lane size · b46d37e9
      Martin Storsjö authored
      This makes the code slightly clearer, but doesn't make any functional
      difference.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9: Fix vertical alignment · 21c89f3a
      Martin Storsjö authored
      Align the second/third operands as they usually are.
      
      Due to the wildly varying sizes of the written out operands
      in aarch64 assembly, the column alignment is usually not as clear
      as in arm assembly.
      
      This is cherrypicked from libav commit 7995ebfa.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25
      Martin Storsjö authored
      In the half/quarter cases where we don't use the min_eob array, defer
      loading the pointer until we know it will be needed.
      
      This is cherrypicked from libav commit 3a0d5e20.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Template the quarter/half idct32 function · b7a565fe
      Martin Storsjö authored
      This reduces the number of lines and reduces the duplication.
      
      Also simplify the eob check for the half case.
      
      If we are in the half case, we know that we will need to do at least
      the first three slices, so we only need to check eob for the fourth one.
      We can therefore hardcode the value to check against instead of loading
      it from the min_eob array.
      
      Since at most one slice can be skipped in the first pass, we can
      unroll the loop for filling zeros completely, as it was done for
      the quarter case before.
      
      This allows skipping loading the min_eob pointer when using the
      quarter/half cases.
      
      This is cherrypicked from libav commit 98ee855a.
      Signed-off-by: Martin Storsjö <martin@martin.st>
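The eob simplification described in the commit above can be modelled as follows. This is a rough Python sketch, not the assembly: the threshold values in `MIN_EOB` are made up for illustration, the function names are hypothetical, and the half case is assumed to be entered only when `MIN_EOB[2] < eob <= MIN_EOB[4]`.

```python
# Hypothetical per-slice thresholds: slice i is needed only if eob > MIN_EOB[i].
MIN_EOB = [0, 9, 34, 70, 135, 240, 336, 480]


def slices_generic(eob, min_eob):
    """General case: walk the threshold table to count the slices to process."""
    count = 1
    for threshold in min_eob[1:]:
        if eob <= threshold:
            break
        count += 1
    return count


def slices_half(eob, fourth_threshold):
    """Half case: the first three slices are always needed, so a single
    comparison against a hardcoded constant decides the fourth."""
    return 4 if eob > fourth_threshold else 3


# Within the assumed half range both approaches agree.
for eob in range(MIN_EOB[2] + 1, MIN_EOB[4] + 1):
    assert slices_generic(eob, MIN_EOB) == slices_half(eob, MIN_EOB[3])
```

Because the half path never indexes the table, it has no need for the table pointer, which matches the commit's remark about skipping the load of the min_eob pointer.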