20 Mar, 2017 (26 commits)
  19 Mar, 2017 (14 commits)
    • aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · 61b8a9ea
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 21512 bytes to 31400 bytes.
      
      The idct16/32_end macros are moved above the individual functions;
      the instructions themselves are unchanged, but since new functions
      are added in the same place the code was moved from, the diff looks
      rather messy.
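
      Conceptually, the new dispatch looks like the following minimal C
      sketch (function names and eob thresholds are illustrative, not
      taken from the actual NEON assembly):

      #include <stdint.h>
      #include <stddef.h>

      void idct16_dc_add(int32_t *coef, uint16_t *dst, ptrdiff_t stride);
      void idct16_quarter_add(int32_t *coef, uint16_t *dst, ptrdiff_t stride);
      void idct16_half_add(int32_t *coef, uint16_t *dst, ptrdiff_t stride);
      void idct16_full_add(int32_t *coef, uint16_t *dst, ptrdiff_t stride);

      void idct16_add(int32_t *coef, uint16_t *dst, ptrdiff_t stride, int eob)
      {
          if (eob == 1)             /* DC coefficient only */
              idct16_dc_add(coef, dst, stride);
          else if (eob <= 10)       /* nonzero coefficients only in the
                                       top-left 4x4: quarter transform */
              idct16_quarter_add(coef, dst, stride);
          else if (eob <= 38)       /* top-left 8x8: half transform */
              idct16_half_add(coef, dst, stride);
          else
              idct16_full_add(coef, dst, stride);
      }

      The quarter/half variants also skip writing the zero parts of the
      temp buffer that the second pass is known not to read.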
      
      Before:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     284.6
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1902.7
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1903.0
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    2201.1
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2510.0
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2821.3
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1011.6
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:    9716.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9704.9
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   10641.7
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  11555.7
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  12499.8
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13403.7
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14335.8
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15253.6
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16179.5
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     282.8
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1142.4
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1139.0
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    1772.9
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2515.2
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2823.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1012.7
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:    6944.4
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    6944.2
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:    7609.8
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:   9953.4
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  10770.1
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13418.8
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14330.7
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15257.1
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16190.6
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · eabc5abf
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 14516 bytes to 22484 bytes.
      
      The idct16/32_end macros are moved above the individual functions;
      the instructions themselves are unchanged, but since new functions
      are added in the same place the code was moved from, the diff looks
      rather messy.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function · d564c901
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
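
      In rough C terms, the shared epilogue loads the destination pixels,
      adds the residual, clips to the 10 bit range and stores the result
      (a sketch; names and details are illustrative, the real code is a
      NEON assembly macro):

      #include <stdint.h>
      #include <stddef.h>

      static inline uint16_t clip_pixel10(int v)  /* clamp to [0, 1023] */
      {
          return (uint16_t)(v < 0 ? 0 : v > 1023 ? 1023 : v);
      }

      /* Shared epilogue: load a column of destination pixels, add the
         residual, clip to the 10 bit range and store the result. */
      static void load_add_store(uint16_t *dst, ptrdiff_t stride,
                                 const int32_t *residual, int rows)
      {
          for (int i = 0; i < rows; i++)
              dst[i * stride] = clip_pixel10(dst[i * stride] + residual[i]);
      }

      With the macro hoisted out, both the existing pass2 function and a
      new specialized pass2 implementation can expand the same code.
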
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm16: Make the larger core transforms standalone functions · 0f2705e6
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
      26288 to 21512 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, but makes
      it more feasible to add more optimized versions of these transforms.
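
      The tradeoff is the classic inline-versus-call one; in rough C
      terms (all names here are illustrative):

      #include <stdint.h>

      /* The core transform is now emitted once as a standalone function
         instead of being expanded into each caller. */
      static void idct16_1d(int32_t *data)
      {
          /* ... the 16-point transform body, elided ... */
          (void)data;
      }

      /* Each pass branches to the shared body (a few cycles of call
         overhead) rather than carrying its own inlined copy of it. */
      void idct16x16_pass1(int32_t *coef) { idct16_1d(coef); }
      void idct16x16_pass2(int32_t *coef) { idct16_1d(coef); }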
      
      Before:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1887.4
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9691.4
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1899.5
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9714.7
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Make the larger core transforms standalone functions · 0ea60320
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
      17500 to 14516 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm16: Restructure the idct32 store macros · b76533f1
      Martin Storsjö authored
      This avoids concatenation, which can't be used if the whole macro
      is wrapped within another macro.
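
      A loosely analogous C preprocessor example (the real code uses GNU
      as macros; all names here are made up). An argument that sits next
      to the ## paste operator is not macro-expanded first, so building
      names by concatenation breaks as soon as the macro is reached
      through another macro:

      #include <stdio.h>

      #define NAME_BY_PASTE(n)  store_ ## n
      #define WIDTH             16
      /* NAME_BY_PASTE(16)    -> store_16     (works)  */
      /* NAME_BY_PASTE(WIDTH) -> store_WIDTH  (broken) */

      /* Taking the complete name as the parameter needs no pasting,
         so it works at any nesting depth: */
      #define CALL(fn)      fn()
      #define WRAPPED(fn)   CALL(fn)

      static void store_16(void) { puts("store_16"); }

      int main(void)
      {
          WRAPPED(store_16);   /* expands to store_16() */
          return 0;
      }
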
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines · d6132516
      Martin Storsjö authored
      This makes the code a bit more readable.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • 25ced1eb
    • arm: vp9itxfm16: Avoid reloading the idct32 coefficients · 32e273c1
      Martin Storsjö authored
      Keep the idct32 coefficients in narrow form in q6-q7, and idct16
      coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
      q0-q3 in the pass1 function, and squeeze the idct16 coefficients
      into q0-q1 in the pass2 function to avoid reloading them.
      
      The idct16 coefficients are clobbered and reloaded within idct32_odd
      though, since that turns out to be faster than narrowing them and
      swapping them into q6-q7.
      
      Before:                            Cortex       A7        A8        A9      A53
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
      After:
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • c1619318
    • arm: vp9itxfm16: Use the right lane size · b46d37e9
      Martin Storsjö authored
      This makes the code slightly clearer, but doesn't make any functional
      difference.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9: Fix vertical alignment · 21c89f3a
      Martin Storsjö authored
      Align the second/third operands as they usually are.
      
      Due to the wildly varying sizes of the written out operands
      in aarch64 assembly, the column alignment is usually not as clear
      as in arm assembly.
      
      This is cherrypicked from libav commit
      7995ebfa.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25
      Martin Storsjö authored
      In the half/quarter cases where we don't use the min_eob array, defer
      loading the pointer until we know it will be needed.
      
      This is cherrypicked from libav commit
      3a0d5e20.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Template the quarter/half idct32 function · b7a565fe
      Martin Storsjö authored
      This reduces the number of lines and reduces the duplication.
      
      Also simplify the eob check for the half case.
      
      If we are in the half case, we know we will need at least the
      first three slices; we only need to check eob for the fourth one,
      so we can hardcode the value to check against instead of loading
      it from the min_eob array.
      
      Since at most one slice can be skipped in the first pass, we can
      fully unroll the loop that fills zeros, as was done for the
      quarter case before.
      
      This lets us skip loading the min_eob pointer in the
      quarter/half cases.
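
      In rough C terms (function names, buffer layout and the threshold
      value are all illustrative, not taken from the real code):

      #include <stdint.h>
      #include <string.h>

      #define SLICE          (4 * 32) /* one 4-column slice of the temp buffer */
      #define EOB_SLICE3_MIN 36       /* made-up threshold, not the real value */

      /* One slice of the idct32 half first pass (illustrative). */
      void idct32_half_pass1_slice(const int32_t *in, int32_t *tmp, int slice);

      void idct32_half_pass1(const int32_t *in, int32_t *tmp, int eob)
      {
          /* The first three slices are always needed in the half case. */
          for (int i = 0; i < 3; i++)
              idct32_half_pass1_slice(in, tmp, i);

          if (eob >= EOB_SLICE3_MIN)  /* compare against an immediate;
                                         no load from min_eob */
              idct32_half_pass1_slice(in, tmp, 3);
          else                        /* at most this one slice is skipped,
                                         so the asm unrolls the zero fill */
              memset(tmp + 3 * SLICE, 0, SLICE * sizeof(*tmp));
      }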
      
      This is cherrypicked from libav commit
      98ee855a.
      Signed-off-by: Martin Storsjö <martin@martin.st>