1. 19 Mar, 2017 3 commits
  2. 16 Mar, 2017 1 commit
  3. 11 Mar, 2017 11 commits
    • arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 3a0d5e20
      Martin Storsjö authored
      In the half/quarter cases where we don't use the min_eob array, defer
      loading the pointer until we know it will be needed.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      3a0d5e20
    • arm: vp9itxfm: Template the quarter/half idct32 function · 98ee855a
      Martin Storsjö authored
      This reduces the number of lines and the amount of duplication.
      
      Also simplify the eob check for the half case.
      
      If we are in the half case, we know we will need at least the
      first three slices; we only need to check eob for the fourth one,
      so we can hardcode the value to check against instead of loading
      it from the min_eob array.
      
      Since at most one slice can be skipped in the first pass, we can
      unroll the loop for filling zeros completely, as it was done for
      the quarter case before.
      
      This allows skipping loading the min_eob pointer when using the
      quarter/half cases.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      98ee855a
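The half-case eob logic described above can be sketched in Python. This is an illustrative model, not the NEON code; the threshold values are hypothetical stand-ins for the real min_eob constants.

```python
# Illustrative model of the half-case eob check: the first three 4-line
# slices are always computed, so only the fourth slice's threshold needs
# checking, and it can be a hardcoded constant instead of a load from the
# min_eob array. Threshold values are made up for illustration.
MIN_EOB = [1, 10, 38, 89]  # hypothetical per-slice minimum eob values

def slices_to_compute_half(eob):
    # Slices 0-2 are unconditional in the half case; slice 3 is needed
    # only when eob reaches its threshold. The compare is against a
    # constant, so the min_eob pointer never has to be loaded.
    return 4 if eob >= MIN_EOB[3] else 3
```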
    • arm: vp9itxfm: Reorder iadst16 coeffs · b2e20d89
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      There they are in this order to make sure we access them in the
      same order they are declared, which eases loading only half of the
      coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      
      This is cherrypicked from libav commit
      08074c09.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      b2e20d89
    • arm: vp9itxfm: Reorder the idct coefficients for better pairing · 4f693b56
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      
      This is cherrypicked from libav commit
      de06bdfe.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      4f693b56
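The pairing problem this commit describes can be modelled in Python. The slot names and the pair-alignment view are illustrative only, not the actual coefficient values or register layout.

```python
# Model of the reordering: 16 coefficient slots, all used pairwise except
# the first. With the unused slot last, every used pair starts at an odd
# index and straddles an aligned slot pair; moving the unused slot to
# position 1 aligns every used pair. Names are illustrative.
old_order = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7',
             'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'unused']
new_order = ['c0', 'unused', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6',
             'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14']

def misaligned_pairs(order):
    # Count used pairs (c1,c2), (c3,c4), ... that do not occupy an
    # aligned (even, even+1) pair of slots.
    idx = {name: i for i, name in enumerate(order)}
    bad = 0
    for k in range(1, 15, 2):
        i = idx['c%d' % k]
        if i % 2 != 0 or idx['c%d' % (k + 1)] != i + 1:
            bad += 1
    return bad
```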
    • arm: vp9itxfm: Avoid reloading the idct32 coefficients · 600f4c9b
      Martin Storsjö authored
      The idct32x32 function actually pushed q4-q7 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
      q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
      in the idct16 function), and the lanewise vmul needs a register in
      the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
      while doing idct16.
      
      While keeping these coefficients in registers, we can still skip
      pushing q7.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
      
      This is cherrypicked from libav commit
      402546a1.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      600f4c9b
    • arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · 758302e4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Before:                            Cortex A7      A8      A9     A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
      vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
      vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0
      
      This is cherrypicked from libav commit
      a76bf8cf.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      758302e4
    • arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function · 1d8ab576
      Martin Storsjö authored
      This is cherrypicked from libav commit
      3933b86b.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      1d8ab576
    • arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 82458955
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 12388 bytes to 19784 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
      vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
      vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
      vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
      vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
      vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
      
      This is cherrypicked from libav commit
      5eb5aec4.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      82458955
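The dispatch this commit adds can be sketched as a Python model of choosing a pass variant from eob. The thresholds are invented for illustration; the real constants are derived from the coefficient scan order in the assembly.

```python
def choose_idct16_pass(eob):
    # Pick the cheapest first-pass variant that still covers all nonzero
    # coefficients. Threshold values are illustrative only.
    if eob <= 1:
        return 'dc'       # dc-only shortcut
    if eob <= 10:         # all nonzero coeffs within the first quarter
        return 'quarter'
    if eob <= 38:         # all nonzero coeffs within the first half
        return 'half'
    return 'full'
```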
    • arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 3bd9b391
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      
      This is cherrypicked from libav commit
      47b3c2c1.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      3bd9b391
    • arm: vp9itxfm: Make the larger core transforms standalone functions · f8fcee0d
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
      15324 to 12388 bytes.
      
      This gives a small slowdown of a couple tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9
      
      This is cherrypicked from libav commit
      0331c3f5.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      f8fcee0d
    • arm: vp9itxfm: Avoid .irp when it doesn't save any lines · 31e41350
      Martin Storsjö authored
      This makes it more readable.
      
      This is cherrypicked from libav commit
      3bc5b28d.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      31e41350
  4. 23 Feb, 2017 3 commits
    • arm: vp9itxfm: Reorder iadst16 coeffs · 08074c09
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      There they are in this order to make sure we access them in the
      same order they are declared, which eases loading only half of the
      coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      08074c09
    • arm: vp9itxfm: Reorder the idct coefficients for better pairing · de06bdfe
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      de06bdfe
    • arm: vp9itxfm: Avoid reloading the idct32 coefficients · 402546a1
      Martin Storsjö authored
      The idct32x32 function actually pushed q4-q7 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
      q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
      in the idct16 function), and the lanewise vmul needs a register in
      the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
      while doing idct16.
      
      While keeping these coefficients in registers, we can still skip
      pushing q7.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
      Signed-off-by: Martin Storsjö <martin@martin.st>
      402546a1
  5. 10 Feb, 2017 1 commit
  6. 09 Feb, 2017 4 commits
    • arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 5eb5aec4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 12388 bytes to 19784 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
      vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
      vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
      vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
      vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
      vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
      5eb5aec4
    • arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 47b3c2c1
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      47b3c2c1
    • arm: vp9itxfm: Make the larger core transforms standalone functions · 0331c3f5
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
      15324 to 12388 bytes.
      
      This gives a small slowdown of a couple tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
      0331c3f5
  7. 05 Feb, 2017 1 commit
  8. 14 Jan, 2017 5 commits
    • arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 388f6e67
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
                                           Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
      
      By skipping individual 4x16 or 4x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
      vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
      vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
      
      I.e. in general a very minor overhead for the full subpartition case due
      to the additional loads and cmps, but a significant speedup for the cases
      when we only need to process a small part of the actual input data.
      
      In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
      16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
      8x8 or 16x16 subpartitions respectively.
      
      This is cherrypicked from libav commit
      9c8bc74c.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
      388f6e67
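The per-slice skipping can be modelled like this (an illustrative Python sketch; the real min_eob values come from the coefficient scan order, and the thresholds below are hypothetical):

```python
def first_pass_slices(eob, min_eob):
    # For each 4-line slice of the first pass, decide whether it must be
    # computed: a slice is needed only if eob reaches the scan position
    # where that slice first sees a nonzero coefficient.
    return [eob >= m for m in min_eob]

# e.g. with hypothetical thresholds for a 16x16 transform, a block whose
# nonzero coefficients all sit in the top-left corner only needs the
# first slice computed; the rest are filled with zeros (or skipped where
# the second pass won't read).
```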
    • arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination · ecd343aa
      Martin Storsjö authored
      This avoids reloading them if they haven't been clobbered, if the
      first pass also was idct.
      
      This is similar to what was done in the aarch64 version.
      
      This is cherrypicked from libav commit
      3c87039a.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
      ecd343aa
    • arm: vp9itxfm: Rename a macro parameter to fit better · f69dd26d
      Martin Storsjö authored
      Since the same parameter is used for both input and output,
      the name inout is more fitting.
      
      This matches the naming used below in the dmbutterfly macro.
      
      This is cherrypicked from libav commit
      79566ec8.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
      f69dd26d
    • arm/aarch64: vp9itxfm: Fix indentation of macro arguments · 4a5874ea
      Martin Storsjö authored
      This is cherrypicked from libav commit
      721bc375.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
      4a5874ea
    • arm: vp9itxfm: Simplify the stack alignment code · a71cd843
      Janne Grunau authored
      This is one instruction less for thumb, and leaves only one or two
      arm/thumb-specific instructions.
      
      This is cherrypicked from libav commit
      e5b0fc17.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
      a71cd843
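The underlying alignment operation is the usual mask trick, shown here on plain integers (a Python sketch; the commit only changes how this is expressed in arm/thumb instructions):

```python
def align_down(sp, alignment=16):
    # Clear the low bits so sp lands on the previous aligned boundary;
    # alignment must be a power of two.
    return sp & ~(alignment - 1)
```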
  9. 30 Nov, 2016 2 commits
    • arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 9c8bc74c
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
                                           Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
      
      By skipping individual 4x16 or 4x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
      vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
      vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
      
      I.e. in general a very minor overhead for the full subpartition case due
      to the additional loads and cmps, but a significant speedup for the cases
      when we only need to process a small part of the actual input data.
      
      In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
      16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
      8x8 or 16x16 subpartitions respectively.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      9c8bc74c
    • arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination · 3c87039a
      Martin Storsjö authored
      This avoids reloading them if they haven't been clobbered, if the
      first pass also was idct.
      
      This is similar to what was done in the aarch64 version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      3c87039a
  10. 23 Nov, 2016 2 commits
  11. 18 Nov, 2016 1 commit
  12. 15 Nov, 2016 1 commit
    • arm: vp9: Add NEON itxfm routines · b4dc7c34
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      For the transforms up to 8x8, we can fit all the data (including
      temporaries) in registers and just do a straightforward transform
      of all the data. For 16x16, we do a transform of 4x16 pixels in
      4 slices, using a temporary buffer. For 32x32, we transform 4x32
      pixels at a time, in two steps of 4x16 pixels each.
      
      Examples of relative speedup compared to the C version, from checkasm:
                               Cortex       A7     A8     A9    A53
      vp9_inv_adst_adst_4x4_add_neon:     3.39   5.83   4.17   4.01
      vp9_inv_adst_adst_8x8_add_neon:     3.79   4.86   4.23   3.98
      vp9_inv_adst_adst_16x16_add_neon:   3.33   4.36   4.11   4.16
      vp9_inv_dct_dct_4x4_add_neon:       4.06   6.16   4.59   4.46
      vp9_inv_dct_dct_8x8_add_neon:       4.61   6.01   4.98   4.86
      vp9_inv_dct_dct_16x16_add_neon:     3.35   3.44   3.36   3.79
      vp9_inv_dct_dct_32x32_add_neon:     3.89   3.50   3.79   4.42
      vp9_inv_wht_wht_4x4_add_neon:       3.22   5.13   3.53   3.77
      
      Thus, the speedup vs C code is around 3-6x.
      
      This is mostly marginally faster than the corresponding routines
      in libvpx on most cores, tested with their 32x32 idct (compared to
      vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
      favour since their version doesn't clear the input buffer as ours
      does (although the effect of that on the total runtime is probably
      negligible).
      
                                 Cortex       A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_add_neon:    18436.8  16874.1  14235.1  11988.9
      libvpx vpx_idct32x32_1024_add_neon 20789.0  13344.3  15049.9  13030.5
      
      Only on the Cortex A8 is the libvpx function faster. On the other
      cores, ours is slightly faster even though it has source block
      clearing integrated.
      
      This is an adapted cherry-pick from libav commits
      a67ae670 and
      52d196fb.
      Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
      b4dc7c34
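The slice-based two-pass structure described above can be sketched in Python with a separable transform. This is a model only: idct1d stands in for the fixed-point NEON idct16, and the final add-to-destination step is omitted.

```python
def idct2d_16x16(coeffs, idct1d):
    # First pass: transform columns, four at a time (one 4x16 "slice"),
    # into a temporary buffer; second pass: transform the rows of the
    # temporary buffer. The real code would then add the result to the
    # destination pixels.
    n = 16
    tmp = [[0] * n for _ in range(n)]
    for slice_start in range(0, n, 4):
        for col in range(slice_start, slice_start + 4):
            out = idct1d([coeffs[row][col] for row in range(n)])
            for row in range(n):
                tmp[row][col] = out[row]
    return [idct1d(tmp[row]) for row in range(n)]
```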
  13. 13 Nov, 2016 1 commit
  14. 11 Nov, 2016 1 commit
    • arm: vp9: Add NEON itxfm routines · a67ae670
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      For the transforms up to 8x8, we can fit all the data (including
      temporaries) in registers and just do a straightforward transform
      of all the data. For 16x16, we do a transform of 4x16 pixels in
      4 slices, using a temporary buffer. For 32x32, we transform 4x32
      pixels at a time, in two steps of 4x16 pixels each.
      
      Examples of relative speedup compared to the C version, from checkasm:
                               Cortex       A7     A8     A9    A53
      vp9_inv_adst_adst_4x4_add_neon:     3.39   5.83   4.17   4.01
      vp9_inv_adst_adst_8x8_add_neon:     3.79   4.86   4.23   3.98
      vp9_inv_adst_adst_16x16_add_neon:   3.33   4.36   4.11   4.16
      vp9_inv_dct_dct_4x4_add_neon:       4.06   6.16   4.59   4.46
      vp9_inv_dct_dct_8x8_add_neon:       4.61   6.01   4.98   4.86
      vp9_inv_dct_dct_16x16_add_neon:     3.35   3.44   3.36   3.79
      vp9_inv_dct_dct_32x32_add_neon:     3.89   3.50   3.79   4.42
      vp9_inv_wht_wht_4x4_add_neon:       3.22   5.13   3.53   3.77
      
      Thus, the speedup vs C code is around 3-6x.
      
      This is mostly marginally faster than the corresponding routines
      in libvpx on most cores, tested with their 32x32 idct (compared to
      vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
      favour since their version doesn't clear the input buffer as ours
      does (although the effect of that on the total runtime is probably
      negligible).
      
                                 Cortex       A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_add_neon:    18436.8  16874.1  14235.1  11988.9
      libvpx vpx_idct32x32_1024_add_neon 20789.0  13344.3  15049.9  13030.5
      
      Only on the Cortex A8 is the libvpx function faster. On the other
      cores, ours is slightly faster even though it has source block
      clearing integrated.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      a67ae670