1. 19 Mar, 2017 (5 commits)
    • arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · eabc5abf
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
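
      In rough C terms, the dispatch looks something like the sketch below; the
      helper names and eob thresholds are illustrative placeholders, not the
      actual NEON entry points.

          #include <stddef.h>
          #include <stdint.h>

          /* Hypothetical helpers: the quarter variant only reads the first 4
           * input rows, the half variant only the first 8, and neither writes
           * the parts of tmp that its matching second pass never reads. */
          void idct16_quarter_pass1(const int32_t *in, int32_t *tmp);
          void idct16_half_pass1(const int32_t *in, int32_t *tmp);
          void idct16_full_pass1(const int32_t *in, int32_t *tmp);
          void idct16_pass2_add(uint16_t *dst, ptrdiff_t stride, const int32_t *tmp);

          void idct16x16_add(uint16_t *dst, ptrdiff_t stride,
                             const int32_t *coeffs, int eob)
          {
              int32_t tmp[16 * 16];

              if (eob <= 10)        /* illustrative threshold: only the top-left 4x4 can be nonzero */
                  idct16_quarter_pass1(coeffs, tmp);
              else if (eob <= 38)   /* illustrative threshold: only the top-left 8x8 can be nonzero */
                  idct16_half_pass1(coeffs, tmp);
              else
                  idct16_full_pass1(coeffs, tmp);

              idct16_pass2_add(dst, stride, tmp);
          }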
      
      The code size increases from 14516 bytes to 22484 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
      vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
      vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
      vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
      vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
      vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
      vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Make the larger core transforms standalone functions · 0ea60320
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
      17500 to 14516 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes it
      more feasible to add more optimized versions of these transforms.
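
      As a rough C analogy of the size/speed tradeoff (GCC/Clang attributes
      used purely for illustration; the actual change is in the hand-written
      assembly, not in C):

          #include <stdint.h>

          /* Expanded at every call site: no call overhead, but each caller
           * gets its own copy of the (large) body, so the object file grows. */
          static inline __attribute__((always_inline))
          void transform_core_inline(int32_t *data)
          {
              (void)data;   /* ...imagine a few hundred instructions here... */
          }

          /* Standalone function: a single copy of the body, at the cost of a
           * call and return (a few tens of cycles) per use. */
          static __attribute__((noinline))
          void transform_core_standalone(int32_t *data)
          {
              (void)data;   /* ...same body, emitted only once... */
          }

          void two_passes(int32_t *buf)
          {
              transform_core_inline(buf);       /* body duplicated at this site */
              transform_core_standalone(buf);   /* body shared via a call */
          }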
      
      Before:                                 Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm16: Avoid reloading the idct32 coefficients · 32e273c1
      Martin Storsjö authored
      Keep the idct32 coefficients in narrow form in q6-q7, and idct16
      coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
      q0-q3 in the pass1 function, and squeeze the idct16 coefficients
      into q0-q1 in the pass2 function to avoid reloading them.
      
      The idct16 coefficients are clobbered and reloaded within idct32_odd
      though, since that turns out to be faster than narrowing them and
      swapping them into q6-q7.
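
      A loose C analogy of the idea, with a placeholder table name; the real
      code manages the NEON q registers by hand:

          #include <stdint.h>

          extern const int16_t idct_coeffs[];   /* stands in for the real table */

          void run_all_slices(int32_t *buf, int slices)
          {
              /* Loaded once and kept live for the whole loop: the analogue of
               * holding the idct16 coefficients widened to 32 bit in q0-q3 and
               * the idct32 coefficients in narrow 16 bit form in q6-q7, rather
               * than reloading them from memory in every pass. */
              const int32_t c0 = idct_coeffs[0];
              const int32_t c1 = idct_coeffs[1];

              for (int i = 0; i < slices; i++)
                  buf[i] = buf[i] * c0 - buf[i] * c1;   /* dummy per-slice work */
          }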
      
      Before:                            Cortex       A7        A8        A9      A53
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
      After:
      vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • c1619318 (Martin Storsjö)
    • arm: vp9itxfm16: Use the right lane size · b46d37e9
      Martin Storsjö authored
      This makes the code slightly clearer, but doesn't make any functional
      difference.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  2. 24 Jan, 2017 (1 commit)
    • arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm · 2ed67eba
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is structured similarly to the 8 bit version. In the 8 bit
      version, the coefficients are 16 bits, and intermediates are 32 bits.
      
      Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
      content, the intermediates also fit in 32 bits, but for all other
      transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
      and 12 bit) the intermediates are 64 bit.
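
      A small C sketch of why the intermediates grow, assuming VP9's usual
      14 bit fixed-point transform constants; the helper name is illustrative:

          #include <stdint.h>

          /* Multiply a coefficient by a 14 bit fixed-point constant and round-
           * shift back down.  With 16 bit coefficients the product fits in
           * 32 bits; with 32 bit coefficients it needs a 64 bit intermediate. */
          static int32_t mul_round_shift(int32_t coeff, int32_t cospi)
          {
              int64_t prod = (int64_t)coeff * cospi;       /* 64 bit intermediate */
              return (int32_t)((prod + (1 << 13)) >> 14);  /* round and narrow */
          }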
      
      For the existing 8 bit case, the 8x8 transform fits all coefficients in
      registers; for 10/12 bit, where the coefficients are 32 bit, the 8x8
      transform also has to be done in slices of 4 pixels (just as the 16x16
      and 32x32 transforms are for 8 bit).
      
      The slice width also shrinks from 4 elements to 2 elements in parallel
      for the 16x16 and 32x32 cases.
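
      A C-level sketch of the slicing, with a hypothetical helper; the
      assembly works on NEON registers rather than array slices:

          #include <stdint.h>

          void idct16_cols(const int32_t *in, int32_t *tmp, int width);  /* hypothetical 1-D pass */

          void idct16_pass1_sliced(const int32_t *coeffs, int32_t *tmp)
          {
              /* With 32 bit intermediates only half as many columns fit in the
               * register file at once, so the first pass walks the 16x16 block
               * two columns per iteration (four per iteration for 8x8). */
              for (int slice = 0; slice < 16; slice += 2)
                  idct16_cols(coeffs + slice, tmp + slice, 2);
          }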
      
      The 16 bit coefficients from idct_coeffs and similar tables also need
      to be lengthened to 32 bit in order to be used in multiplications with
      vectors of 32 bit elements. This means the fixed coefficient vectors
      need more space, leading to more cases where they have to be reloaded
      within the transform (in iadst16).
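
      In plain C the lengthening amounts to the sketch below (in the assembly
      this corresponds to widening with vmovl.s16; names here are illustrative):

          #include <stdint.h>

          /* Sign-extend the 16 bit table entries to 32 bit so they can be
           * multiplied with 32 bit vector elements. */
          void widen_coeffs(const int16_t *coeffs16, int32_t *coeffs32, int n)
          {
              for (int i = 0; i < n; i++)
                  coeffs32[i] = coeffs16[i];
          }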
      
      This technically would need testing in checkasm for subpartitions
      in increments of 2, but that slows down normal checkasm runs
      excessively.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                           Cortex    A7     A8     A9    A53
      vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
      vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
      vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
      vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
      vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
      vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
      vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
      vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
      vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97
      
      The speedup compared to the C functions is around 1.3 to 7x for the
      full transforms, and even higher for the smaller subpartitions.
      Signed-off-by: Martin Storsjö <martin@martin.st>