1. 18 Oct, 2017 1 commit
  2. 20 Jun, 2017 1 commit
  3. 16 Mar, 2017 1 commit
  4. 11 Mar, 2017 1 commit
  5. 23 Feb, 2017 5 commits
    • aarch64: vp9itxfm: Reorder iadst16 coeffs · b8f66c08
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      They are in this order in the 16 bpp version to make sure they are
      accessed in the same order they are declared, which eases loading
      only half of the coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · 09eb88a1
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
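A minimal scalar sketch of the reordering idea described above. The layout and the zero filler are illustrative, not the actual idct coefficient table: the point is that parking the unused slot at index 1 keeps every used pair on an even/odd boundary, so no pair straddles two register halves.

```c
/* Hypothetical model of the coefficient reordering: with 16 slots and
 * one unused entry, placing the unused value in slot 1 (instead of
 * slot 15) keeps every used pair aligned to an even index, so a pair
 * never straddles two 8-element register halves. */
static void reorder_coeffs(const short in[16], short out[16])
{
    out[0] = in[0];          /* the only coefficient used on its own */
    out[1] = 0;              /* unused slot, moved here from the end */
    for (int i = 2; i < 16; i++)
        out[i] = in[i - 1];  /* remaining pairs shift up by one slot */
}
```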
    • aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 65aa002d
      Martin Storsjö authored
      The idct32x32 function actually pushed d8-d15 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      After this, we can still skip pushing d12-d15.
      
      Before:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 · 3bf9c483
      Martin Storsjö authored
      This is one cycle faster in total, and three instructions fewer.
      
      Before:
      vp9_loop_filter_mix2_v_44_16_neon: 123.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon: 122.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
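Scalar models of the two AArch64 NEON operations the new sequence relies on, to make the instruction swap concrete. These model only the byte-level semantics of rev16 and uzp1; the exact register usage in the vp9lpf code is not reproduced here.

```c
#include <stdint.h>

/* rev16: reverse the bytes within each 16-bit halfword of a vector. */
static void rev16_bytes(const uint8_t *in, uint8_t *out, int n)
{
    for (int i = 0; i < n; i += 2) {
        out[i]     = in[i + 1];
        out[i + 1] = in[i];
    }
}

/* uzp1: concatenate two vectors and keep the even-indexed elements. */
static void uzp1_bytes(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]         = a[2 * i];
        out[n / 2 + i] = b[2 * i];
    }
}
```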
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · c582cb85
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
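A scalar sketch of why the saturation is safe, assuming the usual VP9 edge-threshold test (|p0-q0|*2 + |p1-q1|/2 compared against E). Since E is at most 193, a sum that saturates to 255 is guaranteed to exceed E, so an unsigned saturating 8-bit add (uqadd on NEON) gives the same comparison result as widening to 16 bit:

```c
#include <stdint.h>

/* Model of uqadd on an 8-bit lane: unsigned saturating addition. */
static uint8_t uqadd_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Edge threshold test: |p0 - q0| * 2 + |p1 - q1| / 2 <= E.
 * Because E <= 193, saturating to 255 cannot turn a "greater than E"
 * sum into a "less than or equal to E" one, so 8 bits suffice. */
static int cmp_E(uint8_t p0, uint8_t q0, uint8_t p1, uint8_t q1, uint8_t E)
{
    uint8_t d0  = p0 > q0 ? p0 - q0 : q0 - p0;
    uint8_t d1  = p1 > q1 ? p1 - q1 : q1 - p1;
    uint8_t sum = uqadd_u8(uqadd_u8(d0, d0), d1 / 2);
    return sum <= E;
}
```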
  6. 12 Feb, 2017 1 commit
  7. 11 Feb, 2017 1 commit
  8. 10 Feb, 2017 4 commits
  9. 09 Feb, 2017 8 commits
  10. 05 Feb, 2017 1 commit
  11. 03 Jan, 2017 2 commits
  12. 19 Dec, 2016 1 commit
  13. 14 Dec, 2016 1 commit
  14. 30 Nov, 2016 1 commit
    • aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · cad42fad
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0
      
      By skipping individual 8x16 or 8x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8
      
      That is, there is in general a very minor overhead in the full
      subpartition case due to the additional cmps, but a significant
      speedup in the cases where only a small part of the actual input
      data needs to be processed.
      Signed-off-by: Martin Storsjö <martin@martin.st>
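A self-contained sketch of the skipping idea: an 8-wide slice of the first pass only needs to run if it contains any nonzero coefficients. The real asm derives this from eob and the scan order with a few cmps instead of rescanning the block; rescanning here just keeps the example standalone.

```c
#include <stdint.h>

/* Returns a bitmask with bit i set if the i-th 8-wide column slice
 * contains nonzero coefficients and must be processed in the first
 * pass; clear bits mark slices that can be skipped entirely. */
static unsigned slices_to_process(const int16_t *coeffs, int size)
{
    unsigned mask = 0;
    for (int col = 0; col < size; col += 8)
        for (int x = col; x < col + 8; x++)
            for (int y = 0; y < size; y++)
                if (coeffs[y * size + x] != 0)
                    mask |= 1u << (col / 8);
    return mask;
}
```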
  15. 24 Nov, 2016 1 commit
  16. 23 Nov, 2016 1 commit
  17. 18 Nov, 2016 1 commit
  18. 16 Nov, 2016 2 commits
  19. 14 Nov, 2016 1 commit
  20. 13 Nov, 2016 2 commits
    • aarch64: vp9: Implement NEON loop filters · 9d2afd1e
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the loop filters on
      16 pixels at a time. The implementation is fully templated, with
      a single macro that can generate versions for both 8 and 16 pixel
      widths, for the 4, 8 and 16 pixel loop filters (and the 4/8 mixed
      versions as well).
      
      For the 8 pixel wide versions, the speed is pretty close to the
      ARM version (the v_4_8 and v_8_8 filters are the best examples of
      this; the h_4_8 and h_8_8 filters seem to gain some in the
      load/transpose/store part). For the 16 pixel wide ones, we get a
      speedup of around 1.2-1.4x compared to the 32 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
      vp9_loop_filter_h_4_8_neon:          144.0   127.2
      vp9_loop_filter_h_8_8_neon:          207.0   182.5
      vp9_loop_filter_h_16_8_neon:         415.0   328.7
      vp9_loop_filter_h_16_16_neon:        672.0   558.6
      vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
      vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
      vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
      vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
      vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
      vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
      vp9_loop_filter_v_4_8_neon:           89.0    88.7
      vp9_loop_filter_v_8_8_neon:          141.0   137.7
      vp9_loop_filter_v_16_8_neon:         295.0   272.7
      vp9_loop_filter_v_16_16_neon:        546.0   453.7
      
      The speedup vs C code in checkasm tests is around 2-7x, pretty
      much the same as for the 32 bit version. Even though these
      functions are faster than their 32 bit equivalents, the C version
      they are compared against also became around 1.3-1.7x faster than
      the 32 bit C version.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-5x.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                               A57 gcc-5.3  neon
      loop_filter_h_4_8_neon:        256.6  93.4
      loop_filter_h_8_8_neon:        307.3 139.1
      loop_filter_h_16_8_neon:       340.1 254.1
      loop_filter_h_16_16_neon:      827.0 407.9
      loop_filter_mix2_h_44_16_neon: 524.5 155.4
      loop_filter_mix2_h_48_16_neon: 644.5 173.3
      loop_filter_mix2_h_84_16_neon: 630.5 222.0
      loop_filter_mix2_h_88_16_neon: 697.3 222.0
      loop_filter_mix2_v_44_16_neon: 598.5 100.6
      loop_filter_mix2_v_48_16_neon: 651.5 127.0
      loop_filter_mix2_v_84_16_neon: 591.5 167.1
      loop_filter_mix2_v_88_16_neon: 855.1 166.7
      loop_filter_v_4_8_neon:        271.7  65.3
      loop_filter_v_8_8_neon:        312.5 106.9
      loop_filter_v_16_8_neon:       473.3 206.5
      loop_filter_v_16_16_neon:      976.1 327.8
      
      The speedup compared to the C functions is 2.5-6x, and the
      Cortex-A57 is again 30-50% faster than the Cortex-A53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
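The single-macro templating convention can be rendered in C roughly like this: one body, instantiated once per filter width, the way the asm expands the 8- and 16-pixel-wide variants from one source. The body below is a placeholder (a plain sum over `wd` pixels), not an actual loop filter; only the instantiation pattern is the point.

```c
#include <stdint.h>

/* One macro body generates a function per width; the asm version
 * similarly generates the 4/8/16 and mixed 4/8 filter variants for
 * both 8 and 16 pixel widths from a single templated source. */
#define DEFINE_LOOP_FILTER(wd)                                    \
    static int loop_filter_##wd(const uint8_t *px)                \
    {                                                             \
        int sum = 0;                                              \
        for (int i = 0; i < (wd); i++) /* wd pixels at a time */  \
            sum += px[i];                                         \
        return sum;                                               \
    }

DEFINE_LOOP_FILTER(8)
DEFINE_LOOP_FILTER(16)
```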
    • aarch64: vp9: Add NEON itxfm routines · 3c9546df
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the 16x16 and 32x32
      transforms in slices 8 pixels wide instead of 4. This gives
      a speedup of around 1.4x compared to the 32 bit version.
      
      The fact that aarch64 doesn't have the same d/q register
      aliasing makes some of the macros quite a bit simpler as well.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM  AArch64
      vp9_inv_adst_adst_4x4_add_neon:       90.0     87.7
      vp9_inv_adst_adst_8x8_add_neon:      400.0    354.7
      vp9_inv_adst_adst_16x16_add_neon:   2526.5   1827.2
      vp9_inv_dct_dct_4x4_add_neon:         74.0     72.7
      vp9_inv_dct_dct_8x8_add_neon:        271.0    256.7
      vp9_inv_dct_dct_16x16_add_neon:     1960.7   1372.7
      vp9_inv_dct_dct_32x32_add_neon:    11988.9   8088.3
      vp9_inv_wht_wht_4x4_add_neon:         63.0     57.7
      
      The speedup vs C code (2-4x) is smaller than in the 32 bit case,
      mostly because the C code ends up significantly faster (around
      1.6x faster, with GCC 5.4) when built for aarch64.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                                      A57 gcc-5.3   neon
      vp9_inv_adst_adst_4x4_add_neon:       152.2   60.0
      vp9_inv_adst_adst_8x8_add_neon:       948.2  288.0
      vp9_inv_adst_adst_16x16_add_neon:    4830.4 1380.5
      vp9_inv_dct_dct_4x4_add_neon:         153.0   58.6
      vp9_inv_dct_dct_8x8_add_neon:         789.2  180.2
      vp9_inv_dct_dct_16x16_add_neon:      3639.6  917.1
      vp9_inv_dct_dct_32x32_add_neon:     20462.1 4985.0
      vp9_inv_wht_wht_4x4_add_neon:          91.0   49.8
      
      The asm is around 3-4x faster than C on the Cortex-A57, and the
      asm on the A57 is around 30-50% faster than on the A53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
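A structural sketch of the two-pass, sliced transform layout: the first pass runs the 1-D transform down each 8-wide column slice (transposing into a temporary buffer), the second pass runs it across each 8-tall row slice. The 1-D stage here is an identity placeholder, not a real IDCT; only the slicing structure mirrors the asm.

```c
#include <stdint.h>

/* Two-pass NxN transform processed in 8-wide slices (N up to 32).
 * With an identity 1-D stage, transpose + transpose restores the
 * input; in the real code each pass applies the idct to its slice. */
static void transform_2d_sliced(int16_t *buf, int size)
{
    int16_t tmp[32 * 32];
    /* first pass: column slices, 8 at a time, transposed into tmp */
    for (int col = 0; col < size; col += 8)
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < size; y++)
                tmp[(col + x) * size + y] = buf[y * size + col + x];
    /* second pass: row slices of tmp, 8 at a time, transposed back */
    for (int row = 0; row < size; row += 8)
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < size; y++)
                buf[y * size + row + x] = tmp[(row + x) * size + y];
}
```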
  21. 10 Nov, 2016 2 commits
    • aarch64: vp9: Add NEON optimizations of VP9 MC functions · 383d96aa
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; it is essentially a 1:1
      port with no extra added features, but with some hand tuning
      (especially for the plain copy/avg functions). The ARM version
      isn't very register starved to begin with, so there's not much
      to be gained from having more spare registers here; we only
      avoid having to clobber callee-saved registers.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM   AArch64
      vp9_avg4_neon:                      27.2      23.7
      vp9_avg8_neon:                      56.5      54.7
      vp9_avg16_neon:                    169.9     167.4
      vp9_avg32_neon:                    585.8     585.2
      vp9_avg64_neon:                   2460.3    2294.7
      vp9_avg_8tap_smooth_4h_neon:       132.7     125.2
      vp9_avg_8tap_smooth_4hv_neon:      478.8     442.0
      vp9_avg_8tap_smooth_4v_neon:       126.0      93.7
      vp9_avg_8tap_smooth_8h_neon:       241.7     234.2
      vp9_avg_8tap_smooth_8hv_neon:      690.9     646.5
      vp9_avg_8tap_smooth_8v_neon:       245.0     205.5
      vp9_avg_8tap_smooth_64h_neon:    11273.2   11280.1
      vp9_avg_8tap_smooth_64hv_neon:   22980.6   22184.1
      vp9_avg_8tap_smooth_64v_neon:    11549.7   10781.1
      vp9_put4_neon:                      18.0      17.2
      vp9_put8_neon:                      40.2      37.7
      vp9_put16_neon:                     97.4      99.5
      vp9_put32_neon/armv8:              346.0     307.4
      vp9_put64_neon/armv8:             1319.0    1107.5
      vp9_put_8tap_smooth_4h_neon:       126.7     118.2
      vp9_put_8tap_smooth_4hv_neon:      465.7     434.0
      vp9_put_8tap_smooth_4v_neon:       113.0      86.5
      vp9_put_8tap_smooth_8h_neon:       229.7     221.6
      vp9_put_8tap_smooth_8hv_neon:      658.9     621.3
      vp9_put_8tap_smooth_8v_neon:       215.0     187.5
      vp9_put_8tap_smooth_64h_neon:    10636.7   10627.8
      vp9_put_8tap_smooth_64hv_neon:   21076.8   21026.9
      vp9_put_8tap_smooth_64v_neon:     9635.0    9632.4
      
      These are generally about as fast as the corresponding ARM
      routines on the same CPU (at least on the A53), in most cases
      marginally faster.
      
      The speedup vs C code is pretty much the same as for the 32 bit
      case; on the A53 it's around 6-13x for the larger 8tap filters.
      The exact speedup varies a little, since the C versions generally
      don't end up exactly as slow/fast as on 32 bit.
      Signed-off-by: Martin Storsjö <martin@martin.st>
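A scalar model of the put/avg distinction mentioned above, assuming the standard VP9 behavior: the put functions store the filtered prediction, while the avg functions round-average it with what is already in the destination (urhadd on NEON).

```c
#include <stdint.h>

/* One pixel of the avg variant: rounding average of the existing
 * destination pixel and the new prediction, as urhadd computes. */
static uint8_t avg_px(uint8_t dst, uint8_t src)
{
    return (uint8_t)((dst + src + 1) >> 1);
}
```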
  22. 09 Nov, 2016 1 commit