  1. 13 Nov, 2016 2 commits
    • aarch64: vp9: Implement NEON loop filters · 9d2afd1e
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the loop filters with
      16 pixels at a time. The implementation is fully templated, with
      a single macro which can generate versions for both 8 and
      16 pixels wide, for the 4, 8 and 16 pixel loop filters
      (and the 4/8 mixed versions as well).
      
      For the 8 pixel wide versions, it is pretty close in speed to the
      32 bit version (the v_4_8 and v_8_8 filters are the best examples
      of this; the h_4_8 and h_8_8 filters seem to get some gain in the
      load/transpose/store part). For the 16 pixels wide ones, we get a
      speedup of around 1.2-1.4x compared to the 32 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
      vp9_loop_filter_h_4_8_neon:          144.0   127.2
      vp9_loop_filter_h_8_8_neon:          207.0   182.5
      vp9_loop_filter_h_16_8_neon:         415.0   328.7
      vp9_loop_filter_h_16_16_neon:        672.0   558.6
      vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
      vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
      vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
      vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
      vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
      vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
      vp9_loop_filter_v_4_8_neon:           89.0    88.7
      vp9_loop_filter_v_8_8_neon:          141.0   137.7
      vp9_loop_filter_v_16_8_neon:         295.0   272.7
      vp9_loop_filter_v_16_16_neon:        546.0   453.7
      
      The speedup vs C code in checkasm tests is around 2-7x, which is
      pretty much the same as for the 32 bit version. Although these functions
      are faster than their 32 bit equivalents, the C code that we compare
      to is also around 1.3-1.7x faster than the same C code built for 32 bit.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-5x.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                               A57 gcc-5.3  neon
      loop_filter_h_4_8_neon:        256.6  93.4
      loop_filter_h_8_8_neon:        307.3 139.1
      loop_filter_h_16_8_neon:       340.1 254.1
      loop_filter_h_16_16_neon:      827.0 407.9
      loop_filter_mix2_h_44_16_neon: 524.5 155.4
      loop_filter_mix2_h_48_16_neon: 644.5 173.3
      loop_filter_mix2_h_84_16_neon: 630.5 222.0
      loop_filter_mix2_h_88_16_neon: 697.3 222.0
      loop_filter_mix2_v_44_16_neon: 598.5 100.6
      loop_filter_mix2_v_48_16_neon: 651.5 127.0
      loop_filter_mix2_v_84_16_neon: 591.5 167.1
      loop_filter_mix2_v_88_16_neon: 855.1 166.7
      loop_filter_v_4_8_neon:        271.7  65.3
      loop_filter_v_8_8_neon:        312.5 106.9
      loop_filter_v_16_8_neon:       473.3 206.5
      loop_filter_v_16_16_neon:      976.1 327.8
      
      The speedup compared to the C functions is 2.5x to 6x, and the
      cortex-a57 is again 30-50% faster than the cortex-a53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
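The "1.2-1.4x" figure for the 16 pixel wide filters can be checked against the Cortex A53 table above by dividing the ARM cycle count by the AArch64 one. The following is only a minimal sketch of that arithmetic: the struct and function names (`bench_row`, `speedup`, `mean_speedup`) are hypothetical helpers invented for illustration, and only a few rows of the table are copied in.

```c
#include <stddef.h>

/* One row of a benchmark table: cycle counts for the same function
 * built for 32 bit ARM and for AArch64. Hypothetical helper type. */
struct bench_row {
    const char *name;
    double arm;
    double aarch64;
};

/* A few 16 pixel wide rows copied from the Cortex A53 table above. */
static const struct bench_row rows_16[] = {
    { "vp9_loop_filter_h_16_16_neon",      672.0, 558.6 },
    { "vp9_loop_filter_mix2_h_44_16_neon", 302.0, 203.5 },
    { "vp9_loop_filter_mix2_v_88_16_neon", 302.0, 218.2 },
    { "vp9_loop_filter_v_16_16_neon",      546.0, 453.7 },
};

/* Speedup is old time divided by new time; above 1 means AArch64 is faster. */
static double speedup(const struct bench_row *r)
{
    return r->arm / r->aarch64;
}

/* Arithmetic mean of the per-function speedups; 0 for an empty table. */
static double mean_speedup(const struct bench_row *rows, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += speedup(&rows[i]);
    return n ? sum / (double)n : 0.0;
}
```

Calling `mean_speedup(rows_16, 4)` on these four rows gives a value a little above 1.3, which is consistent with the range quoted in the commit message.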
    • aarch64: vp9: Add NEON itxfm routines · 3c9546df
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the 16x16 and 32x32
      transforms in slices 8 pixels wide instead of 4. This gives
      a speedup of around 1.4x compared to the 32 bit version.
      
      The fact that aarch64 doesn't have the same d/q register
      aliasing makes some of the macros quite a bit simpler as well.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM  AArch64
      vp9_inv_adst_adst_4x4_add_neon:       90.0     87.7
      vp9_inv_adst_adst_8x8_add_neon:      400.0    354.7
      vp9_inv_adst_adst_16x16_add_neon:   2526.5   1827.2
      vp9_inv_dct_dct_4x4_add_neon:         74.0     72.7
      vp9_inv_dct_dct_8x8_add_neon:        271.0    256.7
      vp9_inv_dct_dct_16x16_add_neon:     1960.7   1372.7
      vp9_inv_dct_dct_32x32_add_neon:    11988.9   8088.3
      vp9_inv_wht_wht_4x4_add_neon:         63.0     57.7
      
      The speedup vs C code (2-4x) is smaller than in the 32 bit case,
      mostly because the C code ends up significantly faster (around
      1.6x faster, with GCC 5.4) when built for aarch64.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                                      A57 gcc-5.3   neon
      vp9_inv_adst_adst_4x4_add_neon:       152.2   60.0
      vp9_inv_adst_adst_8x8_add_neon:       948.2  288.0
      vp9_inv_adst_adst_16x16_add_neon:    4830.4 1380.5
      vp9_inv_dct_dct_4x4_add_neon:         153.0   58.6
      vp9_inv_dct_dct_8x8_add_neon:         789.2  180.2
      vp9_inv_dct_dct_16x16_add_neon:      3639.6  917.1
      vp9_inv_dct_dct_32x32_add_neon:     20462.1 4985.0
      vp9_inv_wht_wht_4x4_add_neon:          91.0   49.8
      
      The asm is around 3-4x faster than C on the cortex-a57, and the asm
      is around 30-50% faster on the a57 compared to the a53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  2. 10 Nov, 2016 1 commit
    • aarch64: vp9: Add NEON optimizations of VP9 MC functions · 383d96aa
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; it is essentially a 1:1
      port with no extra added features, but with some hand tuning
      (especially for the plain copy/avg functions). The ARM version
      isn't very register starved to begin with, so there's not much
      to be gained from having more spare registers here - we only
      avoid having to clobber callee-saved registers.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM   AArch64
      vp9_avg4_neon:                      27.2      23.7
      vp9_avg8_neon:                      56.5      54.7
      vp9_avg16_neon:                    169.9     167.4
      vp9_avg32_neon:                    585.8     585.2
      vp9_avg64_neon:                   2460.3    2294.7
      vp9_avg_8tap_smooth_4h_neon:       132.7     125.2
      vp9_avg_8tap_smooth_4hv_neon:      478.8     442.0
      vp9_avg_8tap_smooth_4v_neon:       126.0      93.7
      vp9_avg_8tap_smooth_8h_neon:       241.7     234.2
      vp9_avg_8tap_smooth_8hv_neon:      690.9     646.5
      vp9_avg_8tap_smooth_8v_neon:       245.0     205.5
      vp9_avg_8tap_smooth_64h_neon:    11273.2   11280.1
      vp9_avg_8tap_smooth_64hv_neon:   22980.6   22184.1
      vp9_avg_8tap_smooth_64v_neon:    11549.7   10781.1
      vp9_put4_neon:                      18.0      17.2
      vp9_put8_neon:                      40.2      37.7
      vp9_put16_neon:                     97.4      99.5
      vp9_put32_neon/armv8:              346.0     307.4
      vp9_put64_neon/armv8:             1319.0    1107.5
      vp9_put_8tap_smooth_4h_neon:       126.7     118.2
      vp9_put_8tap_smooth_4hv_neon:      465.7     434.0
      vp9_put_8tap_smooth_4v_neon:       113.0      86.5
      vp9_put_8tap_smooth_8h_neon:       229.7     221.6
      vp9_put_8tap_smooth_8hv_neon:      658.9     621.3
      vp9_put_8tap_smooth_8v_neon:       215.0     187.5
      vp9_put_8tap_smooth_64h_neon:    10636.7   10627.8
      vp9_put_8tap_smooth_64hv_neon:   21076.8   21026.9
      vp9_put_8tap_smooth_64v_neon:     9635.0    9632.4
      
      These are generally about as fast as the corresponding ARM
      routines on the same CPU (at least on the A53), in most cases
      marginally faster.
      
      The speedup vs C code is pretty much the same as for the 32 bit
      case; on the A53 it's around 6-13x for the larger 8tap filters.
      The exact speedup varies a little, since the C versions generally
      don't end up exactly as slow/fast as on 32 bit.
      Signed-off-by: Martin Storsjö <martin@martin.st>
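For readers unfamiliar with the "plain copy/avg functions" mentioned above, here is a scalar sketch of what a put and an avg motion compensation routine compute for 8 bit pixels. It assumes that avg is a per-pixel rounding average of the existing destination and the source; the names `sketch_put` and `sketch_avg` and their signatures are hypothetical and do not match the actual functions in the codebase, which are fixed-width and written in NEON assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* put: copy a w x h block of 8 bit pixels from src to dst.
 * Hypothetical scalar sketch, not the project's real signature. */
static void sketch_put(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       int w, int h)
{
    for (int y = 0; y < h; y++) {
        memcpy(dst, src, (size_t)w);
        dst += dst_stride;
        src += src_stride;
    }
}

/* avg: replace each dst pixel with the rounded average of dst and src
 * (assumed rounding: add 1 before halving). */
static void sketch_avg(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((dst[x] + src[x] + 1) >> 1);
        dst += dst_stride;
        src += src_stride;
    }
}
```

Since both routines do very little arithmetic per pixel, they are mostly bound by loads and stores, which fits the observation that extra registers bring little gain here.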
  3. 07 Apr, 2016 1 commit
    • build: miscellaneous cosmetics · 01621202
      Diego Biurrun authored
      Restore alphabetical order in lists, break overly long lines, do some
      prettyprinting, add some explanatory section comments, group parts
      together that belong together logically.
  4. 01 Mar, 2016 1 commit
  5. 14 Dec, 2015 3 commits
    • arm64: int32_to_float_fmul neon asm · a0fc780a
      Janne Grunau authored
      3% faster dts decoding on a cortex-a57.
      
                                       cortex-a57   cortex-a53
      int32_to_float_fmul_array8_c:    1270.9       4475.6
      int32_to_float_fmul_array8_neon:  328.6        569.2
      int32_to_float_fmul_scalar_c:     928.5       4119.6
      int32_to_float_fmul_scalar_neon:  309.1        524.1
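As a rough illustration of what the two benchmarked routines compute, the scalar sketch below converts 32 bit integer samples to float while multiplying by a scale factor, either one factor for the whole buffer or one factor per block of 8 samples. This is an assumption-laden sketch: the `sketch_` names are hypothetical, the real functions take a context argument that is omitted here, and `len` is assumed to be a multiple of 8 in the array8 variant.

```c
#include <stdint.h>

/* Multiply each 32 bit integer sample by a single float scale factor. */
static void sketch_int32_to_float_fmul_scalar(float *dst, const int32_t *src,
                                              float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * mul;
}

/* Same conversion, but with a separate scale factor for each block of
 * 8 samples. len is assumed to be a multiple of 8. */
static void sketch_int32_to_float_fmul_array8(float *dst, const int32_t *src,
                                              const float *mul, int len)
{
    for (int i = 0; i < len; i += 8)
        sketch_int32_to_float_fmul_scalar(&dst[i], &src[i], *mul++, 8);
}
```

A loop like this is a natural fit for SIMD, since every output depends on exactly one input and one scale factor.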
    • arm64: port synth_filter_float_neon from arm · 705f5e5e
      Janne Grunau authored
      ~25% faster dts decoding overall. The checkasm CPU cycles numbers are
      not that useful since synth_filter_float() calls FFTContext.imdct_half().
      
                               cortex-a57   cortex-a53
      synth_filter_float_c:    1866.2       3490.9
      synth_filter_float_neon:  915.0       1531.5
      
      With fftc.imdct_half forced to imdct_half_neon:
                               cortex-a57   cortex-a53
      synth_filter_float_c:    1718.4       3025.3
      synth_filter_float_neon:  926.2       1530.1
    • arm64: convert dcadsp neon asm from arm · c33c1fa8
      Janne Grunau authored
      ~2% faster dts decoding overall.
      
                          cortex-a57   cortex-a53
      dca_decode_hf_c:    474.8        1659.9
      dca_decode_hf_neon: 225.2         301.1
      dca_lfe_fir0_c:     913.2        1537.7
      dca_lfe_fir0_neon:  286.8         451.9
      dca_lfe_fir1_c:     848.7        1711.5
      dca_lfe_fir1_neon:  387.1         506.4
  6. 20 Jul, 2015 1 commit
  7. 02 Feb, 2015 1 commit
  8. 15 May, 2014 1 commit
  9. 22 Apr, 2014 4 commits
  10. 06 Apr, 2014 1 commit
  11. 20 Mar, 2014 1 commit
  12. 15 Jan, 2014 6 commits