1. 24 Jan, 2017 1 commit
    • aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter · 9f10cff6
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is similar to the arm version, but due to the larger registers
      on aarch64, we can do 8 pixels at a time for all filter sizes.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                   ARM AArch64
      vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
      vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
      vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
      vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
      vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
      vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
      vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
      vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
      vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
      vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
      vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
      vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
      vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
      vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
      vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
      vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7
      
      This is significantly faster than the ARM version in all cases
      except the mix2 functions, where the gain is only around 5-10%.
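The comparison above can be made concrete by computing the ARM/AArch64 runtime ratio for each function. The following is a minimal sketch, not part of the commit: the numbers are copied from the table, the shortened dictionary keys and the `speedup` helper are illustrative names, and the ratio is unitless because both columns use the same (unstated) unit.

```python
# Runtimes copied from the table above as (ARM, AArch64) pairs.
# Keys are shortened from vp9_loop_filter_<key>_10bpp_neon for readability.
runtimes = {
    "h_4_8": (213.2, 172.6),
    "h_8_8": (281.2, 244.2),
    "h_16_8": (657.0, 444.5),
    "h_16_16": (1280.4, 877.7),
    "mix2_h_44_16": (397.7, 358.0),
    "mix2_h_48_16": (465.7, 429.0),
    "mix2_h_84_16": (465.7, 428.0),
    "mix2_h_88_16": (533.7, 499.0),
    "mix2_v_44_16": (271.5, 244.0),
    "mix2_v_48_16": (330.0, 305.0),
    "mix2_v_84_16": (329.0, 306.0),
    "mix2_v_88_16": (386.0, 365.0),
    "v_4_8": (150.0, 115.2),
    "v_8_8": (209.0, 175.5),
    "v_16_8": (492.7, 345.2),
    "v_16_16": (951.0, 682.7),
}


def speedup(arm, aarch64):
    """Return how many times faster the AArch64 version is than the ARM one."""
    return arm / aarch64


ratios = {name: speedup(*pair) for name, pair in runtimes.items()}

# Print the functions from smallest to largest gain.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"vp9_loop_filter_{name}_10bpp_neon: {ratio:.2f}x")
```

Sorted this way, the mix2 functions cluster at the low end of the list and the 16-wide filters at the high end, which matches the summary in the commit message.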
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 2-3x.

      Signed-off-by: Martin Storsjö <martin@martin.st>