    aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter · 9f10cff6
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    This is similar to the arm version, but due to the larger registers
    on aarch64, we can do 8 pixels at a time for all filter sizes.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM AArch64
    vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
    vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
    vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
    vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
    vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
    vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
    vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
    vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
    vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
    vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
    vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
    vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
    vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
    vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
    vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
    vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7
    
    This is significantly faster than the ARM version in almost
    all cases; the exception is the mix2 functions, where the
    gain is smaller.
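
As a rough illustration of that comparison, the sketch below computes the ARM/AArch64 runtime ratio for each row of the table. The figures are copied from the table; the shortened keys and the helper names `speedups` and `summarize` are made up for this example and are not part of the patch.

```python
# Runtimes copied from the table above: (32-bit ARM, AArch64).
# Only the ratios are used, so the unit of measurement does not matter here.
# Keys are the function names with the "vp9_loop_filter_" prefix and the
# "_10bpp_neon" suffix dropped (an abbreviation chosen for this sketch).
RUNTIMES = {
    "h_4_8": (213.2, 172.6),
    "h_8_8": (281.2, 244.2),
    "h_16_8": (657.0, 444.5),
    "h_16_16": (1280.4, 877.7),
    "mix2_h_44_16": (397.7, 358.0),
    "mix2_h_48_16": (465.7, 429.0),
    "mix2_h_84_16": (465.7, 428.0),
    "mix2_h_88_16": (533.7, 499.0),
    "mix2_v_44_16": (271.5, 244.0),
    "mix2_v_48_16": (330.0, 305.0),
    "mix2_v_84_16": (329.0, 306.0),
    "mix2_v_88_16": (386.0, 365.0),
    "v_4_8": (150.0, 115.2),
    "v_8_8": (209.0, 175.5),
    "v_16_8": (492.7, 345.2),
    "v_16_16": (951.0, 682.7),
}


def speedups(runtimes):
    """Return {name: arm_time / aarch64_time} for every entry."""
    return {name: arm / a64 for name, (arm, a64) in runtimes.items()}


def summarize(runtimes):
    """Return the (min, max) speedup for the mix2 and non-mix2 groups."""
    ratios = speedups(runtimes)
    mix2 = [r for name, r in ratios.items() if name.startswith("mix2")]
    other = [r for name, r in ratios.items() if not name.startswith("mix2")]
    return {"mix2": (min(mix2), max(mix2)), "other": (min(other), max(other))}


if __name__ == "__main__":
    for name, ratio in speedups(RUNTIMES).items():
        print(f"{name:<14} {ratio:.2f}x")
    print(summarize(RUNTIMES))
```

With these figures, every mix2 ratio comes out lower than every non-mix2 ratio, which matches the observation above.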
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 2-3x.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>