    aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter · 9f10cff6
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    This is similar to the arm version, but due to the larger number of
    registers on aarch64, we can do 8 pixels at a time for all filter sizes.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM AArch64
    vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
    vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
    vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
    vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
    vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
    vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
    vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
    vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
    vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
    vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
    vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
    vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
    vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
    vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
    vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
    vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7
    
    This is significantly faster than the ARM version in almost all
    cases; the mix2 functions see smaller gains (roughly 5-10%).
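
The comparison above can be checked with a little arithmetic. The following is a minimal sketch, not part of the commit: the struct, the helper `runtime_reduction_percent` and the function `print_reductions` are hypothetical names introduced for illustration, and the four rows are copied from the table in the commit message.

```c
#include <stdio.h>

/* One row of the benchmark table; values copied from the commit message. */
struct bench_row {
    const char *name;
    double arm;     /* runtime of the 32 bit ARM version */
    double aarch64; /* runtime of the AArch64 version */
};

/* Hypothetical helper: how much shorter the AArch64 runtime is, in percent. */
static double runtime_reduction_percent(double arm, double aarch64)
{
    return 100.0 * (arm - aarch64) / arm;
}

/* A few representative rows: two plain filters and two mix2 filters. */
static const struct bench_row rows[] = {
    { "vp9_loop_filter_h_16_8_10bpp_neon",       657.0, 444.5 },
    { "vp9_loop_filter_v_4_8_10bpp_neon",        150.0, 115.2 },
    { "vp9_loop_filter_mix2_h_88_16_10bpp_neon", 533.7, 499.0 },
    { "vp9_loop_filter_mix2_v_88_16_10bpp_neon", 386.0, 365.0 },
};

/* Print the relative runtime reduction for each selected row. */
static void print_reductions(void)
{
    size_t i;

    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-42s %5.1f%%\n", rows[i].name,
               runtime_reduction_percent(rows[i].arm, rows[i].aarch64));
}
```

Calling `print_reductions()` from a `main` function lists the reduction per row; the plain filters come out well ahead of the mix2 variants, which matches the remark above.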
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 2-3x.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
Makefile
asm-offsets.h
cabac.h
fft_init_aarch64.c
fft_neon.S
fmtconvert_init.c
fmtconvert_neon.S
h264chroma_init_aarch64.c
h264cmc_neon.S
h264dsp_init_aarch64.c
h264dsp_neon.S
h264idct_neon.S
h264pred_init.c
h264pred_neon.S
h264qpel_init_aarch64.c
h264qpel_neon.S
hpeldsp_init_aarch64.c
hpeldsp_neon.S
mdct_neon.S
mpegaudiodsp_init.c
mpegaudiodsp_neon.S
neon.S
neontest.c
rv40dsp_init_aarch64.c
synth_filter_init.c
synth_filter_neon.S
vc1dsp_init_aarch64.c
videodsp.S
videodsp_init.c
vorbisdsp_init.c
vorbisdsp_neon.S
vp9dsp_init.h
vp9dsp_init_10bpp_aarch64.c
vp9dsp_init_12bpp_aarch64.c
vp9dsp_init_16bpp_aarch64_template.c
vp9dsp_init_aarch64.c
vp9itxfm_16bpp_neon.S
vp9itxfm_neon.S
vp9lpf_16bpp_neon.S
vp9lpf_neon.S
vp9mc_16bpp_neon.S
vp9mc_neon.S