    aarch64: vp9: Implement NEON loop filters · f1212e47
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    These are ported from the ARM version; thanks to the larger
    number of registers available, we can do the loop filters with
    16 pixels at a time. The implementation is fully templated, with
    a single macro which can generate versions for both 8 and
    16 pixels wide, for 4, 8 and 16 pixel loop filters
    (and the 4/8 mixed versions as well).
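
To illustrate the templating idea only, here is a minimal C sketch in which one macro stamps out an 8 and a 16 pixel wide variant of the same routine. The filter body is a toy edge smoother, not the VP9 loop filter, and `DEFINE_TOY_EDGE_FILTER` and `toy_edge_filter_v_*` are hypothetical names; the actual patch does this with assembler macros.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Toy edge smoother, NOT the VP9 filter: for each of `wd` columns, if the
 * two pixels on either side of a horizontal edge differ by at most `limit`,
 * move each one a quarter of the difference toward the other. */
#define DEFINE_TOY_EDGE_FILTER(wd)                                          \
static void toy_edge_filter_v_##wd(uint8_t *dst, ptrdiff_t stride,          \
                                   int limit)                               \
{                                                                           \
    for (int x = 0; x < (wd); x++) {                                        \
        int p0 = dst[x - stride];  /* pixel just above the edge */          \
        int q0 = dst[x];           /* pixel just below the edge */          \
        if (abs(p0 - q0) <= limit) {                                        \
            int delta = (q0 - p0) / 4;                                      \
            dst[x - stride] = (uint8_t)(p0 + delta);                        \
            dst[x]          = (uint8_t)(q0 - delta);                        \
        }                                                                   \
    }                                                                       \
}

/* One macro, two widths: the same body is instantiated for 8 and 16 pixels. */
DEFINE_TOY_EDGE_FILTER(8)
DEFINE_TOY_EDGE_FILTER(16)
```

`dst` points at the first row below the edge, so `toy_edge_filter_v_16(buf + stride, stride, 10)` processes the edge between rows 0 and 1 of `buf`.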
    
    For the 8 pixel wide versions, the speed is pretty close to the
    32 bit version (the v_4_8 and v_8_8 filters are the best examples
    of this; the h_4_8 and h_8_8 filters seem to get some gain in the
    load/transpose/store part). For the 16 pixels wide ones, we get a
    speedup of around 1.2-1.4x compared to the 32 bit version.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM AArch64
    vp9_loop_filter_h_4_8_neon:          144.0   127.2
    vp9_loop_filter_h_8_8_neon:          207.0   182.5
    vp9_loop_filter_h_16_8_neon:         415.0   328.7
    vp9_loop_filter_h_16_16_neon:        672.0   558.6
    vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
    vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
    vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
    vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
    vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
    vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
    vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
    vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
    vp9_loop_filter_v_4_8_neon:           89.0    88.7
    vp9_loop_filter_v_8_8_neon:          141.0   137.7
    vp9_loop_filter_v_16_8_neon:         295.0   272.7
    vp9_loop_filter_v_16_16_neon:        546.0   453.7
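
The speedup figure is simply the ARM runtime divided by the AArch64 runtime. A minimal sketch of that arithmetic for a few of the 16 pixel wide rows above, with `struct timing`, `timings_16_wide` and `speedup` as hypothetical names chosen for illustration:

```c
#include <stddef.h>

/* One row of the table above: runtimes on a Cortex A53. */
struct timing {
    const char *name;
    double arm;      /* 32 bit ARM NEON runtime */
    double aarch64;  /* AArch64 NEON runtime */
};

/* A few of the 16 pixel wide rows, copied from the table. */
static const struct timing timings_16_wide[] = {
    { "h_16_16",      672.0, 558.6 },
    { "mix2_h_88_16", 376.0, 305.2 },
    { "mix2_v_88_16", 302.0, 218.2 },
    { "v_16_16",      546.0, 453.7 },
};

#define N_TIMINGS (sizeof(timings_16_wide) / sizeof(timings_16_wide[0]))

/* Speedup of AArch64 over ARM: values above 1.0 mean AArch64 is faster. */
static double speedup(const struct timing *t)
{
    return t->arm / t->aarch64;
}
```

Calling `speedup(&timings_16_wide[i])` for each `i` below `N_TIMINGS` gives the per-function ratio for these rows.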
    
    The speedup vs C code in checkasm tests is around 2-7x, which is
    pretty much the same as for the 32 bit version. Even though these
    functions are faster than their 32 bit equivalents, the C version
    that we compare to also became around 1.3-1.7x faster than the
    32 bit C version.
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 4-5x.
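
As a rough picture of this kind of measurement, here is a standalone sketch that wraps a single call in a start/stop pair. FFmpeg's real START_TIMER and STOP_TIMER macros live in libavutil/timer.h and accumulate statistics over many runs; the `SKETCH_*` macros and `workload` below are hypothetical stand-ins that use `clock()` so the example builds without FFmpeg.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins that mimic the shape of START_TIMER / STOP_TIMER;
 * they read clock() once before and once after the wrapped code. */
#define SKETCH_START_TIMER  clock_t sketch_tstart = clock();
#define SKETCH_STOP_TIMER(id)                                               \
    do {                                                                    \
        clock_t sketch_tend = clock();                                      \
        fprintf(stderr, "%s: %ld clock ticks\n", (id),                      \
                (long)(sketch_tend - sketch_tstart));                       \
    } while (0);

/* Placeholder workload standing in for one filter call. */
static unsigned workload(unsigned n)
{
    unsigned acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc += i * i;
    return acc;
}

/* Wrap only the call of interest, so setup code is not measured. */
static unsigned timed_workload(unsigned n)
{
    unsigned result;
    SKETCH_START_TIMER
    result = workload(n);
    SKETCH_STOP_TIMER("workload")
    return result;
}
```

A single `clock()` reading is far too coarse for one filter call; the point is only where the start and stop markers go relative to the function being measured.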
    
    Examples of runtimes vs C on a Cortex A57 (for a slightly older version
    of the patch):
                             A57 gcc-5.3  neon
    loop_filter_h_4_8_neon:        256.6  93.4
    loop_filter_h_8_8_neon:        307.3 139.1
    loop_filter_h_16_8_neon:       340.1 254.1
    loop_filter_h_16_16_neon:      827.0 407.9
    loop_filter_mix2_h_44_16_neon: 524.5 155.4
    loop_filter_mix2_h_48_16_neon: 644.5 173.3
    loop_filter_mix2_h_84_16_neon: 630.5 222.0
    loop_filter_mix2_h_88_16_neon: 697.3 222.0
    loop_filter_mix2_v_44_16_neon: 598.5 100.6
    loop_filter_mix2_v_48_16_neon: 651.5 127.0
    loop_filter_mix2_v_84_16_neon: 591.5 167.1
    loop_filter_mix2_v_88_16_neon: 855.1 166.7
    loop_filter_v_4_8_neon:        271.7  65.3
    loop_filter_v_8_8_neon:        312.5 106.9
    loop_filter_v_16_8_neon:       473.3 206.5
    loop_filter_v_16_16_neon:      976.1 327.8
    
    The speed-up compared to the C functions is 2.5-6x, and the Cortex A57
    is again 30-50% faster than the Cortex A53.
    
    This is an adapted cherry-pick from libav commits
    9d2afd1e and 31756abe.

    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libpostproc
libswresample
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
CONTRIBUTING.md
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL.md
LICENSE.md
MAINTAINERS
Makefile
README.md
RELEASE
arch.mak
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
cmdutils_opencl.c
common.mak
configure
ffmpeg.c
ffmpeg.h
ffmpeg_cuvid.c
ffmpeg_dxva2.c
ffmpeg_filter.c
ffmpeg_opt.c
ffmpeg_qsv.c
ffmpeg_vaapi.c
ffmpeg_vdpau.c
ffmpeg_videotoolbox.c
ffplay.c
ffprobe.c
ffserver.c
ffserver_config.c
ffserver_config.h
library.mak
version.sh