    aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter · 9f10cff6
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    This is similar to the arm version, but due to the larger registers
    on aarch64, we can do 8 pixels at a time for all filter sizes.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM AArch64
    vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
    vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
    vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
    vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
    vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
    vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
    vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
    vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
    vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
    vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
    vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
    vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
    vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
    vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
    vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
    vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7
    
    This is significantly faster than the ARM version in all cases
    except the mix2 functions, where the gain is only about 5-10%.
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 2-3x.
    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libpostproc
libswresample
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
CONTRIBUTING.md
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL.md
LICENSE.md
MAINTAINERS
Makefile
README.md
RELEASE
arch.mak
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
cmdutils_opencl.c
common.mak
configure
ffmpeg.c
ffmpeg.h
ffmpeg_cuvid.c
ffmpeg_dxva2.c
ffmpeg_filter.c
ffmpeg_opt.c
ffmpeg_qsv.c
ffmpeg_vaapi.c
ffmpeg_vdpau.c
ffmpeg_videotoolbox.c
ffplay.c
ffprobe.c
ffserver.c
ffserver_config.c
ffserver_config.h
library.mak
version.sh