    arm: vp9: Add NEON loop filters · dd299a2d
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    The implementation tries to have smart handling of cases
    where no pixels need the full filtering for the 8/16 width
    filters, skipping both calculation and writeback of the
    unmodified pixels in those cases. The actual effect of this
    is hard to test with checkasm though, since it tests the
    full filtering, and the benefit depends on how many filtered
    blocks use the shortcut.
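
The shortcut described above can be sketched in scalar C. This is only an illustration of the idea, not the NEON code from this commit: the helper names `flat_mask8` and `filter_edge8`, the flatness threshold of 1, and the smoothing taps are all assumptions made for the example. It builds a mask of rows that want the wider filter and returns early, with no calculation and no writeback, when the mask is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: bit i of the result is set when row i looks flat
 * on both sides of the edge. dst points at q0 of the first row, so
 * p0..p3 sit at dst[-1]..dst[-4]. The test is simplified for illustration. */
static unsigned flat_mask8(const uint8_t *dst, ptrdiff_t stride, int thresh)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        const uint8_t *row = dst + i * stride;
        int flat = 1;
        for (int k = 1; k < 4; k++) {
            /* compare p1..p3 against p0 and q1..q3 against q0 */
            if (abs(row[-1 - k] - row[-1]) > thresh ||
                abs(row[k] - row[0]) > thresh)
                flat = 0;
        }
        mask |= (unsigned)flat << i;
    }
    return mask;
}

/* Hypothetical driver: returns 1 if the wide path ran, 0 if it was skipped. */
static int filter_edge8(uint8_t *dst, ptrdiff_t stride)
{
    assert(dst != NULL);
    unsigned mask = flat_mask8(dst, stride, 1);
    if (!mask)
        return 0; /* early exit: nothing computed, nothing written back */

    for (int i = 0; i < 8; i++) {
        uint8_t *row = dst + i * stride;
        if (!(mask & (1u << i)))
            continue;
        /* Placeholder smoothing, NOT the real VP9 filter taps. */
        int p1 = row[-2], p0 = row[-1], q0 = row[0], q1 = row[1];
        row[-1] = (uint8_t)((p1 + 2 * p0 + q0 + 2) >> 2);
        row[0]  = (uint8_t)((p0 + 2 * q0 + q1 + 2) >> 2);
    }
    return 1;
}
```

Calling `filter_edge8(buf + 4, 8)` on an 8x8 block of steep gradients leaves the buffer untouched and returns 0, while a block that is flat on each side of the edge takes the wide path. A benchmark that always feeds the second kind of input never exercises the first branch, which is why the benefit is hard to see in checkasm.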
    
    Examples of relative speedup compared to the C version, from checkasm:
                              Cortex       A7     A8     A9    A53
    vp9_loop_filter_h_4_8_neon:          2.72   2.68   1.78   3.15
    vp9_loop_filter_h_8_8_neon:          2.36   2.38   1.70   2.91
    vp9_loop_filter_h_16_8_neon:         1.80   1.89   1.45   2.01
    vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
    vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
    vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
    vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
    vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
    vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
    vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
    vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
    vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
    vp9_loop_filter_v_4_8_neon:          4.53   4.47   3.31   6.05
    vp9_loop_filter_v_8_8_neon:          3.58   3.99   2.92   5.17
    vp9_loop_filter_v_16_8_neon:         3.40   3.50   2.81   4.68
    vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02
    
    The speedup vs C code is around 2-6x. The numbers are quite
    inconclusive though, since the checkasm test runs multiple filterings
    on top of each other, so later rounds might end up with different
    codepaths (different decisions on which filter to apply, based
    on input pixel differences). Disabling the early-exit in the asm
    doesn't give a fair comparison either though, since the C code
    only does the necessary calculations for each row.
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 4-9x.
    
    This is pretty similar in runtime to the corresponding routines
    in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
    vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
    to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
    and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
    and vertical is flipped between the libraries.)
    
    In order to have stable, comparable numbers, the early exits in both
    asm versions were disabled, forcing the full filtering codepath.
    
                               Cortex           A7      A8      A9     A53
    vp9_loop_filter_h_16_8_neon:             597.2   472.0   482.4   415.0
    libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
    vp9_loop_filter_v_16_8_neon:             500.2   422.5   429.7   295.0
    libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
    vp9_loop_filter_v_16_16_neon:            905.0   784.7   791.5   546.0
    libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2
    
    Our version is consistently faster on A7 and A53, marginally slower on
    A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
    three tests in this particular test run).
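
The per-core comparison above can be rechecked with a few lines of C. This is a small sketch that copies the timings from the table (lower is better) and counts, for each core, how many of the three function pairs favour this implementation. The struct layout and the names `time_ratio` and `count_faster` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

enum core { CORE_A7, CORE_A8, CORE_A9, CORE_A53, NB_CORES };

struct pair {
    const char *ours, *libvpx;
    double t_ours[NB_CORES], t_vpx[NB_CORES];
};

/* Timings copied from the table in this commit message. */
static const struct pair pairs[] = {
    { "vp9_loop_filter_h_16_8_neon", "vpx_lpf_vertical_16_neon",
      { 597.2, 472.0, 482.4, 415.0 }, { 626.0, 464.5, 470.7, 445.0 } },
    { "vp9_loop_filter_v_16_8_neon", "vpx_lpf_horizontal_edge_8_neon",
      { 500.2, 422.5, 429.7, 295.0 }, { 586.5, 414.5, 415.6, 383.2 } },
    { "vp9_loop_filter_v_16_16_neon", "vpx_lpf_horizontal_edge_16_neon",
      { 905.0, 784.7, 791.5, 546.0 }, { 1060.2, 751.7, 743.5, 685.2 } },
};
#define NB_PAIRS (sizeof(pairs) / sizeof(pairs[0]))

/* A ratio below 1.0 means our version took less time than libvpx. */
static double time_ratio(size_t pair, enum core core)
{
    assert(pair < NB_PAIRS && core < NB_CORES);
    return pairs[pair].t_ours[core] / pairs[pair].t_vpx[core];
}

/* Number of function pairs in which our version is faster on this core. */
static int count_faster(enum core core)
{
    int n = 0;
    for (size_t i = 0; i < NB_PAIRS; i++)
        n += time_ratio(i, core) < 1.0;
    return n;
}
```

With these numbers `count_faster` gives 3 for A7 and A53 and 0 for A8 and A9, matching the summary for this particular test run.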

    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL
LICENSE
Makefile
README
README.md
RELEASE
arch.mak
avconv.c
avconv.h
avconv_dxva2.c
avconv_filter.c
avconv_opt.c
avconv_qsv.c
avconv_vaapi.c
avconv_vda.c
avconv_vdpau.c
avplay.c
avprobe.c
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
common.mak
configure
library.mak
version.sh