    arm: vp9: Add NEON loop filters · 6bec60a6
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    The implementation tries to have smart handling of cases
    where no pixels need the full filtering for the 8/16 width
    filters, skipping both calculation and writeback of the
    unmodified pixels in those cases. The actual effect of this
    is hard to test with checkasm though, since it tests the
    full filtering, and the benefit depends on how many filtered
    blocks use the shortcut.
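
    The shortcut described above can be illustrated with a minimal scalar
    sketch in C. It is not the NEON code from this commit and not the real
    VP9 filter math: the names lane_is_flat, narrow_filter, wide_filter and
    filter_edge_lanes, the threshold rule and the filter taps are all
    hypothetical, chosen only to show the "does any lane need the wide
    filter?" decision made once per block of lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LANES 16

/* Each lane holds 8 pixels across the edge: p3 p2 p1 p0 | q0 q1 q2 q3.
 * Hypothetical flatness rule (NOT the VP9 one): a lane is flat when the
 * three outer pixels on each side are within `thresh` of p0 / q0. */
static int lane_is_flat(const uint8_t *px, int thresh)
{
    for (int i = 0; i < 3; i++) {
        if (abs(px[i] - px[3]) > thresh)
            return 0;
        if (abs(px[7 - i] - px[4]) > thresh)
            return 0;
    }
    return 1;
}

/* Narrow filter (illustrative): only p0 and q0 move towards each other. */
static void narrow_filter(uint8_t *px)
{
    int d = (px[4] - px[3]) / 4;
    px[3] = (uint8_t)(px[3] + d);
    px[4] = (uint8_t)(px[4] - d);
}

/* Wide filter (illustrative): 1-2-1 smoothing of p2..q2, so six pixels
 * are recomputed and written back instead of two. */
static void wide_filter(uint8_t *px)
{
    uint8_t in[8];
    memcpy(in, px, sizeof(in));
    for (int i = 1; i <= 6; i++)
        px[i] = (uint8_t)((in[i - 1] + 2 * in[i] + in[i + 1] + 2) >> 2);
}

/* Filters `lanes` lanes in place. Returns 0 if the shortcut was taken
 * (no lane needed the wide filter), 1 if the full path ran. */
static int filter_edge_lanes(uint8_t *buf, int stride, int lanes, int thresh)
{
    uint8_t flat[MAX_LANES];
    int any_flat = 0;

    assert(lanes > 0 && lanes <= MAX_LANES);

    for (int y = 0; y < lanes; y++) {
        flat[y] = (uint8_t)lane_is_flat(buf + y * stride, thresh);
        any_flat |= flat[y];
    }

    if (!any_flat) {
        /* Shortcut: no wide calculation, and the outer pixels are never
         * touched, so only p0/q0 are written back. */
        for (int y = 0; y < lanes; y++)
            narrow_filter(buf + y * stride);
        return 0;
    }

    /* Full path: a SIMD version would compute both results for every lane
     * and blend them with the mask; this scalar sketch selects per lane. */
    for (int y = 0; y < lanes; y++) {
        if (flat[y])
            wide_filter(buf + y * stride);
        else
            narrow_filter(buf + y * stride);
    }
    return 1;
}
```

    In this sketch the return value only exists so that a caller or test can
    observe which path ran; how often the shortcut triggers depends entirely
    on the input pixels.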
    
    Examples of relative speedup compared to the C version, from checkasm:
                              Cortex       A7     A8     A9    A53
    vp9_loop_filter_h_4_8_neon:          2.72   2.68   1.78   3.15
    vp9_loop_filter_h_8_8_neon:          2.36   2.38   1.70   2.91
    vp9_loop_filter_h_16_8_neon:         1.80   1.89   1.45   2.01
    vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
    vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
    vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
    vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
    vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
    vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
    vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
    vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
    vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
    vp9_loop_filter_v_4_8_neon:          4.53   4.47   3.31   6.05
    vp9_loop_filter_v_8_8_neon:          3.58   3.99   2.92   5.17
    vp9_loop_filter_v_16_8_neon:         3.40   3.50   2.81   4.68
    vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02
    
    The speedup vs C code is around 2-6x. The numbers are quite
    inconclusive though, since the checkasm test runs multiple filterings
    on top of each other, so later rounds might end up with different
    codepaths (different decisions on which filter to apply, based
    on input pixel differences). Disabling the early-exit in the asm
    doesn't give a fair comparison either though, since the C code
    only does the necessary calculations for each row.
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 4-9x.
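
    As a rough illustration of timing an individual function in isolation,
    the following is a portable sketch using the standard C clock() call. It
    is not the START_TIMER/STOP_TIMER mechanism itself; mean_cpu_seconds,
    speedup, bench_fn and count_call are hypothetical names introduced here,
    and clock() is far coarser than a cycle counter.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type for the code under test. */
typedef void (*bench_fn)(void *opaque);

/* Runs fn `runs` times and returns the mean CPU time per call in seconds.
 * clock() has limited resolution, so `runs` should be large in practice. */
static double mean_cpu_seconds(bench_fn fn, void *opaque, int runs)
{
    assert(runs > 0);

    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(opaque);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}

/* Speedup of an optimized routine relative to a reference routine. */
static double speedup(double ref_seconds, double opt_seconds)
{
    assert(opt_seconds > 0.0);
    return ref_seconds / opt_seconds;
}

/* Trivial stand-in workload: counts how often it was called. */
static void count_call(void *opaque)
{
    ++*(int *)opaque;
}
```

    Timing the C and asm versions of one function on the same input this way
    and passing both means to speedup() gives a ratio comparable in spirit to
    the figures quoted above.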
    
    This is pretty similar in runtime to the corresponding routines
    in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
    vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
    to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
    and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
    and vertical is flipped between the libraries.)
    
    In order to have stable, comparable numbers, the early exits in both
    asm versions were disabled, forcing the full filtering codepath.
    
                               Cortex           A7      A8      A9     A53
    vp9_loop_filter_h_16_8_neon:             597.2   472.0   482.4   415.0
    libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
    vp9_loop_filter_v_16_8_neon:             500.2   422.5   429.7   295.0
    libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
    vp9_loop_filter_v_16_16_neon:            905.0   784.7   791.5   546.0
    libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2
    
    Our version is consistently faster on A7 and A53, marginally slower on
    A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
    three tests in this particular test run).
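
    The comparison can be reproduced from the table with a small C sketch.
    The figures are copied from the table above (lower is faster); the
    helper names ratio, count_wins and print_summary are hypothetical and
    only serve this illustration.

```c
#include <assert.h>
#include <stdio.h>

enum { N_TESTS = 3, N_CORES = 4 };

static const char *const core_names[N_CORES] = { "A7", "A8", "A9", "A53" };

/* Rows: h_16_8, v_16_8, v_16_16. Columns: A7, A8, A9, A53. */
static const double ours[N_TESTS][N_CORES] = {
    { 597.2, 472.0, 482.4, 415.0 },
    { 500.2, 422.5, 429.7, 295.0 },
    { 905.0, 784.7, 791.5, 546.0 },
};

/* Rows: vertical_16, horizontal_edge_8, horizontal_edge_16. */
static const double libvpx[N_TESTS][N_CORES] = {
    {  626.0, 464.5, 470.7, 445.0 },
    {  586.5, 414.5, 415.6, 383.2 },
    { 1060.2, 751.7, 743.5, 685.2 },
};

/* libvpx time divided by our time; above 1.0 means ours is faster. */
static double ratio(int test, int core)
{
    assert(test >= 0 && test < N_TESTS && core >= 0 && core < N_CORES);
    return libvpx[test][core] / ours[test][core];
}

/* Number of tests (out of N_TESTS) in which our version is faster. */
static int count_wins(int core)
{
    int wins = 0;
    for (int t = 0; t < N_TESTS; t++)
        wins += ratio(t, core) > 1.0;
    return wins;
}

/* Prints one line per core with the three ratios and the win count. */
static void print_summary(void)
{
    for (int c = 0; c < N_CORES; c++)
        printf("%-3s  %.3f %.3f %.3f  faster in %d/%d\n", core_names[c],
               ratio(0, c), ratio(1, c), ratio(2, c),
               count_wins(c), N_TESTS);
}
```

    Calling print_summary() lists, per core, the ratio for each of the three
    function pairs; values close to 1.0 correspond to the "marginally
    slower" cases.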
    
    This is an adapted cherry-pick from libav commit
    dd299a2d.
    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
vp9lpf_neon.S 28.6 KB