    aarch64: vp9: Add NEON optimizations of VP9 MC functions · 383d96aa
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    These are ported from the ARM version; it is essentially a 1:1
    port with no extra added features, but with some hand tuning
    (especially for the plain copy/avg functions). The ARM version
    isn't very register starved to begin with, so there's not much
    to be gained from having more spare registers here - we only
    avoid having to clobber callee-saved registers.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                         ARM   AArch64
    vp9_avg4_neon:                      27.2      23.7
    vp9_avg8_neon:                      56.5      54.7
    vp9_avg16_neon:                    169.9     167.4
    vp9_avg32_neon:                    585.8     585.2
    vp9_avg64_neon:                   2460.3    2294.7
    vp9_avg_8tap_smooth_4h_neon:       132.7     125.2
    vp9_avg_8tap_smooth_4hv_neon:      478.8     442.0
    vp9_avg_8tap_smooth_4v_neon:       126.0      93.7
    vp9_avg_8tap_smooth_8h_neon:       241.7     234.2
    vp9_avg_8tap_smooth_8hv_neon:      690.9     646.5
    vp9_avg_8tap_smooth_8v_neon:       245.0     205.5
    vp9_avg_8tap_smooth_64h_neon:    11273.2   11280.1
    vp9_avg_8tap_smooth_64hv_neon:   22980.6   22184.1
    vp9_avg_8tap_smooth_64v_neon:    11549.7   10781.1
    vp9_put4_neon:                      18.0      17.2
    vp9_put8_neon:                      40.2      37.7
    vp9_put16_neon:                     97.4      99.5
    vp9_put32_neon/armv8:              346.0     307.4
    vp9_put64_neon/armv8:             1319.0    1107.5
    vp9_put_8tap_smooth_4h_neon:       126.7     118.2
    vp9_put_8tap_smooth_4hv_neon:      465.7     434.0
    vp9_put_8tap_smooth_4v_neon:       113.0      86.5
    vp9_put_8tap_smooth_8h_neon:       229.7     221.6
    vp9_put_8tap_smooth_8hv_neon:      658.9     621.3
    vp9_put_8tap_smooth_8v_neon:       215.0     187.5
    vp9_put_8tap_smooth_64h_neon:    10636.7   10627.8
    vp9_put_8tap_smooth_64hv_neon:   21076.8   21026.9
    vp9_put_8tap_smooth_64v_neon:     9635.0    9632.4
    
    These are generally about as fast as the corresponding ARM
    routines on the same CPU (at least on the A53), in most cases
    marginally faster.
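
The "marginally faster" claim can be checked by dividing the ARM column by the AArch64 column. The snippet below is only an illustrative sketch: the rows are copied from the table above, while the helper name `relative_speed` and the choice of rows are assumptions made for this example and are not part of the commit.

```python
# (ARM, AArch64) runtimes copied from a few rows of the table above.
runtimes = {
    "vp9_avg4_neon": (27.2, 23.7),
    "vp9_avg_8tap_smooth_4v_neon": (126.0, 93.7),
    "vp9_put16_neon": (97.4, 99.5),
    "vp9_put64_neon/armv8": (1319.0, 1107.5),
    "vp9_put_8tap_smooth_64h_neon": (10636.7, 10627.8),
}


def relative_speed(arm: float, aarch64: float) -> float:
    """Hypothetical helper: a value above 1.0 means AArch64 is faster."""
    return arm / aarch64


ratios = {name: relative_speed(arm, a64) for name, (arm, a64) in runtimes.items()}

for name, ratio in ratios.items():
    print(f"{name:32s} {ratio:.2f}x")
```

A ratio just above 1.0 corresponds to "marginally faster"; a row such as `vp9_put16_neon` comes out slightly below 1.0, which matches the "in most cases" qualifier.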
    
    The speedup vs C code is pretty much the same as for the 32 bit
    case; on the A53 it's around 6-13x for the larger 8tap filters.
    The exact speedup varies a little, since the C versions generally
    don't end up exactly as slow/fast as on 32 bit.

    Signed-off-by: Martin Storsjö <martin@martin.st>
vp9mc_neon.S 24.3 KB