    aarch64: vp8: Optimize vp8_idct_add_neon for aarch64 · 7e42d5f0
    Martin Storsjö authored
    The previous version was a pretty exact translation of the arm
    version. This version does do some unnecessary arithmetic (it does
    more operations on vectors that are only half filled; it does 4
    uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
    of packing data together (which could be done for free in the arm
    version).
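
For readers unfamiliar with the two instructions named above, the following is a minimal scalar sketch in C of what a uaddw followed by sqxtun computes per lane when the IDCT residual is added to the destination pixels. It is not the code from vp8dsp_neon.S; the names add_residual_clamped and add_block_4x4 are hypothetical, and it assumes the residual is small enough that the 16-bit sum does not wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of uaddw + sqxtun (hypothetical helper):
 * widen an unsigned 8-bit pixel, add a signed 16-bit residual, then
 * saturate the signed result into the unsigned range 0..255.
 * Assumes the sum fits in 16 bits, as it does for real IDCT output. */
static uint8_t add_residual_clamped(uint8_t pixel, int16_t residual)
{
    int sum = (int)pixel + (int)residual;
    if (sum < 0)
        return 0;
    if (sum > 255)
        return 255;
    return (uint8_t)sum;
}

/* Add a 4x4 block of residuals to dst, row by row (hypothetical helper).
 * stride is the distance in bytes between consecutive rows of dst. */
static void add_block_4x4(uint8_t *dst, const int16_t residual[16],
                          ptrdiff_t stride)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] =
                add_residual_clamped(dst[y * stride + x],
                                     residual[y * 4 + x]);
}
```

A 4x4 block has only four 8-bit pixels per row, so a vector holding a single row is half filled. That is why handling the rows separately costs 4 uaddw and 4 sqxtun rather than 2 of each, in exchange for less work packing rows together.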
    
    This gives a decent speedup on Cortex A53, a minor speedup on
    A72 and a very minor slowdown on Cortex A73.
    
    Before:             Cortex A53    A72    A73
    vp8_idct_add_neon:        79.7   67.5   65.0
    After:
    vp8_idct_add_neon:        67.7   64.8   66.7
    Signed-off-by: Martin Storsjö <martin@martin.st>
vp8dsp_neon.S 65.3 KB