    arm: vp9lpf: Implement the mix2_44 function with one single filter pass · a88db8b9
    Martin Storsjö authored
    For this case, where the filter reads 8 inputs but only changes 4 of
    them, we can fit all 16 pixels of each input into a q register, and
    still have enough temporary registers for doing the loop filter.
    
    The wd=8 filters would require too many temporary registers for
    processing all 16 pixels at once though.
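
The register argument above can be sketched as simple arithmetic. This is only an illustration, not the code from vp9lpf_neon.S: it assumes the 32-bit ARM NEON register file (16 q registers of 128 bits) and 8-bit pixels, and all names in it are hypothetical. It counts only the registers needed to hold the inputs. How many temporaries the actual filter uses is not stated in this message.

```c
#include <assert.h>

/* Assumptions for this sketch (not taken from the commit itself):
 * 32-bit ARM NEON exposes 16 q registers of 128 bits each, and the
 * 8-bit VP9 loop filter works on 8-bit pixels. */
enum {
    PIXEL_BITS     = 8,
    Q_REG_BITS     = 128,
    NUM_Q_REGS     = 16,
    INPUTS_READ    = 8,  /* inputs the wd=4 filter reads per edge */
    INPUTS_CHANGED = 4   /* inputs the wd=4 filter writes back */
};

/* Number of pixels that fit in one q register. */
static int pixels_per_q_reg(void)
{
    return Q_REG_BITS / PIXEL_BITS;
}

/* q registers needed to hold every input for `pixels` pixels along the edge. */
static int q_regs_for_inputs(int pixels)
{
    assert(pixels > 0);
    int per_reg = pixels_per_q_reg();
    int regs_per_input = (pixels + per_reg - 1) / per_reg; /* round up */
    return INPUTS_READ * regs_per_input;
}

/* q registers left over as temporaries once the inputs are loaded. */
static int q_regs_left_for_temps(int pixels)
{
    return NUM_Q_REGS - q_regs_for_inputs(pixels);
}
```

Under these assumptions, `q_regs_for_inputs(16)` is 8 and `q_regs_left_for_temps(16)` is 8, so half of the register file remains free while both 8-pixel halves of the mix2 edge are filtered in one pass.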
    
    Before:                          Cortex A7      A8     A9     A53
    vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
    After:
    vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
    
    This is cherrypicked from libav commit 575e31e9.
    Signed-off-by: Martin Storsjö <martin@martin.st>
vp9lpf_neon.S 36.4 KB