    arm: vp9itxfm: Avoid reloading the idct32 coefficients · 600f4c9b
    Martin Storsjö authored
    The idct32x32 function actually pushed q4-q7 onto the stack even
    though it didn't clobber them; there are plenty of registers that
    can be used to allow keeping all the idct coefficients in registers
    without having to reload different subsets of them at different
    stages in the transform.
    
    Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
    q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
    in the idct16 function), and the lanewise vmul needs a register in
    the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
    while doing idct16.
    
    Even with all these coefficients kept in registers, we can still skip
    pushing q7.
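    
    As a rough illustration of the register juggling described above, here is
    a hypothetical sketch in ARM NEON assembly. It is not the code from this
    commit: the register roles and the placement of the idct16 core are
    assumptions, chosen only to show the pattern.

    ```
    @ Hypothetical sketch; register roles are assumed, not taken from the patch.
    vpush       {q4-q6}             @ callee-saved registers in use; q7 stays untouched, so it is not pushed

    @ Park the coefficients held in q2-q3 before the idct16 core clobbers them.
    vmov        q4,  q2
    vmov        q5,  q3

    @ (idct16 core would run here: it clobbers q2-q3 but leaves q4-q7 alone)

    @ Move the coefficients back, because a 16-bit lanewise multiply can only
    @ take its scalar operand from d0-d7, i.e. q0-q3.
    vmov        q2,  q4
    vmov        q3,  q5
    vmull.s16   q12, d16, d4[0]     @ scalar d4[0] lives in q2, so this is encodable

    vpop        {q4-q6}
    ```

    The two pairs of vmov instructions replace what would otherwise be
    repeated loads of coefficient subsets from memory.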
    
    Before:                              Cortex A7       A8       A9      A53
    vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
    After:
    vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
    
    This is cherrypicked from libav commit
    402546a1.
    Signed-off-by: Martin Storsjö <martin@martin.st>
vp9itxfm_neon.S 61.7 KB