    aarch64: vp9: Add NEON itxfm routines · 3c9546df
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    These are ported from the ARM version; thanks to the larger
    number of registers available, we can do the 16x16 and 32x32
    transforms in slices 8 pixels wide instead of 4. This gives
    a speedup of around 1.4x compared to the 32 bit version.
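
The slicing idea can be illustrated with a toy sketch. It is not the NEON code from this commit: `transform_1d` is a hypothetical stand-in (a running sum), not the VP9 inverse DCT, and `transform_columns` is an illustrative helper. It only shows that a column pass over a 16x16 block gives the same result with half as many iterations when each slice covers 8 columns instead of 4.

```python
def transform_1d(column):
    """Hypothetical placeholder for a 1-D transform (a running sum, NOT the VP9 DCT)."""
    out = []
    total = 0
    for value in column:
        total += value
        out.append(total)
    return out


def transform_columns(block, slice_width):
    """Apply transform_1d to every column, handling slice_width columns per iteration."""
    n = len(block)
    assert n % slice_width == 0
    out = [[0] * n for _ in range(n)]
    slices = 0
    for x0 in range(0, n, slice_width):
        # A SIMD implementation would presumably hold this whole slice in
        # registers; here the columns are simply processed one after another.
        for x in range(x0, x0 + slice_width):
            col = transform_1d([block[y][x] for y in range(n)])
            for y in range(n):
                out[y][x] = col[y]
        slices += 1
    return out, slices


# Arbitrary test data for a 16x16 block.
block = [[(3 * y + x) % 7 for x in range(16)] for y in range(16)]
out4, slices4 = transform_columns(block, 4)
out8, slices8 = transform_columns(block, 8)
print(slices4, slices8)  # 4 2
print(out4 == out8)      # True
```

Wider slices mean fewer loop iterations per pass, and so less per-slice overhead, which is one plausible reading of where the gain comes from.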
    
    The fact that aarch64 doesn't have the same d/q register
    aliasing makes some of the macros quite a bit simpler as well.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM  AArch64
    vp9_inv_adst_adst_4x4_add_neon:       90.0     87.7
    vp9_inv_adst_adst_8x8_add_neon:      400.0    354.7
    vp9_inv_adst_adst_16x16_add_neon:   2526.5   1827.2
    vp9_inv_dct_dct_4x4_add_neon:         74.0     72.7
    vp9_inv_dct_dct_8x8_add_neon:        271.0    256.7
    vp9_inv_dct_dct_16x16_add_neon:     1960.7   1372.7
    vp9_inv_dct_dct_32x32_add_neon:    11988.9   8088.3
    vp9_inv_wht_wht_4x4_add_neon:         63.0     57.7
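
The "around 1.4x" figure can be checked against the table above. The sketch below copies the numbers verbatim and divides the ARM runtime by the AArch64 one. The `runtimes` and `speedups` names are illustrative only.

```python
# (ARM, AArch64) runtimes on a Cortex A53, copied from the table above.
runtimes = {
    "vp9_inv_adst_adst_4x4_add_neon": (90.0, 87.7),
    "vp9_inv_adst_adst_8x8_add_neon": (400.0, 354.7),
    "vp9_inv_adst_adst_16x16_add_neon": (2526.5, 1827.2),
    "vp9_inv_dct_dct_4x4_add_neon": (74.0, 72.7),
    "vp9_inv_dct_dct_8x8_add_neon": (271.0, 256.7),
    "vp9_inv_dct_dct_16x16_add_neon": (1960.7, 1372.7),
    "vp9_inv_dct_dct_32x32_add_neon": (11988.9, 8088.3),
    "vp9_inv_wht_wht_4x4_add_neon": (63.0, 57.7),
}

# Speedup of the AArch64 version relative to the 32 bit ARM version.
speedups = {name: arm / a64 for name, (arm, a64) in runtimes.items()}

for name, ratio in speedups.items():
    print(f"{name:34s} {ratio:.2f}x")
```

The 16x16 and 32x32 functions, which are the ones that use the wider slices, come out at roughly 1.4x to 1.5x. The 4x4 and 8x8 functions gain much less.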
    
    The speedup vs C code (2-4x) is smaller than in the 32 bit case,
    mostly because the C code ends up significantly faster (around
    1.6x faster, with GCC 5.4) when built for aarch64.
    
    Examples of runtimes vs C on a Cortex A57 (for a slightly older version
    of the patch):
                                    A57 gcc-5.3   neon
    vp9_inv_adst_adst_4x4_add_neon:       152.2   60.0
    vp9_inv_adst_adst_8x8_add_neon:       948.2  288.0
    vp9_inv_adst_adst_16x16_add_neon:    4830.4 1380.5
    vp9_inv_dct_dct_4x4_add_neon:         153.0   58.6
    vp9_inv_dct_dct_8x8_add_neon:         789.2  180.2
    vp9_inv_dct_dct_16x16_add_neon:      3639.6  917.1
    vp9_inv_dct_dct_32x32_add_neon:     20462.1 4985.0
    vp9_inv_wht_wht_4x4_add_neon:          91.0   49.8
    
    The asm is around 3-4x faster than C on the Cortex A57, and the asm
    is around 30-50% faster on the A57 than on the A53.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>
vp9itxfm_neon.S 45.2 KB