    arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm · 2ed67eba
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    This is structured similarly to the 8 bit version. In the 8 bit
    version, the coefficients are 16 bits, and intermediates are 32 bits.
    
    Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
    content, the intermediates also fit in 32 bits, but for all other
    transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
    and 12 bit) the intermediates are 64 bit.
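
A minimal scalar sketch of why the wider intermediates matter, assuming the usual VP9 convention of 14-bit fixed-point constants (for example 11585, which is cos(pi/4) scaled by 2^14) followed by a rounding shift by 14. The helper names `product_fits_32` and `mul_round14` are hypothetical and not taken from the assembly, and the sample magnitude 2^18 is only an illustration, not a bound stated in this commit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: does coef * c fit in a signed 32-bit intermediate? */
static int product_fits_32(int32_t coef, int16_t c)
{
    int64_t p = (int64_t)coef * c;
    return p >= INT32_MIN && p <= INT32_MAX;
}

/* Hypothetical helper: multiply a 32-bit coefficient by a 14-bit
 * fixed-point constant using a 64-bit intermediate, then round and
 * shift back down by 14 bits. Right-shifting a negative value is
 * assumed to be an arithmetic shift, as on common compilers. */
static int32_t mul_round14(int32_t coef, int16_t c)
{
    int64_t prod = (int64_t)coef * c;
    int64_t r = (prod + (1 << 13)) >> 14;
    assert(r >= INT32_MIN && r <= INT32_MAX);
    return (int32_t)r;
}
```

With these assumptions, `product_fits_32(1 << 15, 11585)` holds, while `product_fits_32(1 << 18, 11585)` does not, even though the rounded result of `mul_round14(1 << 18, 11585)` fits comfortably in 32 bits again.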
    
    For the existing 8 bit case, the 8x8 transform fit all coefficients in
    registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
    transform also has to be done in slices of 4 pixels (just as 16x16 and
    32x32 for 8 bit).
    
    The slice width also shrinks from 4 elements to 2 elements in parallel
    for the 16x16 and 32x32 cases.
    
    The 16 bit coefficients from idct_coeffs and similar tables also need
    to be lengthened to 32 bit in order to be used in multiplication with
    vectors with 32 bit elements. This leads to the fixed coefficient
    vectors needing more space, leading to more cases where they have to
    be reloaded within the transform (in iadst16).
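
A scalar sketch of the lengthening step, as a stand-in for a NEON widening move such as `vmovl.s16`. The function name `widen_coeffs` is hypothetical, and the real code works on registers rather than arrays. Eight 16-bit constants occupy 128 bits, that is one q register, while the same eight constants widened to 32 bits occupy 256 bits, that is two q registers, which is where the extra register pressure comes from:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend n 16-bit constants to 32 bits each,
 * so they can be multiplied with vectors of 32-bit elements. */
static void widen_coeffs(int32_t *dst, const int16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];   /* implicit sign extension int16_t -> int32_t */
}
```

The widened table takes twice the storage of the original, so fewer constants can stay resident in registers at once.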
    
    This technically would need testing in checkasm for subpartitions
    in increments of 2, but that slows down normal checkasm runs
    excessively.
    
    Examples of relative speedup compared to the C version, from checkasm:
                                         Cortex    A7     A8     A9    A53
    vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
    vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
    vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
    vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
    vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
    vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
    vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
    vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
    vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
    vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
    vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
    vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
    vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
    vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
    vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
    vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
    vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97
    
    The speedup compared to the C functions is around 1.3 to 7x for the
    full transforms, even higher for the smaller subpartitions.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>
vp9itxfm_16bpp_neon.S 55 KB