    arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm · 2ed67eba
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    This is structured similarly to the 8 bit version. In the 8 bit
    version, the coefficients are 16 bits, and intermediates are 32 bits.
    
    Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
    content, the intermediates also fit in 32 bits, but for all other
    transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
    and 12 bit) the intermediates are 64 bit.
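
    As a rough illustration of why the wider intermediates are needed, the
    scalar C sketch below multiplies a 32 bit coefficient by a 14 bit
    constant and rounds the result back down. The helper names, the
    constant 11585 and the rounding shift of 14 are assumptions chosen for
    illustration; this is not the NEON code from this commit.

```c
#include <stdint.h>

/* Hypothetical helper: multiply a 32 bit coefficient by a fixed-point
 * constant using a 64 bit intermediate, then round and shift right by
 * 14 bits. Assumes an arithmetic right shift for negative values. */
static int32_t mul_round14(int32_t coef, int32_t c)
{
    int64_t prod = (int64_t)coef * c;            /* widening multiply */
    return (int32_t)((prod + (1 << 13)) >> 14);  /* round, narrow back */
}

/* Hypothetical helper: returns 1 if coef * c would still fit in a signed
 * 32 bit intermediate, 0 if it needs the 64 bit path. */
static int product_fits_32(int32_t coef, int32_t c)
{
    int64_t prod = (int64_t)coef * c;
    return prod >= INT32_MIN && prod <= INT32_MAX;
}
```

    With these illustrative inputs, mul_round14(1000, 11585) gives 707 and
    the product fits in 32 bits, while a coefficient of 200000 with the
    same constant overflows a 32 bit intermediate and only the 64 bit path
    gives the correct 141418.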
    
    For the existing 8 bit case, the 8x8 transform fit all coefficients in
    registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
    transform also has to be done in slices of 4 pixels (just as 16x16 and
    32x32 for 8 bit).
    
    The slice width also shrinks from 4 elements to 2 elements in parallel
    for the 16x16 and 32x32 cases.
    
    The 16 bit coefficients from idct_coeffs and similar tables also need
    to be lengthened to 32 bit in order to be used in multiplication with
    vectors with 32 bit elements. This leads to the fixed coefficient
    vectors needing more space, leading to more cases where they have to
    be reloaded within the transform (in iadst16).
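
    The lengthening step can be sketched in scalar C as a plain sign
    extension of each table entry, which is roughly what a widening move
    such as NEON's vmovl.s16 does per lane. The function name and the
    sample values are assumptions for illustration and are not claimed to
    match the layout of idct_coeffs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend n 16 bit constants into 32 bit slots.
 * The widened table takes twice the space of the original, which is why
 * fewer constants fit in registers at once. */
static void widen_coeffs(const int16_t *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];   /* implicit int16_t -> int32_t sign extension */
}
```

    Four 16 bit constants occupy 8 bytes, but 16 bytes once widened, so a
    set of constants that used to fit in one 64 bit register now needs a
    full 128 bit register.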
    
    This technically would need testing in checkasm for subpartitions
    in increments of 2, but that slows down normal checkasm runs
    excessively.
    
    Examples of relative speedup compared to the C version, from checkasm:
                                         Cortex    A7     A8     A9    A53
    vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
    vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
    vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
    vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
    vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
    vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
    vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
    vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
    vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
    vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
    vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
    vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
    vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
    vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
    vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
    vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
    vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97
    
    The speedup compared to the C functions is around 1.3 to 7x for the
    full transforms, even higher for the smaller subpartitions.
    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libpostproc
libswresample
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
CONTRIBUTING.md
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL.md
LICENSE.md
MAINTAINERS
Makefile
README.md
RELEASE
arch.mak
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
cmdutils_opencl.c
common.mak
configure
ffmpeg.c
ffmpeg.h
ffmpeg_cuvid.c
ffmpeg_dxva2.c
ffmpeg_filter.c
ffmpeg_opt.c
ffmpeg_qsv.c
ffmpeg_vaapi.c
ffmpeg_vdpau.c
ffmpeg_videotoolbox.c
ffplay.c
ffprobe.c
ffserver.c
ffserver_config.c
ffserver_config.h
library.mak
version.sh