    aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm · ceb36b81
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    Compared to the arm version, on aarch64 we can keep the full 8x8
    transform in registers, and for 16x16 and 32x32, we can process
    it in slices of 4 pixels instead of 2.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                    ARM  AArch64
    vp9_inv_adst_adst_4x4_sub4_add_10_neon:       111.0    109.7
    vp9_inv_adst_adst_8x8_sub8_add_10_neon:       914.0    733.5
    vp9_inv_adst_adst_16x16_sub16_add_10_neon:   5184.0   3745.7
    vp9_inv_dct_dct_4x4_sub1_add_10_neon:          65.0     65.7
    vp9_inv_dct_dct_4x4_sub4_add_10_neon:         100.0     96.7
    vp9_inv_dct_dct_8x8_sub1_add_10_neon:         111.0    119.7
    vp9_inv_dct_dct_8x8_sub8_add_10_neon:         618.0    494.7
    vp9_inv_dct_dct_16x16_sub1_add_10_neon:       295.1    284.6
    vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2303.2   1883.9
    vp9_inv_dct_dct_16x16_sub8_add_10_neon:      2984.8   2189.3
    vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3890.0   2799.4
    vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1044.4   1012.7
    vp9_inv_dct_dct_32x32_sub2_add_10_neon:     13333.7   9695.1
    vp9_inv_dct_dct_32x32_sub16_add_10_neon:    18531.3  12459.8
    vp9_inv_dct_dct_32x32_sub32_add_10_neon:    24470.7  16160.2
    vp9_inv_wht_wht_4x4_sub4_add_10_neon:          83.0     79.7
    
    The larger transforms are significantly faster than the corresponding
    ARM versions.
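
As a rough illustration of that claim, the per-function speedup can be computed as the ARM runtime divided by the AArch64 runtime. The sketch below only reuses the figures from the table above. The `speedup` helper and the `runtimes` dictionary are names made up for this example and are not part of the commit.

```python
# Runtimes copied from the table above: (ARM, AArch64), in the units reported there.
runtimes = {
    "vp9_inv_adst_adst_4x4_sub4_add_10_neon": (111.0, 109.7),
    "vp9_inv_adst_adst_8x8_sub8_add_10_neon": (914.0, 733.5),
    "vp9_inv_adst_adst_16x16_sub16_add_10_neon": (5184.0, 3745.7),
    "vp9_inv_dct_dct_4x4_sub1_add_10_neon": (65.0, 65.7),
    "vp9_inv_dct_dct_4x4_sub4_add_10_neon": (100.0, 96.7),
    "vp9_inv_dct_dct_8x8_sub1_add_10_neon": (111.0, 119.7),
    "vp9_inv_dct_dct_8x8_sub8_add_10_neon": (618.0, 494.7),
    "vp9_inv_dct_dct_16x16_sub1_add_10_neon": (295.1, 284.6),
    "vp9_inv_dct_dct_16x16_sub2_add_10_neon": (2303.2, 1883.9),
    "vp9_inv_dct_dct_16x16_sub8_add_10_neon": (2984.8, 2189.3),
    "vp9_inv_dct_dct_16x16_sub16_add_10_neon": (3890.0, 2799.4),
    "vp9_inv_dct_dct_32x32_sub1_add_10_neon": (1044.4, 1012.7),
    "vp9_inv_dct_dct_32x32_sub2_add_10_neon": (13333.7, 9695.1),
    "vp9_inv_dct_dct_32x32_sub16_add_10_neon": (18531.3, 12459.8),
    "vp9_inv_dct_dct_32x32_sub32_add_10_neon": (24470.7, 16160.2),
    "vp9_inv_wht_wht_4x4_sub4_add_10_neon": (83.0, 79.7),
}


def speedup(arm: float, aarch64: float) -> float:
    """Ratio of ARM runtime to AArch64 runtime; values above 1 mean AArch64 is faster."""
    return arm / aarch64


if __name__ == "__main__":
    # Print one line per function, e.g. to compare small and large transforms.
    for name, (arm, aarch64) in runtimes.items():
        print(f"{name:45s} {speedup(arm, aarch64):.2f}x")
```

Computed this way, the full 32x32 transform comes out at roughly 1.5x, while the 4x4 and single-coefficient (sub1) cases stay close to 1x.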
    
    The speedup vs C code is smaller than in 32 bit mode, probably
    because the 64 bit intermediates in the C code can be expressed
    more efficiently in aarch64.
    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libpostproc
libswresample
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
CONTRIBUTING.md
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL.md
LICENSE.md
MAINTAINERS
Makefile
README.md
RELEASE
arch.mak
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
cmdutils_opencl.c
common.mak
configure
ffmpeg.c
ffmpeg.h
ffmpeg_cuvid.c
ffmpeg_dxva2.c
ffmpeg_filter.c
ffmpeg_opt.c
ffmpeg_qsv.c
ffmpeg_vaapi.c
ffmpeg_vdpau.c
ffmpeg_videotoolbox.c
ffplay.c
ffprobe.c
ffserver.c
ffserver_config.c
ffserver_config.h
library.mak
version.sh