    arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 9c8bc74c
    Martin Storsjö authored
    This work is sponsored by, and copyright, Google.
    
    Previously all subpartitions except the eob=1 (DC) case ran with
    the same runtime:
    
                                         Cortex A7       A8       A9      A53
    vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
    vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
    
    By skipping individual 4x16 or 4x32 pixel slices in the first pass,
    we reduce the runtime of these functions like this:
    
    vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
    vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
    vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
    vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
    vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
    vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
    vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
    vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
    vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
    vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
    vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
    vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
    vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
    vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
    vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
    vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
    
    I.e., the full subpartition case incurs only a very minor overhead from
    the additional loads and cmps, while the cases where we only need to
    process a small part of the actual input data see a significant speedup.
    
    In a few inspected clips of common VP9 content, 70-90% of the non-DC-only
    16x16 and 32x32 IDCTs have nonzero coefficients only in the upper left
    8x8 or 16x16 subpartition, respectively.
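
The slice skipping described above can be illustrated with a minimal C sketch. It is not the NEON code from this commit: `transform_1d` is a placeholder linear transform standing in for the real 16-point IDCT, and the `min_eob` thresholds are made-up values that assume slice `s` can only hold nonzero coefficients once `eob` exceeds `min_eob[s]`. All names here are hypothetical. The sketch only shows the control flow: one compare per slice, and zero-filling of the intermediate buffer for skipped slices, which is valid because a linear transform of all-zero input is all zero.

```c
#include <stdint.h>
#include <string.h>

enum { N = 16, SLICE = 4 };

/* Placeholder for the real 16-point 1-D IDCT (assumption): a running sum.
 * It is linear, so an all-zero input row yields an all-zero output row. */
static void transform_1d(const int16_t *in, int32_t *out)
{
    int32_t acc = 0;
    for (int i = 0; i < N; i++) {
        acc += in[i];
        out[i] = acc;
    }
}

/* Hypothetical thresholds: slice s is assumed to be empty
 * whenever eob <= min_eob[s]. The values are illustrative only. */
static const int min_eob[N / SLICE] = { 0, 10, 38, 89 };

/* First pass over a 16x16 block, four rows (one slice) at a time.
 * Returns the number of slices that were actually transformed. */
static int first_pass(const int16_t coef[N * N], int32_t tmp[N * N], int eob)
{
    int done = 0;
    for (int s = 0; s < N / SLICE; s++) {
        if (eob <= min_eob[s]) {
            /* This slice and all later ones are empty: zero-fill and stop. */
            memset(&tmp[s * SLICE * N], 0,
                   (size_t)(N / SLICE - s) * SLICE * N * sizeof(*tmp));
            break;
        }
        for (int r = 0; r < SLICE; r++)
            transform_1d(&coef[(s * SLICE + r) * N],
                         &tmp[(s * SLICE + r) * N]);
        done++;
    }
    return done;
}
```

With these placeholder thresholds, a call with `eob = 5` transforms only the first slice, while the full case pays one extra compare per slice. The second pass is not shown and would run unchanged over the whole intermediate buffer.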

    Signed-off-by: Martin Storsjö <martin@martin.st>
Name
compat
doc
libavcodec
libavdevice
libavfilter
libavformat
libavresample
libavutil
libswscale
presets
tests
tools
.gitattributes
.gitignore
.travis.yml
COPYING.GPLv2
COPYING.GPLv3
COPYING.LGPLv2.1
COPYING.LGPLv3
CREDITS
Changelog
INSTALL
LICENSE
Makefile
README
README.md
RELEASE
arch.mak
avconv.c
avconv.h
avconv_dxva2.c
avconv_filter.c
avconv_opt.c
avconv_qsv.c
avconv_vaapi.c
avconv_vda.c
avconv_vdpau.c
avplay.c
avprobe.c
cmdutils.c
cmdutils.h
cmdutils_common_opts.h
common.mak
configure
library.mak
version.sh