    vp9: add 32x32 idct AVX2 implementation. · 726501a3
    Ronald S. Bultje authored
    About 1.8x speedup over the AVX version for the full IDCT. The other
    sub-IDCT scenarios also see speedups, roughly 1.3x to 1.9x over AVX.
    Full --bench output for idct_32x32_add_${bpp}_${subidct}_${opt}
    (50k cycles):
    
    nop: 16.5
    vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
    vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
    vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
    vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
    vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
    vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
    vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
    vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
    vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
    vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
    vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
    vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
    vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
    vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
    vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
    vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
    vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
    vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
    vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
    vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
    vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
    vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
    vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
    vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
    vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
    vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
    vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
    vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
    vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
    vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
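
The speedup figure in the message can be checked against the numbers above. The following is a minimal sketch, not part of the commit: the `BENCH` table and the `avx2_speedups` helper are hypothetical names introduced here, and the values are copied by hand from the `_avx` and `_avx2` lines of the benchmark output.

```python
# Cycle counts copied from the --bench output above.
# Layout (an assumption of this sketch): {subidct: (avx_cycles, avx2_cycles)}
BENCH = {
    1: (137.1, 73.2),
    2: (958.5, 704.2),
    4: (1000.7, 717.1),
    8: (983.0, 729.4),
    16: (1276.7, 719.5),
    32: (2539.6, 1395.0),
}


def avx2_speedups(bench):
    """Return {subidct: avx_cycles / avx2_cycles} for each sub-IDCT size."""
    return {sub: avx / avx2 for sub, (avx, avx2) in bench.items()}


speedups = avx2_speedups(BENCH)
for sub, ratio in speedups.items():
    # Ratios above 1.0 mean the AVX2 version needs fewer cycles than AVX.
    print(f"subidct={sub:>2}: {ratio:.2f}x")
```

The `32` entry corresponds to the full IDCT, which is where the "about 1.8x" figure comes from. The raw counts include the `nop` overhead of 16.5, so the ratio for the very cheap `_1` case would shift slightly if that overhead were subtracted first.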
vp9itxfm.asm 103 KB