1. 11 Feb, 2017 13 commits
  2. 10 Feb, 2017 11 commits
  3. 09 Feb, 2017 14 commits
    • nvenc: make gpu indices independent of supported capabilities · a52976c0
      Timo Rothenpieler authored
      Do not allocate a CUDA context for every available gpu.
      Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
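One way to picture the indexing change is the toy model below. It is a hypothetical Python sketch, not the nvenc code: the device list, the `supports_nvenc` flag, both helper functions and the `probe_log` stand-in for context creation are assumptions made for illustration only.

```python
# Hypothetical model: every name here is an assumption for illustration,
# not something taken from the nvenc sources.
devices = [
    {"name": "gpu0", "supports_nvenc": False},
    {"name": "gpu1", "supports_nvenc": True},
    {"name": "gpu2", "supports_nvenc": True},
]


def pick_by_filtered_index(index, probe_log):
    """Index into the list of capable GPUs, so every GPU is probed first."""
    capable = []
    for dev in devices:
        probe_log.append(dev["name"])  # stands in for creating a context
        if dev["supports_nvenc"]:
            capable.append(dev)
    return capable[index]["name"]


def pick_by_raw_index(index, probe_log):
    """Index into the plain enumeration order, probing only the chosen GPU."""
    dev = devices[index]
    probe_log.append(dev["name"])  # the only context this model creates
    if not dev["supports_nvenc"]:
        raise ValueError(f"{dev['name']} cannot encode")
    return dev["name"]
```

In this model the filtered variant touches all three devices and its index shifts whenever a capability check changes the list, while the raw variant touches one device and index 1 always means the second enumerated GPU.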
    • aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible · ed8d2933
      Martin Storsjö authored
      The ld1r is a leftover from the arm version, where this trick is
      beneficial on some cores.
      
      Use a single-lane load where we don't need the semantics of ld1r.
      Signed-off-by: Martin Storsjö <martin@martin.st>
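The difference between the two load forms can be modelled roughly as follows. This is a toy Python sketch that treats a vector register as a list of lanes; the `memory` list and the helper names are illustrative assumptions, not real instructions or APIs.

```python
# Toy model of a 4-lane vector register as a Python list. The helpers only
# mimic the lane behaviour of the two load forms; they are not real APIs.
def ld1r(memory, addr, lanes=4):
    """Load one element and replicate it into every lane."""
    return [memory[addr]] * lanes


def ld1_lane(reg, lane, memory, addr):
    """Load one element into a single lane; other lanes keep their values."""
    out = list(reg)
    out[lane] = memory[addr]
    return out


memory = [42, 0, 0, 0]
v_rep = ld1r(memory, 0)                        # all lanes hold the value
v_lane = ld1_lane([7, 7, 7, 7], 0, memory, 0)  # only lane 0 is replaced
```

If the code that follows reads only lane 0, both forms deliver the same value there, which is the situation the commit message describes as not needing the semantics of ld1r.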
    • aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 · a63da451
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 14740 bytes to 24292 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
      vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
      vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
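The idea of skipping known-zero coefficients and unread temp-buffer rows can be sketched as below. This is a minimal illustration under stated assumptions: it uses a naive floating-point inverse DCT in Python rather than the VP9 integer transform or NEON code, and the function names and the `m` parameter are invented for the example.

```python
import math


def idct_1d(coeffs, n):
    """Naive floating-point inverse DCT producing n samples.

    Only len(coeffs) inputs are read; higher coefficients are taken as zero.
    """
    out = []
    for x in range(n):
        acc = coeffs[0] / 2.0
        for k in range(1, len(coeffs)):
            acc += coeffs[k] * math.cos(math.pi * k * (2 * x + 1) / (2 * n))
        out.append(acc)
    return out


def idct_2d(block, m):
    """2-D inverse transform assuming only the top-left m x m is nonzero."""
    n = len(block)
    # Pass 1: transform only the first m rows, reading m coefficients each.
    # The remaining rows of the temp buffer are never written.
    tmp = [idct_1d(block[r][:m], n) for r in range(m)]
    # Pass 2: each column reads just the m rows that pass 1 produced.
    cols = [idct_1d([tmp[r][c] for r in range(m)], n) for c in range(n)]
    return [[cols[c][y] for c in range(n)] for y in range(n)]


N = 16
block = [[0.0] * N for _ in range(N)]
for r in range(4):
    for c in range(4):
        block[r][c] = float(1 + r * 4 + c)

full = idct_2d(block, N)     # reads all 256 coefficients
quarter = idct_2d(block, 4)  # reads only the 16 nonzero ones
max_diff = max(abs(full[y][x] - quarter[y][x])
               for y in range(N) for x in range(N))
```

Both calls should agree up to floating-point rounding, while the quarter variant runs far fewer multiply-accumulate steps in each pass.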
    • arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 5eb5aec4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 12388 bytes to 19784 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
      vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
      vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
      vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
      vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
      vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
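To read the tables above as per-core speedups, a small helper like the following can be used. It is only a sketch: the numbers are copied from the sub4 rows of the Before and After tables, while the dictionary layout and the `speedups` helper are assumptions made for the example.

```python
# Cycle counts copied from the sub4 rows of the Before/After tables.
CORES = ["A7", "A8", "A9", "A53"]
before = {
    "16x16_sub4": [2104.5, 1533.0, 1736.6, 1265.5],
    "32x32_sub4": [10865.6, 8131.5, 8589.6, 6816.3],
}
after = {
    "16x16_sub4": [1203.5, 998.1, 1035.5, 760.8],
    "32x32_sub4": [7549.9, 5685.0, 6046.9, 4925.7],
}


def speedups(name):
    """Return {core: before/after cycle ratio} for one benchmark row."""
    return {c: b / a for c, b, a in zip(CORES, before[name], after[name])}


for name in before:
    row = ", ".join(f"{c}: {s:.2f}x" for c, s in speedups(name).items())
    print(f"{name}: {row}")
```

A ratio above 1 means the new code needs fewer cycles; the same helper works for any other row once its numbers are added to both dictionaries.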
    • aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 79d332eb
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 47b3c2c1
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Make the larger core transforms standalone functions · 11547601
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
      19496 to 14740 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8
      Signed-off-by: Martin Storsjö <martin@martin.st>
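The size-versus-speed trade-off stated in this commit can be checked with a few lines of arithmetic. The figures below are copied from the commit message; the variable names are assumptions made for the sketch.

```python
# Object size of libavcodec/aarch64/vp9itxfm_neon.o, in bytes.
size_before, size_after = 19496, 14740

# (before, after) cycle counts from the benchmark rows in the message.
cycles = {
    "16x16_sub4": (1036.7, 1051.0),
    "16x16_sub16": (1372.2, 1390.1),
    "32x32_sub4": (5180.0, 5199.9),
    "32x32_sub32": (8095.7, 8125.8),
}

saved_bytes = size_before - size_after
saved_pct = 100.0 * saved_bytes / size_before
extra_cycles = {name: a - b for name, (b, a) in cycles.items()}

print(f"saved {saved_bytes} bytes ({saved_pct:.1f}%)")
for name, extra in extra_cycles.items():
    print(f"{name}: +{extra:.1f} cycles")
```

The size difference is 4756 bytes, and the per-function cost lands between roughly 14 and 30 cycles, in line with the "couple of tens of cycles" wording.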
    • arm: vp9itxfm: Make the larger core transforms standalone functions · 0331c3f5
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
      15324 to 12388 bytes.
      
      This gives a small slowdown of a couple tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
  4. 08 Feb, 2017 2 commits