13 Feb, 2017 (3 commits)
12 Feb, 2017 (1 commit)
11 Feb, 2017 (13 commits)
10 Feb, 2017 (11 commits)
09 Feb, 2017 (12 commits)
    • nvenc: make gpu indices independent of supported capabilities · a52976c0
      Timo Rothenpieler authored
      Do not allocate a CUDA context for every available gpu.
      Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
    • Derek Buitenhuis
    • Martin Storsjö · 0c0b87f1
    • Martin Storsjö
    • Martin Storsjö
    • aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible · ed8d2933
      Martin Storsjö authored
      The ld1r is a leftover from the arm version, where this trick is
      beneficial on some cores.
      
      Use a single-lane load where we don't need the semantics of ld1r.
      Signed-off-by: Martin Storsjö <martin@martin.st>
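      The substitution described above can be sketched as a minimal, out-of-context pair of instructions. This is an illustration only; the register and address choices are placeholders, not taken from the actual diff:

      ```asm
      // ld1r loads one element and replicates it across every lane of v4.
      ld1r    {v4.8h},   [x14]
      // When only lane 0 of the result is ever read, a single-lane ld1
      // loads the same element without the replication work.
      ld1     {v4.h}[0], [x14]
      ```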
    • Martin Storsjö
    • Martin Storsjö
    • aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 · a63da451
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 14740 bytes to 24292 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
      vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
      vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
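      The dispatch this change enables can be sketched roughly as below. This is illustrative only: the label names and the eob threshold constants are hypothetical placeholders, not values from the commit, and w3 is assumed to hold the eob count:

      ```asm
      // Choose a pass variant based on how many coefficients are nonzero,
      // so rows known to be all-zero are never loaded or computed.
              cmp     w3, #1                    // eob == 1: DC coefficient only
              b.eq    idct16x16_dc_add_neon
              cmp     w3, #QUARTER_EOB_MAX      // nonzero coeffs fit in top-left 4x4
              b.le    idct16x16_quarter_add_neon
              cmp     w3, #HALF_EOB_MAX         // nonzero coeffs fit in top-left 8x8
              b.le    idct16x16_half_add_neon
              b       idct16x16_full_add_neon   // general case
      ```

      The second pass then only needs to fill (and later read) the portion of the temp buffer that the chosen first pass actually wrote, which is where the speedup for the small subpartitions comes from.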
    • arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 5eb5aec4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 12388 bytes to 19784 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
      vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
      vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
      vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
      vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
      vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 79d332eb
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 47b3c2c1
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>