  1. 11 Feb, 2017 1 commit
  2. 10 Feb, 2017 11 commits
  3. 09 Feb, 2017 14 commits
    • nvenc: make gpu indices independent of supported capabilities · a52976c0
      Timo Rothenpieler authored
      Do not allocate a CUDA context for every available gpu.
      Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
    • Derek Buitenhuis's avatar
    • Martin Storsjö's avatar
      0c0b87f1
    • Martin Storsjö's avatar
    • Martin Storsjö's avatar
    • aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible · ed8d2933
      Martin Storsjö authored
      The ld1r is a leftover from the arm version, where this trick is
      beneficial on some cores.
      
      Use a single-lane load where we don't need the semantics of ld1r.
      Signed-off-by: Martin Storsjö <martin@martin.st>
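The difference between the two load forms named in this commit can be modelled in a few lines. This is only a toy sketch of the lane semantics, not the actual assembly: a vector register is represented as a Python list, and the helper names `ld1r` and `ld1_lane` are hypothetical stand-ins for the A64 instructions.

```python
def ld1r(memory, addr, lanes=4):
    """Model of ld1r: load one element and replicate it into every lane."""
    return [memory[addr]] * lanes


def ld1_lane(reg, memory, addr, lane):
    """Model of single-lane ld1: load one element into one lane only.

    The other lanes keep whatever the register held before.
    """
    out = list(reg)
    out[lane] = memory[addr]
    return out


memory = [7, 9]
replicated = ld1r(memory, 0)                       # every lane holds memory[0]
single = ld1_lane([1, 2, 3, 4], memory, 1, lane=0)  # only lane 0 changes
```

When later code reads only one lane, both forms give it the same value, which is the case the commit message describes as not needing the semantics of ld1r.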
    • Martin Storsjö's avatar
    • Martin Storsjö's avatar
    • aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 · a63da451
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 14740 bytes to 24292 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7
      
      After:
      vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
      vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
      vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
      vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
      vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
      vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
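The idea of skipping coefficients that are known to be zero can be illustrated with a naive one-dimensional inverse DCT. This is a minimal sketch under stated assumptions: it uses a floating-point DCT-III as a stand-in, not the integer VP9 transform or the NEON code from the commit, and the function name `idct_1d` and its `n_nonzero` parameter are hypothetical.

```python
import math


def idct_1d(coeffs, n_nonzero=None):
    """Naive floating-point inverse DCT (DCT-III) of length len(coeffs).

    If n_nonzero is given, only the first n_nonzero coefficients are read;
    the caller promises that the remaining ones are zero.
    """
    n = len(coeffs)
    k_max = n if n_nonzero is None else n_nonzero
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(k_max):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(acc)
    return out


# A 16-point input where only the first quarter of the coefficients is set.
coeffs = [10.0, -3.0, 2.5, 1.0] + [0.0] * 12
full = idct_1d(coeffs)                 # reads all 16 coefficients
quarter = idct_1d(coeffs, n_nonzero=4)  # reads only the first 4
```

Both calls produce the same 16 outputs, but the second one never touches the zero coefficients. In this naive form that cuts the inner loop from 16 to 4 iterations per output; the actual savings of the optimized functions are the ones reported in the benchmark numbers above.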
    • arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 5eb5aec4
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This avoids loading and calculating coefficients that we know will
      be zero, and avoids filling the temp buffer with zeros in places
      where we know the second pass won't read.
      
      This gives a pretty substantial speedup for the smaller subpartitions.
      
      The code size increases from 12388 bytes to 19784 bytes.
      
      The idct16/32_end macros are moved above the individual functions; the
      instructions themselves are unchanged, but since new functions are added
      at the same place where the code is moved from, the diff looks rather
      messy.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
      vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
      vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
      
      After:                               Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
      vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
      vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
      vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
      vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
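The Before/After figures above are easier to compare as ratios. The following sketch computes the per-core speedup (before divided by after) for three rows copied from the tables in this commit message; the dictionary layout and the `speedups` helper are illustrative choices, not part of the original commit.

```python
CORES = ("A7", "A8", "A9", "A53")

# Rows copied from the Before/After tables of this commit message.
before = {
    "16x16_sub2": (2102.1, 1521.7, 1736.2, 1265.8),
    "32x32_sub2": (10776.7, 7949.8, 8567.7, 6819.7),
    "32x32_sub32": (18696.2, 17049.2, 14355.6, 12089.7),
}
after = {
    "16x16_sub2": (1203.5, 998.2, 1035.3, 763.0),
    "32x32_sub2": (7554.6, 5652.4, 6048.4, 4920.2),
    "32x32_sub32": (18531.3, 17457.0, 14298.6, 12080.0),
}


def speedups(before, after):
    """Return {row: {core: before / after}}; values above 1.0 mean faster."""
    return {
        name: {core: b / a for core, b, a in zip(CORES, before[name], after[name])}
        for name in before
    }


result = speedups(before, after)
for name, per_core in result.items():
    print(name, {core: round(ratio, 2) for core, ratio in per_core.items()})
```

The small subpartitions (sub2) show the largest ratios, while the full sub32 case stays close to 1.0 on every core, which matches the commit message's remark that the speedup applies to the smaller subpartitions.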
    • aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 79d332eb
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 47b3c2c1
      Martin Storsjö authored
      This allows reusing the macro for a separate implementation of the
      pass2 function.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Make the larger core transforms standalone functions · 11547601
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
      19496 to 14740 bytes.
      
      This gives a small slowdown of a couple of tens of cycles, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7
      
      After:
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Make the larger core transforms standalone functions · 0331c3f5
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
      15324 to 12388 bytes.
      
      This gives a small slowdown of a couple tens of cycles, up to around
      150 cycles for the full case of the largest transform, but makes
      it more feasible to add more optimized versions of these transforms.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2
      
      After:                               Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
  4. 08 Feb, 2017 2 commits
  5. 07 Feb, 2017 3 commits
  6. 06 Feb, 2017 3 commits
  7. 05 Feb, 2017 2 commits
  8. 04 Feb, 2017 1 commit
  9. 03 Feb, 2017 3 commits
    • Diego Biurrun's avatar
      7abdd026
    • build: Ignore generated .version files · 740b0bf0
      Diego Biurrun authored
    • rtmp: Correctly handle the Window Acknowledgement Size packets · 15a92e0c
      Martin Storsjö authored
      This swaps which field is set when the Window Acknowledgement Size
      and Set Peer BW packets are received, renames the fields in
      order to clarify their role further and adds verbose comments
      explaining their respective roles and how well the code currently
      does what it is supposed to.
      
      The Set Peer BW packet tells the receiver of the packet (which
      can be either client or server) that it should not send more data
      if it already has sent more data than the specified number of bytes,
      without receiving acknowledgement for them. Actually checking this
      limit is currently not implemented.
      
      In order to be able to check that properly, one can send the
      Window Acknowledgement Size packet, which tells the receiver of the
      packet that it needs to send Acknowledgement packets
      (RTMP_PT_BYTES_READ) at least after receiving a given number of bytes
      since the last Acknowledgement.
      
      A received Window Acknowledgement Size packet thus sets the maximum
      number of bytes we can receive without sending an Acknowledgement,
      so when handling this packet we should set the receive_report_size
      field (previously client_report_size).
      Signed-off-by: Martin Storsjö <martin@martin.st>
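The receiver-side behaviour described in this commit message can be sketched as a small byte counter. This is a simplified model, not the libavformat code: only the name `receive_report_size` comes from the commit message, while the class `AckTracker`, its methods, and the choice to acknowledge exactly when a full window has been received are assumptions made for illustration.

```python
class AckTracker:
    """Toy model of the receiver side of RTMP window acknowledgement."""

    def __init__(self, receive_report_size):
        # Maximum number of bytes we may receive without acknowledging.
        self.receive_report_size = receive_report_size
        self.bytes_read = 0      # total bytes received so far
        self.last_reported = 0   # value of bytes_read at the last ack

    def on_window_ack_size(self, size):
        """Handle a Window Acknowledgement Size packet from the peer."""
        self.receive_report_size = size

    def on_data(self, nbytes):
        """Account for received bytes.

        Returns the byte count to report in an Acknowledgement
        (RTMP_PT_BYTES_READ) packet, or None if no ack is due yet.
        """
        self.bytes_read += nbytes
        unacked = self.bytes_read - self.last_reported
        if self.receive_report_size > 0 and unacked >= self.receive_report_size:
            self.last_reported = self.bytes_read
            return self.bytes_read
        return None


tracker = AckTracker(receive_report_size=1000)
first = tracker.on_data(400)    # 400 bytes unacknowledged, below the window
second = tracker.on_data(700)   # 1100 bytes unacknowledged, ack is due
```

The Set Peer BW packet governs the opposite direction (how much the sender may have outstanding), which the commit message notes is not yet checked, so it is left out of this sketch.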