1. 27 Mar, 2017 1 commit
  2. 14 Jan, 2017 1 commit
    • arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 388f6e67
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
                                           Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
      
      By skipping individual 4x16 or 4x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
      vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
      vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
      
      I.e. in general a very minor overhead for the full subpartition case due
      to the additional loads and cmps, but a significant speedup for the cases
      when we only need to process a small part of the actual input data.
      
      In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
      16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
      8x8 or 16x16 subpartitions respectively.
      
      This is cherrypicked from libav commit
      9c8bc74c.
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
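
The slice skipping described in this commit message can be pictured with a small scalar sketch. This is an illustration under stated assumptions, not the NEON assembly from the commit: the names `first_pass`, `first_pass_line` and `min_eob`, as well as the threshold values in the table, are hypothetical, and the 1-D transform is replaced by a plain transposing copy so the example stays runnable.

```c
#include <stdint.h>

#define SIZE       16              /* 16x16 block */
#define SLICE      4               /* the first pass handles 4 lines per slice */
#define NUM_SLICES (SIZE / SLICE)

/* HYPOTHETICAL thresholds (illustrative values, not taken from the commit
 * message): slice i is assumed to hold only zeros whenever
 * eob <= min_eob[i]. */
static const int min_eob[NUM_SLICES] = { 0, 10, 38, 89 };

/* Stand-in for the real 1-D inverse DCT of one line: it only copies the
 * line into its transposed position in the intermediate buffer. */
static void first_pass_line(const int16_t *in, int16_t *tmp, int line)
{
    for (int k = 0; k < SIZE; k++)
        tmp[k * SIZE + line] = in[line * SIZE + k];
}

/* Runs the first pass and returns how many slices were actually processed. */
static int first_pass(const int16_t *coeffs, int16_t *tmp, int eob)
{
    int slice;

    for (slice = 0; slice < NUM_SLICES; slice++) {
        /* One extra table load and compare per slice. */
        if (slice > 0 && eob <= min_eob[slice])
            break;
        for (int l = 0; l < SLICE; l++)
            first_pass_line(coeffs, tmp, slice * SLICE + l);
    }

    /* Skipped slices still need defined (zero) intermediate data for the
     * second pass. */
    for (int line = slice * SLICE; line < SIZE; line++)
        for (int k = 0; k < SIZE; k++)
            tmp[k * SIZE + line] = 0;

    return slice;
}
```

With these illustrative thresholds, a call such as `first_pass(coeffs, tmp, 2)` processes only the first slice and zero-fills the rest of `tmp`, while `eob = 256` processes all four slices and pays only for the extra compares.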
  3. 27 Dec, 2016 1 commit
  4. 30 Nov, 2016 1 commit
    • arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 9c8bc74c
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
                                           Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
      
      By skipping individual 4x16 or 4x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
      vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
      vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
      
      I.e. in general a very minor overhead for the full subpartition case due
      to the additional loads and cmps, but a significant speedup for the cases
      when we only need to process a small part of the actual input data.
      
      In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
      16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
      8x8 or 16x16 subpartitions respectively.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  5. 23 Nov, 2016 2 commits
  6. 16 Nov, 2016 1 commit
  7. 11 Nov, 2016 1 commit
  8. 03 Nov, 2016 1 commit
  9. 04 Oct, 2016 1 commit
    • checkasm: add VP9 loopfilter tests. · c935b54b
      Ronald S. Bultje authored
      The randomize_buffer() implementation assures that "most of the time",
      we'll do a good mix of wide16/wide8/hev/regular/no filters for complete
      code coverage. However, this is not mathematically assured because that
      would make the code either much more complex, or much less random.
      
      Some fixes and improvements by Rodger Combs <rodger.combs@gmail.com>
      Signed-off-by: Anton Khirnov <anton@khirnov.net>
  10. 03 Aug, 2016 1 commit
  11. 27 Jul, 2016 1 commit
  12. 13 Oct, 2015 1 commit
    • vp9: add itxfm_add eob shortcuts to 10/12bpp functions. · eb4b5ff7
      Ronald S. Bultje authored
      These aren't quite as helpful as the ones in 8bpp, since over there,
      we can use pmulhrsw, but here the coefficients have too many bits to
      be able to take advantage of pmulhrsw. However, we can still skip
      cols for which all coefs are 0, and instead just zero the input data
      for the row itx. This helps a few % on overall decoding speed.
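
The pmulhrsw remark in this message can be made concrete with a scalar model of one 16-bit lane of the instruction. This is a sketch for illustration only: the helper names `pmulhrsw_lane`, `mul_round14` and `fits_int16` are hypothetical, and 11585 (approximately cos(π/4) · 2^14) is used simply as an example of a 14-bit fixed-point constant. The model shows that a rounding multiply by such a constant maps exactly onto pmulhrsw with the constant doubled, as long as the coefficient itself fits in a signed 16-bit lane.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of x86 PMULHRSW: multiply, add the
 * rounding term, shift right by 15. Assumes >> on a negative value is an
 * arithmetic shift, as on mainstream compilers. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}

/* Reference: rounding multiply by a 14-bit fixed-point constant, computed
 * in 64 bits so wide coefficients cannot overflow. */
static int32_t mul_round14(int32_t a, int32_t c)
{
    return (int32_t)(((int64_t)a * c + (1 << 13)) >> 14);
}

/* Returns 1 if a coefficient can be held in a signed 16-bit lane. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

For any 16-bit `a`, `pmulhrsw_lane(a, 2 * 11585)` equals `mul_round14(a, 11585)`. A coefficient for which `fits_int16()` returns 0 cannot take this path, which is the limitation the message describes for the 10/12bpp functions.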
  13. 28 Sep, 2015 2 commits
  14. 26 Sep, 2015 3 commits
  15. 24 Sep, 2015 1 commit
  16. 22 Sep, 2015 1 commit
  17. 20 Sep, 2015 3 commits
  18. 16 Sep, 2015 1 commit
  19. 15 Sep, 2015 2 commits