1. 06 Dec, 2016 3 commits
  2. 05 Dec, 2016 7 commits
  3. 03 Dec, 2016 13 commits
  4. 02 Dec, 2016 7 commits
  5. 01 Dec, 2016 1 commit
  6. 30 Nov, 2016 7 commits
    • aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · cad42fad
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0
      
      By skipping individual 8x16 or 8x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8
      
      I.e. the full subpartition case sees a very minor overhead from the
      additional cmps, while the cases where only a small part of the actual
      input data needs processing get a significant speedup.
      Signed-off-by: Martin Storsjö <martin@martin.st>
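The slice skipping described in this commit can be illustrated with a minimal C sketch. It is not the NEON implementation: `toy_transform_1d`, `inverse_2d` and the `nonzero_slices` parameter are hypothetical names, the 1-D transform is a stand-in running sum rather than a real IDCT, and a slice is modelled as a group of `SLICE` coefficient rows. The point is only that a linear transform maps all-zero input to all-zero output, so the first pass can zero-fill the intermediate buffer for empty slices instead of computing them.

```c
#include <stdint.h>
#include <string.h>

#define N     16  /* block size (16x16) */
#define SLICE 4   /* rows per slice; chosen for illustration */

/* Stand-in for a 1-D inverse transform (a running sum). Any linear
 * transform turns all-zero input into all-zero output, which is the
 * property that makes skipping valid. */
static void toy_transform_1d(const int32_t *in, int in_stride,
                             int32_t *out, int out_stride)
{
    int32_t acc = 0;
    for (int i = 0; i < N; i++) {
        acc += in[i * in_stride];
        out[i * out_stride] = acc;
    }
}

/* nonzero_slices (hypothetical parameter): number of leading slices
 * that may contain nonzero coefficients. */
static void inverse_2d(const int32_t coef[N * N], int32_t dst[N * N],
                       int nonzero_slices)
{
    int32_t tmp[N * N];

    /* First pass: transform rows slice by slice. */
    for (int s = 0; s < N / SLICE; s++) {
        if (s >= nonzero_slices) {
            /* Remaining slices are empty: zero-fill and stop. */
            memset(&tmp[s * SLICE * N], 0,
                   sizeof(int32_t) * (size_t)(N - s * SLICE) * N);
            break;
        }
        for (int r = s * SLICE; r < (s + 1) * SLICE; r++)
            toy_transform_1d(&coef[r * N], 1, &tmp[r * N], 1);
    }

    /* Second pass: always processes every column. */
    for (int c = 0; c < N; c++)
        toy_transform_1d(&tmp[c], N, &dst[c], N);
}
```

Calling `inverse_2d(coef, dst, 1)` on a block whose nonzero coefficients all lie in the first four rows gives the same output as `inverse_2d(coef, dst, N / SLICE)`, while doing a quarter of the first-pass work.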
    • arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 9c8bc74c
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
                                           Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
      
      By skipping individual 4x16 or 4x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
      vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
      vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
      vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
      vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
      vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
      vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
      vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
      vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
      vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
      vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
      
      I.e. the full subpartition case sees a very minor overhead from the
      additional loads and cmps, while the cases where only a small part of
      the actual input data needs processing get a significant speedup.
      
      In a few inspected clips of common VP9 content, 70-90% of the
      non-DC-only 16x16 and 32x32 IDCTs have nonzero coefficients only in
      the upper left 8x8 or 16x16 subpartition, respectively.
      Signed-off-by: Martin Storsjö <martin@martin.st>
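The commit message does not spell out how empty slices are detected. One plausible approach, sketched below in C under stated assumptions, is to derive a per-slice threshold from the coefficient scan order and compare it against eob. Everything here is illustrative: `build_diagonal_scan`, `build_min_eob` and `slices_needed` are hypothetical helpers, the scan is a plain anti-diagonal order (the real VP9 scan tables differ), and a slice is modelled as `SLICE` consecutive coefficient rows.

```c
#include <stdint.h>

#define N     16  /* block size (16x16) */
#define SLICE 4   /* rows per slice; chosen for illustration */

/* Fill scan[] with a plain anti-diagonal order: scan[pos] is the raster
 * index of the coefficient at scan position pos.
 * Assumption: stands in for the codec's real scan table. */
static void build_diagonal_scan(int16_t scan[N * N])
{
    int pos = 0;
    for (int d = 0; d < 2 * N - 1; d++)
        for (int row = 0; row < N; row++) {
            int col = d - row;
            if (col >= 0 && col < N)
                scan[pos++] = (int16_t)(row * N + col);
        }
}

/* min_eob[s] is the smallest eob at which slice s can contain a nonzero
 * coefficient: one more than the earliest scan position in that slice. */
static void build_min_eob(const int16_t scan[N * N], int min_eob[N / SLICE])
{
    /* Walk positions from last to first so the final write per slice
     * is its earliest scan position. */
    for (int pos = N * N - 1; pos >= 0; pos--) {
        int slice = (scan[pos] / N) / SLICE;
        min_eob[slice] = pos + 1;
    }
}

/* Number of leading slices the first pass has to process for this eob. */
static int slices_needed(const int min_eob[N / SLICE], int eob)
{
    int n = 0;
    while (n < N / SLICE && eob >= min_eob[n])
        n++;
    return n;
}
```

With a small table like `min_eob`, the per-block cost of the check is a handful of loads and compares, which is consistent with the minor overhead reported for the full subpartition case.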
    • arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination · 3c87039a
      Martin Storsjö authored
      This avoids reloading them when the first pass was also idct, since
      they haven't been clobbered in that case.
      
      This is similar to what was done in the aarch64 version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • vp9dsp: add DC only versions for idct/idct. · c4c5f538
      Clément Bœsch authored
      before:
      
      time ./avconv -v 0 -nostats -threads 1 -i sintel_vp9_500kbps.webm -f null -
      real    0m11.125s
      user    0m11.059s
      sys     0m0.050s
      
      time ./avconv -v 0 -nostats -threads 1 -i sintel_vp9_500kbps.webm -f null -
      real    0m10.944s
      user    0m10.819s
      sys     0m0.064s
      
      after:
      
      time ./avconv -v 0 -nostats -threads 1 -i sintel_vp9_500kbps.webm -f null -
      real    0m8.153s
      user    0m8.034s
      sys     0m0.050s
      
      time ./avconv -v 0 -nostats -threads 1 -i sintel_vp9_500kbps.webm -f null -
      real    0m8.038s
      user    0m7.980s
      sys     0m0.039s
      Signed-off-by: Martin Storsjö <martin@martin.st>
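To illustrate why a DC-only path is cheap, here is a hedged C sketch of the idea, not the code from this commit. When only the DC coefficient is nonzero, both transform passes reduce to scaling a single value, and the whole block receives the same offset. The names `idct_dc_only_add` and `clip_u8` are hypothetical; the constant 11585 is 2^14 * cos(pi/4) rounded to the nearest integer, assumed here as the fixed-point DC scale, and the final rounding `shift` is left as a parameter because the correct value per block size is not stated in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical DC-only inverse transform plus add for a size x size
 * block of 8-bit pixels. Assumes arithmetic right shift for negative
 * values, as on common compilers. */
static void idct_dc_only_add(uint8_t *dst, ptrdiff_t stride,
                             int16_t *coef, int size, int shift)
{
    /* Each 1-D pass scales the DC term by cos(pi/4) in 14-bit fixed point. */
    int dc = (coef[0] * 11585 + (1 << 13)) >> 14;
    dc = (dc * 11585 + (1 << 13)) >> 14;
    /* Final rounding to pixel scale. */
    dc = (dc + (1 << (shift - 1))) >> shift;

    coef[0] = 0; /* leave the coefficient buffer cleared */

    /* Every pixel in the block gets the same offset. */
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = clip_u8(dst[x] + dc);
        dst += stride;
    }
}
```

The per-pixel work is a single add and clip, with no butterflies or transposes, which fits the large whole-decode speedup shown in the timings above for content with many DC-only blocks.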
    • e4382a4a
      Diego Biurrun authored
    • hevc: Drop pointless av_unused attribute · 5c890225
      Diego Biurrun authored
    • metasound: Drop unused tables · 0983f911
      Diego Biurrun authored
  7. 29 Nov, 2016 2 commits