1. 07 Dec, 2016 6 commits
  2. 06 Dec, 2016 5 commits
  3. 05 Dec, 2016 7 commits
  4. 03 Dec, 2016 13 commits
  5. 02 Dec, 2016 7 commits
  6. 01 Dec, 2016 1 commit
  7. 30 Nov, 2016 1 commit
    • aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · cad42fad
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0
      
      By skipping individual 8x16 or 8x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8
      
      I.e. in general a very minor overhead for the full subpartition case due
      to the additional cmps, but a significant speedup for the cases when we
      only need to process a small part of the actual input data.
      
      Signed-off-by: Martin Storsjö <martin@martin.st>
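The slice-skipping idea described in the commit message can be illustrated with a minimal C sketch. This is not the NEON assembly from the commit: `transform_1d` is a toy linear transform standing in for the real IDCT, and the `nonzero_cols` parameter is a hypothetical stand-in for whatever the real code derives from eob. The sketch only shows why the first pass of a separable 2-D transform can skip a slice whose input is all zero: a linear transform maps zeros to zeros, so the intermediate buffer can be zero-filled instead of computed.

```c
#include <string.h>

#define N 16      /* block size (16x16) */
#define SLICE 8   /* number of columns handled per first-pass slice */

/* Toy linear 1-D transform (NOT the real IDCT): zero input gives zero output. */
static void transform_1d(const int *in, int in_stride, int *out, int out_stride)
{
    for (int k = 0; k < N; k++) {
        int sum = 0;
        for (int n = 0; n < N; n++)
            sum += in[n * in_stride] * (((k * n) % 7) - 3);
        out[k * out_stride] = sum;
    }
}

/*
 * Separable 2-D transform on a row-major N*N block.
 * nonzero_cols (hypothetical parameter): columns at index >= nonzero_cols
 * are assumed to contain only zeros.
 */
static void transform_2d(const int *coef, int *out, int nonzero_cols)
{
    int tmp[N * N];

    /* First pass: transform column j of coef into row j of tmp. */
    for (int i = 0; i < N; i += SLICE) {
        if (i >= nonzero_cols) {
            /* This slice and all later ones are empty: zero-fill and stop. */
            memset(&tmp[i * N], 0, sizeof(int) * N * (N - i));
            break;
        }
        for (int j = i; j < i + SLICE; j++)
            transform_1d(&coef[j], N, &tmp[j * N], 1);
    }

    /* Second pass: always runs over the full block. */
    for (int j = 0; j < N; j++)
        transform_1d(&tmp[j], N, &out[j], N);
}
```

Calling `transform_2d(coef, out, SLICE)` on a block whose right half is zero should give the same result as `transform_2d(coef, out, N)`, while doing half of the first-pass work. Only the first pass is shortened in this sketch; the second pass still processes the whole block.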