1. 05 Mar, 2017 1 commit
  2. 04 Mar, 2017 1 commit
  3. 01 Mar, 2017 13 commits
  4. 28 Feb, 2017 5 commits
  5. 27 Feb, 2017 7 commits
  6. 25 Feb, 2017 4 commits
  7. 24 Feb, 2017 3 commits
  8. 23 Feb, 2017 6 commits
    • aarch64: vp9itxfm: Reorder iadst16 coeffs · b8f66c08
      Martin Storsjö authored
      This matches the order used in the 16 bpp version.
      
      There, the coefficients are stored in this order so that they are
      accessed in the same order they are declared, which makes it easier
      to load only half of the coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version more closely.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Reorder iadst16 coeffs · 08074c09
      Martin Storsjö authored
      This matches the order used in the 16 bpp version.
      
      There, the coefficients are stored in this order so that they are
      accessed in the same order they are declared, which makes it easier
      to load only half of the coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version more closely.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · 09eb88a1
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference from the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Reorder the idct coefficients for better pairing · de06bdfe
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference from the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 65aa002d
      Martin Storsjö authored
      The idct32x32 function actually pushed d8-d15 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      After this, we can still skip pushing d12-d15.
      
      Before:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Avoid reloading the idct32 coefficients · 402546a1
      Martin Storsjö authored
      The idct32x32 function actually pushed q4-q7 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
      q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
      in the idct16 function), and the lanewise vmul needs a register in
      the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
      while doing idct16.
      
      While keeping these coefficients in registers, we can still skip pushing
      q7.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
      Signed-off-by: Martin Storsjö <martin@martin.st>
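
The "Reorder the idct coefficients for better pairing" commits above describe moving the unused table slot from the 16th position to the second one, so that no coefficient pair is split across registers. The sketch below only models that layout argument. It is not the assembly from these commits. The helper names (`build_pairs`, `straddling_pairs`), the zero-based slot indices, and the lane counts (4 or 8 16-bit coefficients per register) are assumptions made for illustration.

```python
def build_pairs(pad_slot, total=16):
    """Return the coefficient pairs for a 16-slot table (zero-based slots).

    Slot 0 holds the single element that is not used pairwise, pad_slot is
    the unused element, and the remaining slots form consecutive pairs.
    """
    used = [i for i in range(1, total) if i != pad_slot]
    return [(used[i], used[i + 1]) for i in range(0, len(used), 2)]


def straddling_pairs(pairs, lanes_per_reg):
    """Return the pairs whose two elements land in different registers."""
    return [(a, b) for a, b in pairs if a // lanes_per_reg != b // lanes_per_reg]


# Old layout: the unused element is the 16th one (slot 15).
old_pairs = build_pairs(pad_slot=15)
# New layout: the unused element is moved to the second slot (slot 1).
new_pairs = build_pairs(pad_slot=1)

for lanes in (4, 8):  # assumed register widths, in 16-bit lanes
    print(lanes, "lanes, old layout split pairs:", straddling_pairs(old_pairs, lanes))
    print(lanes, "lanes, new layout split pairs:", straddling_pairs(new_pairs, lanes))
```

In this model, every pair in the new layout starts on an even slot, so it can never cross a register boundary for any even lane count. That is consistent with the commit's remark that the reordering simplifies loading only part of the coefficients.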