1. 01 Mar, 2017 12 commits
  2. 28 Feb, 2017 5 commits
  3. 27 Feb, 2017 7 commits
  4. 25 Feb, 2017 4 commits
  5. 24 Feb, 2017 3 commits
  6. 23 Feb, 2017 9 commits
    • aarch64: vp9itxfm: Reorder iadst16 coeffs · b8f66c08
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      There they are in this order, to make sure we access them in the
      same order they are declared, easing loading only half of the
      coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Reorder iadst16 coeffs · 08074c09
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      There they are in this order, to make sure we access them in the
      same order they are declared, easing loading only half of the
      coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · 09eb88a1
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, to make the later element pairs not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
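The pairing argument in this commit message can be checked with a small model. The sketch below is illustrative only: it assumes 16-bit coefficients, so a 64-bit register holds 4 lanes and a 128-bit register holds 8, and it uses placeholder slot names (`single`, `pairN.M`, `unused`) rather than the real coefficient values, which are not shown here.

```python
# Hypothetical model: 16 coefficient slots, one used alone, 14 used as
# seven consecutive pairs, and one unused. Names are placeholders.
def layout(unused_slot):
    """Return 16 slot labels with the unused entry at the given index."""
    slots = ["single"] + [f"pair{i // 2}.{i % 2}" for i in range(14)]
    slots.insert(unused_slot, "unused")
    return slots


def split_pairs(slots, lanes_per_register):
    """List the pairs whose two elements land in different registers."""
    where = {name: idx // lanes_per_register for idx, name in enumerate(slots)}
    return [p for p in range(7) if where[f"pair{p}.0"] != where[f"pair{p}.1"]]


old = layout(unused_slot=15)  # unused element in the 16th slot
new = layout(unused_slot=1)   # unused element moved to the second slot

for lanes in (4, 8):          # assumed: 64-bit and 128-bit registers
    print(lanes, split_pairs(old, lanes), split_pairs(new, lanes))
```

Under these assumptions the old layout splits some pairs across register boundaries for both register sizes, while the new layout splits none, so any contiguous subset of registers contains only whole pairs.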
    • arm: vp9itxfm: Reorder the idct coefficients for better pairing · de06bdfe
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, to make the later element pairs not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 65aa002d
      Martin Storsjö authored
      The idct32x32 function actually pushed d8-d15 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      After this, we can still skip pushing d12-d15.
      
      Before:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9itxfm: Avoid reloading the idct32 coefficients · 402546a1
      Martin Storsjö authored
      The idct32x32 function actually pushed q4-q7 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
      q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
      in the idct16 function), and the lanewise vmul needs a register in
      the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
      while doing idct16.
      
      While keeping these coefficients in registers, we can still skip
      pushing q7.
      
      Before:                              Cortex A7       A8       A9      A53
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9lpf: Implement the mix2_44 function with one single filter pass · 575e31e9
      Martin Storsjö authored
      For this case, with 8 inputs but only changing 4 of them, we can fit
      all 16 input pixels into a q register, and still have enough temporary
      registers for doing the loop filter.
      
      The wd=8 filters would require too many temporary registers for
      processing all 16 pixels at once though.
      
      Before:                          Cortex A7      A8     A9     A53
      vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 · 3bf9c483
      Martin Storsjö authored
      This is one cycle faster in total, and three instructions fewer.
      
      Before:
      vp9_loop_filter_mix2_v_44_16_neon: 123.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon: 122.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
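To see why the two instruction sequences named in the title can be interchangeable, here is a rough byte-level model of the AArch64 operations involved. It assumes (this is not stated in the commit message) that the input is a 16-bit value holding two packed 8-bit parameters, and that the goal is a 16-byte vector with the low byte in the first eight lanes and the high byte in the last eight. The exact operands of the real code may differ; the function names are made up for illustration.

```python
# Vectors are modelled as lists of 16 bytes, lane 0 first (little-endian).
def dup_16b(w):
    """dup Vd.16b, Wn: broadcast the low byte of w to all 16 lanes."""
    return [w & 0xFF] * 16


def dup_8h(w):
    """dup Vd.8h, Wn: broadcast the low halfword of w to all 8 halfwords."""
    return [w & 0xFF, (w >> 8) & 0xFF] * 8


def rev16(v):
    """rev16 Vd.16b, Vn.16b: swap the two bytes inside each halfword."""
    return [v[i ^ 1] for i in range(16)]


def uzp1(a, b):
    """uzp1 Vd.16b, Vn.16b, Vm.16b: even lanes of a, then even lanes of b."""
    return a[0::2] + b[0::2]


def trn1_2d(a, b):
    """trn1 Vd.2d, Vn.2d, Vm.2d: low 64 bits of a, then low 64 bits of b."""
    return a[:8] + b[:8]


def old_sequence(w):      # assumed shape of dup + lsr + dup + trn1
    v0 = dup_16b(w)
    w >>= 8               # lsr
    v1 = dup_16b(w)
    return trn1_2d(v0, v1)


def new_sequence(w):      # assumed shape of dup + rev16 + uzp1
    v0 = dup_8h(w)
    v1 = rev16(v0)
    return uzp1(v0, v1)
```

In this model both sequences give the same vector for every 16-bit input. The new one avoids the scalar shift and the second transfer from a general-purpose register.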
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · c582cb85
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
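The saturation argument can be verified exhaustively. The sketch below assumes the filter mask test has the form `2*|p0-q0| + |p1-q1|/2 <= E`, which is not spelled out in the commit message, and models the 8-bit path with saturating unsigned additions. All function names are hypothetical.

```python
def sat_add_u8(x, y):
    """Unsigned 8-bit saturating add, as a NEON saturating add would do."""
    return min(x + y, 255)


def wide_metric(d0, d1):
    """Reference value computed without any overflow (needs > 8 bits)."""
    return 2 * d0 + (d1 >> 1)


def narrow_metric(d0, d1):
    """Same value kept within 8 bits by saturating each addition."""
    return sat_add_u8(sat_add_u8(d0, d0), d1 >> 1)


def safe_for_limits_up_to(max_e):
    """True if 'metric <= E' gives the same answer for every E <= max_e."""
    for d0 in range(256):          # d0 stands for |p0 - q0|
        for d1 in range(256):      # d1 stands for |p1 - q1|
            wide, narrow = wide_metric(d0, d1), narrow_metric(d0, d1)
            # Two different values compare identically against every
            # limit in 0..max_e only if both exceed max_e.
            if wide != narrow and min(wide, narrow) <= max_e:
                return False
    return True


print(safe_for_limits_up_to(193))
```

Whenever the two metrics differ, the narrow one has saturated to 255 while the true value is larger, so the comparison result only changes for a limit of 255 or more. A maximum E of 193 leaves ample margin.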