1. 11 Mar, 2017 5 commits
  2. 23 Feb, 2017 2 commits
    • aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 · 3bf9c483
      Martin Storsjö authored
      This is one cycle faster in total, and three instructions fewer.
      
      Before:
      vp9_loop_filter_mix2_v_44_16_neon: 123.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon: 122.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
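The instruction swap named in the title above can be checked with a scalar model. The sketch below is an assumption-laden illustration, not the code from the commit: it assumes two 8-bit filter parameters are packed into the low 16 bits of a general-purpose register and must be expanded so that the first eight byte lanes hold the low byte and the last eight hold the high byte. The function names (`expand_old`, `expand_new`, `sequences_match`) and the register names in the comments are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar models of the NEON instructions, using little-endian byte lanes. */

/* dup Vd.8b, Wn: broadcast the low byte of w into 8 byte lanes. */
static void dup_8b(uint8_t d[8], uint32_t w)
{
    memset(d, (int)(w & 0xff), 8);
}

/* dup Vd.8h, Wn: broadcast the low 16 bits of w into 8 halfword lanes. */
static void dup_8h(uint8_t d[16], uint32_t w)
{
    for (int i = 0; i < 8; i++) {
        d[2 * i]     = (uint8_t)(w & 0xff);
        d[2 * i + 1] = (uint8_t)((w >> 8) & 0xff);
    }
}

/* rev16 Vd.16b, Vn.16b: swap the two bytes inside every halfword. */
static void rev16_16b(uint8_t d[16], const uint8_t s[16])
{
    for (int i = 0; i < 8; i++) {
        d[2 * i]     = s[2 * i + 1];
        d[2 * i + 1] = s[2 * i];
    }
}

/* uzp1 Vd.16b, Vn.16b, Vm.16b: even bytes of n, then even bytes of m. */
static void uzp1_16b(uint8_t d[16], const uint8_t n[16], const uint8_t m[16])
{
    uint8_t t[16]; /* temporary, so that d may alias n or m */
    for (int i = 0; i < 8; i++) {
        t[i]     = n[2 * i];
        t[8 + i] = m[2 * i];
    }
    memcpy(d, t, 16);
}

/* Old sequence: dup + lsr + dup + trn1 (trn1 on 64-bit lanes). */
void expand_old(uint8_t out[16], uint32_t w)
{
    uint8_t v0[8], v1[8];
    dup_8b(v0, w);            /* dup  v0.8b, w2             */
    w >>= 8;                  /* lsr  w2, w2, #8            */
    dup_8b(v1, w);            /* dup  v1.8b, w2             */
    memcpy(out, v0, 8);       /* trn1 v0.2d, v0.2d, v1.2d:  */
    memcpy(out + 8, v1, 8);   /* low half v0, high half v1  */
}

/* New sequence: dup + rev16 + uzp1. */
void expand_new(uint8_t out[16], uint32_t w)
{
    uint8_t v0[16], v1[16];
    dup_8h(v0, w);            /* dup   v0.8h,  w2              */
    rev16_16b(v1, v0);        /* rev16 v1.16b, v0.16b          */
    uzp1_16b(out, v0, v1);    /* uzp1  v0.16b, v0.16b, v1.16b  */
}

/* Returns 1 when both sequences produce the same 16 bytes for w. */
int sequences_match(uint32_t w)
{
    uint8_t a[16], b[16];
    expand_old(a, w);
    expand_new(b, w);
    return memcmp(a, b, 16) == 0;
}
```

Under these assumptions the new sequence needs one instruction fewer per expanded parameter, which would account for three instructions if three parameters are expanded this way.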
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · c582cb85
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
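The reasoning in the commit above (E never exceeds 193, so saturating at 255 cannot change the outcome) can be checked exhaustively. The sketch below assumes the edge test has the usual VP9 form `2*|p0-q0| + |p1-q1|/2 <= E`; that formula, the helper names, and the use of a `uqadd`-style saturating add are assumptions made for illustration and are not stated in the commit message.

```c
#include <stdint.h>

/* Saturating unsigned 8-bit add, modelled on NEON uqadd for byte lanes. */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Reference: evaluate the sum at full precision, then compare.
 * d0 stands for |p0 - q0| and d1 for |p1 - q1|. */
int edge_ok_wide(uint8_t d0, uint8_t d1, uint8_t E)
{
    return 2u * d0 + (unsigned)(d1 >> 1) <= E;
}

/* 8-bit variant: every intermediate stays within a byte lane. */
int edge_ok_8bit(uint8_t d0, uint8_t d1, uint8_t E)
{
    uint8_t t = uqadd8(d0, d0);            /* 2 * d0, saturated  */
    t = uqadd8(t, (uint8_t)(d1 >> 1));     /* + d1 / 2, saturated */
    return t <= E;
}

/* Returns 1 if both variants agree for every input with E <= 193. */
int edge_tests_agree(void)
{
    for (unsigned E = 0; E <= 193; E++)
        for (unsigned d0 = 0; d0 < 256; d0++)
            for (unsigned d1 = 0; d1 < 256; d1++)
                if (edge_ok_wide((uint8_t)d0, (uint8_t)d1, (uint8_t)E) !=
                    edge_ok_8bit((uint8_t)d0, (uint8_t)d1, (uint8_t)E))
                    return 0;
    return 1;
}
```

Whenever the exact sum exceeds 255, the saturated value is 255, which is still greater than any E up to 193, so both variants reject the edge.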
  3. 12 Feb, 2017 1 commit
  4. 11 Feb, 2017 1 commit
  5. 10 Feb, 2017 1 commit
    • arm/aarch64: vp9lpf: Calculate !hev directly · e1f9de86
      Martin Storsjö authored
      Previously we first calculated hev, and then negated it.
      
      Since the negation could already be scheduled in the middle
      of another calculation, we don't see a gain in every case.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
      vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
      After:
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
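The idea in the commit above, producing the inverted mask with one comparison instead of a comparison followed by a negation, can be modelled per lane in C. This is only a sketch: it assumes hev is the lane mask for `m > H`, where `m` stands for the larger of the two neighbouring pixel differences, and it models the comparisons on the AArch64 `cmhi`/`cmhs` instructions. The commit message does not say which instructions were used, and all function names here are hypothetical.

```c
#include <stdint.h>

/* Lane-wise unsigned compares: 0xff when true, 0x00 when false. */
static uint8_t cmhi8(uint8_t a, uint8_t b) { return a >  b ? 0xff : 0x00; }
static uint8_t cmhs8(uint8_t a, uint8_t b) { return a >= b ? 0xff : 0x00; }

/* Two steps: compute hev = (m > H), then invert the mask. */
uint8_t not_hev_two_step(uint8_t m, uint8_t H)
{
    uint8_t hev = cmhi8(m, H);
    return (uint8_t)~hev;
}

/* One step: !hev = (H >= m), a single compare with swapped operands. */
uint8_t not_hev_direct(uint8_t m, uint8_t H)
{
    return cmhs8(H, m);
}

/* Returns 1 if both variants give the same mask for every input pair. */
int not_hev_variants_agree(void)
{
    for (unsigned m = 0; m < 256; m++)
        for (unsigned H = 0; H < 256; H++)
            if (not_hev_two_step((uint8_t)m, (uint8_t)H) !=
                not_hev_direct((uint8_t)m, (uint8_t)H))
                return 0;
    return 1;
}
```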
  6. 14 Jan, 2017 2 commits
  7. 16 Nov, 2016 2 commits
  8. 15 Nov, 2016 1 commit
    • aarch64: vp9: Implement NEON loop filters · f1212e47
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the loop filters with
      16 pixels at a time. The implementation is fully templated, with
      a single macro which can generate versions both 8 and 16 pixels
      wide, for the 4, 8 and 16 pixel loop filters
      (and the 4/8 mixed versions as well).
      
      For the 8 pixel wide versions, it is pretty close in speed (the
      v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
      and h_8_8 filters seem to get some gain in the load/transpose/store
      part). For the 16 pixels wide ones, we get a speedup of around
      1.2-1.4x compared to the 32 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
      vp9_loop_filter_h_4_8_neon:          144.0   127.2
      vp9_loop_filter_h_8_8_neon:          207.0   182.5
      vp9_loop_filter_h_16_8_neon:         415.0   328.7
      vp9_loop_filter_h_16_16_neon:        672.0   558.6
      vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
      vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
      vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
      vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
      vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
      vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
      vp9_loop_filter_v_4_8_neon:           89.0    88.7
      vp9_loop_filter_v_8_8_neon:          141.0   137.7
      vp9_loop_filter_v_16_8_neon:         295.0   272.7
      vp9_loop_filter_v_16_16_neon:        546.0   453.7
      
      The speedup vs C code in checkasm tests is around 2-7x, which is
      pretty much the same as for the 32 bit version. Even if these functions
      are faster than their 32 bit equivalent, the C version that we compare
      to also became around 1.3-1.7x faster than the C version in 32 bit.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-5x.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                               A57 gcc-5.3  neon
      loop_filter_h_4_8_neon:        256.6  93.4
      loop_filter_h_8_8_neon:        307.3 139.1
      loop_filter_h_16_8_neon:       340.1 254.1
      loop_filter_h_16_16_neon:      827.0 407.9
      loop_filter_mix2_h_44_16_neon: 524.5 155.4
      loop_filter_mix2_h_48_16_neon: 644.5 173.3
      loop_filter_mix2_h_84_16_neon: 630.5 222.0
      loop_filter_mix2_h_88_16_neon: 697.3 222.0
      loop_filter_mix2_v_44_16_neon: 598.5 100.6
      loop_filter_mix2_v_48_16_neon: 651.5 127.0
      loop_filter_mix2_v_84_16_neon: 591.5 167.1
      loop_filter_mix2_v_88_16_neon: 855.1 166.7
      loop_filter_v_4_8_neon:        271.7  65.3
      loop_filter_v_8_8_neon:        312.5 106.9
      loop_filter_v_16_8_neon:       473.3 206.5
      loop_filter_v_16_16_neon:      976.1 327.8
      
      The speed-up compared to the C functions is 2.5x to 6x, and the Cortex A57
      is again 30-50% faster than the Cortex A53.
      
      This is an adapted cherry-pick from libav commits
      9d2afd1e and 31756abe.
      Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
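The speedup figures quoted in the commit above are ratios of the ARM and AArch64 columns in its Cortex A53 table. As a small worked example, the sketch below recomputes them for four of the 16 pixel wide rows, copied from that table. The struct and function names are hypothetical.

```c
#include <stddef.h>

/* One row of the Cortex A53 table: runtimes for the 32 bit ARM and
 * the AArch64 implementation (units as reported in the table). */
struct timing {
    const char *name;
    double arm;
    double aarch64;
};

/* Four 16 pixel wide rows, values copied from the commit message. */
static const struct timing a53_16wide[] = {
    { "vp9_loop_filter_h_16_16_neon",      672.0, 558.6 },
    { "vp9_loop_filter_mix2_h_44_16_neon", 302.0, 203.5 },
    { "vp9_loop_filter_mix2_v_88_16_neon", 302.0, 218.2 },
    { "vp9_loop_filter_v_16_16_neon",      546.0, 453.7 },
};

/* Speedup of AArch64 over ARM for one row (higher is better). */
double speedup(const struct timing *t)
{
    return t->arm / t->aarch64;
}

/* Smallest and largest speedup over n rows; n must be at least 1. */
void speedup_range(const struct timing *t, size_t n, double *lo, double *hi)
{
    *lo = *hi = speedup(&t[0]);
    for (size_t i = 1; i < n; i++) {
        double s = speedup(&t[i]);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

For these four rows the ratios fall between about 1.20x and 1.48x, in line with the "around 1.2-1.4x" stated in the message.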
  9. 14 Nov, 2016 1 commit
  10. 13 Nov, 2016 1 commit
    • aarch64: vp9: Implement NEON loop filters · 9d2afd1e
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the loop filters with
      16 pixels at a time. The implementation is fully templated, with
      a single macro which can generate versions both 8 and 16 pixels
      wide, for the 4, 8 and 16 pixel loop filters
      (and the 4/8 mixed versions as well).
      
      For the 8 pixel wide versions, it is pretty close in speed (the
      v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
      and h_8_8 filters seem to get some gain in the load/transpose/store
      part). For the 16 pixels wide ones, we get a speedup of around
      1.2-1.4x compared to the 32 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
      vp9_loop_filter_h_4_8_neon:          144.0   127.2
      vp9_loop_filter_h_8_8_neon:          207.0   182.5
      vp9_loop_filter_h_16_8_neon:         415.0   328.7
      vp9_loop_filter_h_16_16_neon:        672.0   558.6
      vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
      vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
      vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
      vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
      vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
      vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
      vp9_loop_filter_v_4_8_neon:           89.0    88.7
      vp9_loop_filter_v_8_8_neon:          141.0   137.7
      vp9_loop_filter_v_16_8_neon:         295.0   272.7
      vp9_loop_filter_v_16_16_neon:        546.0   453.7
      
      The speedup vs C code in checkasm tests is around 2-7x, which is
      pretty much the same as for the 32 bit version. Even if these functions
      are faster than their 32 bit equivalent, the C version that we compare
      to also became around 1.3-1.7x faster than the C version in 32 bit.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-5x.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                               A57 gcc-5.3  neon
      loop_filter_h_4_8_neon:        256.6  93.4
      loop_filter_h_8_8_neon:        307.3 139.1
      loop_filter_h_16_8_neon:       340.1 254.1
      loop_filter_h_16_16_neon:      827.0 407.9
      loop_filter_mix2_h_44_16_neon: 524.5 155.4
      loop_filter_mix2_h_48_16_neon: 644.5 173.3
      loop_filter_mix2_h_84_16_neon: 630.5 222.0
      loop_filter_mix2_h_88_16_neon: 697.3 222.0
      loop_filter_mix2_v_44_16_neon: 598.5 100.6
      loop_filter_mix2_v_48_16_neon: 651.5 127.0
      loop_filter_mix2_v_84_16_neon: 591.5 167.1
      loop_filter_mix2_v_88_16_neon: 855.1 166.7
      loop_filter_v_4_8_neon:        271.7  65.3
      loop_filter_v_8_8_neon:        312.5 106.9
      loop_filter_v_16_8_neon:       473.3 206.5
      loop_filter_v_16_16_neon:      976.1 327.8
      
      The speed-up compared to the C functions is 2.5x to 6x, and the Cortex A57
      is again 30-50% faster than the Cortex A53.
      Signed-off-by: Martin Storsjö <martin@martin.st>