1. 19 Mar, 2017 1 commit
  2. 16 Mar, 2017 1 commit
  3. 11 Mar, 2017 5 commits
    • arm: vp9lpf: Implement the mix2_44 function with one single filter pass · a88db8b9
      Martin Storsjö authored
      For this case, with 8 inputs but only changing 4 of them, we can fit
      all 16 input pixels into a q register, and still have enough temporary
      registers for doing the loop filter.
      
      The wd=8 filters would require too many temporary registers for
      processing all 16 pixels at once though.
      
      Before:                          Cortex A7      A8     A9     A53
      vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
      
      This is cherrypicked from libav commit 575e31e9.
      Signed-off-by: Martin Storsjö <martin@martin.st>
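To illustrate the single-pass idea in the mix2_44 commit above, the following C sketch models the 16 pixels along the edge as 16 byte lanes and checks that one pass over all lanes gives the same result as two passes over 8 lanes each. It assumes that a mix2 parameter packs two 8-bit thresholds into one integer (low byte for the first 8 pixels, high byte for the second 8), which the commit message does not state. All function names are hypothetical, and a plain threshold compare stands in for the real loop filter.

```c
#include <stdint.h>

/* Assumption: low byte applies to lanes 0-7, high byte to lanes 8-15. */
static void expand_packed(int packed, uint8_t lanes[16])
{
    for (int i = 0; i < 16; i++)
        lanes[i] = (uint8_t)((i < 8 ? packed : packed >> 8) & 0xff);
}

/* Single pass: all 16 lanes at once, each lane with its own threshold. */
static unsigned mask16_single_pass(const uint8_t diff[16], int packed)
{
    uint8_t thresh[16];
    unsigned mask = 0;

    expand_packed(packed, thresh);
    for (int i = 0; i < 16; i++)
        if (diff[i] <= thresh[i])
            mask |= 1u << i;
    return mask;
}

/* One 8-lane pass with a single scalar threshold. */
static unsigned mask8(const uint8_t diff[8], int thresh)
{
    unsigned mask = 0;

    for (int i = 0; i < 8; i++)
        if (diff[i] <= thresh)
            mask |= 1u << i;
    return mask;
}

/* Two passes: one per 8-pixel half. */
static unsigned mask16_two_passes(const uint8_t diff[16], int packed)
{
    return mask8(diff, packed & 0xff) |
           (mask8(diff + 8, (packed >> 8) & 0xff) << 8);
}

/* Demo input: differences 0, 16, ..., 240; thresholds 40 (low) and 200 (high). */
static void fill_demo(uint8_t diff[16])
{
    for (int i = 0; i < 16; i++)
        diff[i] = (uint8_t)(i * 16);
}

static unsigned demo_single_pass(void)
{
    uint8_t diff[16];

    fill_demo(diff);
    return mask16_single_pass(diff, (200 << 8) | 40);
}

static unsigned demo_two_passes(void)
{
    uint8_t diff[16];

    fill_demo(diff);
    return mask16_two_passes(diff, (200 << 8) | 40);
}
```

Both demo functions return the same lane mask. The only extra work in the single-pass form is expanding the packed thresholds to per-lane values once.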
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · 3fbbad29
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      
      This is cherrypicked from libav commit c582cb85.
      Signed-off-by: Martin Storsjö <martin@martin.st>
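The reasoning in the "comparison to E" commit above can be checked with a small C sketch. It assumes the quantity compared against E is `2*|p0 - q0| + (|p1 - q1| >> 1)`, which the commit message does not spell out. The function names are hypothetical. A saturating 8-bit sum equals `min(true_sum, 255)`, so the comparison result is unchanged whenever E is at most 254, and the stated maximum of 193 is well inside that range.

```c
#include <stdint.h>

static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* 8-bit add that clamps at 255 instead of wrapping. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Reference: the sum is formed in a wide type and can reach 637. */
static int cmp_wide(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    int sum = 2 * absdiff_u8(p0, q0) + (absdiff_u8(p1, q1) >> 1);
    return sum <= E;
}

/* Same comparison with every intermediate kept in 8 bits. */
static int cmp_sat8(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    uint8_t d0 = absdiff_u8(p0, q0);
    uint8_t d1 = absdiff_u8(p1, q1);
    uint8_t sum = sat_add_u8(sat_add_u8(d0, d0), (uint8_t)(d1 >> 1));
    return sum <= E;
}

/* Exhaustively compare both forms for all differences and all E <= max_E. */
static int count_mismatches(int max_E)
{
    int mismatches = 0;

    for (int E = 0; E <= max_E; E++)
        for (int d0 = 0; d0 < 256; d0++)
            for (int d1 = 0; d1 < 256; d1++)
                if (cmp_wide((uint8_t)d1, (uint8_t)d0, 0, 0, (uint8_t)E) !=
                    cmp_sat8((uint8_t)d1, (uint8_t)d0, 0, 0, (uint8_t)E))
                    mismatches++;
    return mismatches;
}
```

`count_mismatches(193)` returns 0. The two forms only diverge at E = 255, where a saturated sum of 255 passes the test although the true sum may be larger.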
    • arm: vp9lpf: Interleave the start of flat8in into the calculation above · 83399cf5
      Martin Storsjö authored
      This adds lots of extra .ifs, but speeds it up by a couple of cycles
      by avoiding stalls.
      
      This is cherrypicked from libav commit e18c3900.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9lpf: Use orrs instead of orr+cmp · 92ab8374
      Martin Storsjö authored
      This is cherrypicked from libav commit 435cd7bc.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9lpf: Calculate !hev directly · f0ecbb13
      Martin Storsjö authored
      Previously we first calculated hev, and then negated it.
      
      Since we were able to schedule the negation in the middle
      of another calculation, we don't see a gain in every case.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
      vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
      After:
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      
      This is cherrypicked from libav commit e1f9de86.
      Signed-off-by: Martin Storsjö <martin@martin.st>
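The "!hev" change above amounts to replacing a compare followed by a bitwise NOT with the opposite compare. The C sketch below shows the equivalence on byte masks (0xff for true, 0x00 for false). It assumes hev is defined as `max(|p1 - p0|, |q1 - q0|) > H`, which the commit message does not state, and the function names are hypothetical.

```c
#include <stdint.h>

static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

static uint8_t max_u8(uint8_t a, uint8_t b)
{
    return a > b ? a : b;
}

/* Before: build the hev mask, then invert it (two operations). */
static uint8_t not_hev_two_steps(uint8_t p1, uint8_t p0,
                                 uint8_t q0, uint8_t q1, uint8_t H)
{
    uint8_t m = max_u8(absdiff_u8(p1, p0), absdiff_u8(q1, q0));
    uint8_t hev = m > H ? 0xff : 0x00;
    return (uint8_t)~hev;
}

/* After: a single "less than or equal" compare yields !hev directly. */
static uint8_t not_hev_direct(uint8_t p1, uint8_t p0,
                              uint8_t q0, uint8_t q1, uint8_t H)
{
    uint8_t m = max_u8(absdiff_u8(p1, p0), absdiff_u8(q1, q0));
    return m <= H ? 0xff : 0x00;
}

/* Exhaustive check over both differences and every threshold. */
static int count_hev_mismatches(void)
{
    int mismatches = 0;

    for (int H = 0; H < 256; H++)
        for (int dp = 0; dp < 256; dp++)
            for (int dq = 0; dq < 256; dq++)
                if (not_hev_two_steps((uint8_t)dp, 0, 0, (uint8_t)dq, (uint8_t)H) !=
                    not_hev_direct((uint8_t)dp, 0, 0, (uint8_t)dq, (uint8_t)H))
                    mismatches++;
    return mismatches;
}
```

The direct form saves one operation per vector. As the commit message notes, the saving may not show up in the timings where the negation was already hidden behind another calculation.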
  4. 23 Feb, 2017 2 commits
    • arm: vp9lpf: Implement the mix2_44 function with one single filter pass · 575e31e9
      Martin Storsjö authored
      For this case, with 8 inputs but only changing 4 of them, we can fit
      all 16 input pixels into a q register, and still have enough temporary
      registers for doing the loop filter.
      
      The wd=8 filters would require too many temporary registers for
      processing all 16 pixels at once though.
      
      Before:                          Cortex A7      A8     A9     A53
      vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · c582cb85
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
  5. 11 Feb, 2017 1 commit
  6. 10 Feb, 2017 2 commits
    • 435cd7bc
    • arm/aarch64: vp9lpf: Calculate !hev directly · e1f9de86
      Martin Storsjö authored
      Previously we first calculated hev, and then negated it.
      
      Since we were able to schedule the negation in the middle
      of another calculation, we don't see a gain in every case.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
      vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
      After:
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
  7. 15 Nov, 2016 1 commit
    • arm: vp9: Add NEON loop filters · 6bec60a6
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The implementation tries to have smart handling of cases
      where no pixels need the full filtering for the 8/16 width
      filters, skipping both calculation and writeback of the
      unmodified pixels in those cases. The actual effect of this
      is hard to test with checkasm though, since it tests the
      full filtering, and the benefit depends on how many filtered
      blocks use the shortcut.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                Cortex       A7     A8     A9    A53
      vp9_loop_filter_h_4_8_neon:          2.72   2.68   1.78   3.15
      vp9_loop_filter_h_8_8_neon:          2.36   2.38   1.70   2.91
      vp9_loop_filter_h_16_8_neon:         1.80   1.89   1.45   2.01
      vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
      vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
      vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
      vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
      vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
      vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
      vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
      vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
      vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
      vp9_loop_filter_v_4_8_neon:          4.53   4.47   3.31   6.05
      vp9_loop_filter_v_8_8_neon:          3.58   3.99   2.92   5.17
      vp9_loop_filter_v_16_8_neon:         3.40   3.50   2.81   4.68
      vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02
      
      The speedup vs C code is around 2-6x. The numbers are quite
      inconclusive though, since the checkasm test runs multiple filterings
      on top of each other, so later rounds might end up with different
      codepaths (different decisions on which filter to apply, based
      on input pixel differences). Disabling the early-exit in the asm
      doesn't give a fair comparison either though, since the C code
      only does the necessary calculations for each row.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-9x.
      
      This is pretty similar in runtime to the corresponding routines
      in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
      vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
      to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
      and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
      and vertical is flipped between the libraries.)
      
      In order to have stable, comparable numbers, the early exits in both
      asm versions were disabled, forcing the full filtering codepath.
      
                                 Cortex           A7      A8      A9     A53
      vp9_loop_filter_h_16_8_neon:             597.2   472.0   482.4   415.0
      libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
      vp9_loop_filter_v_16_8_neon:             500.2   422.5   429.7   295.0
      libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
      vp9_loop_filter_v_16_16_neon:            905.0   784.7   791.5   546.0
      libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2
      
      Our version is consistently faster on A7 and A53, marginally slower on
      A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
      three tests in this particular test run).
      
      This is an adapted cherry-pick from libav commit dd299a2d.
      Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
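The early-exit handling described in the "Add NEON loop filters" commit above can be sketched in C as follows. This is a simplified model, not the actual assembly: it assumes the 8-wide filter is only needed for a column whose inner pixels are "flat" (each of p3..p1 within 1 of p0, and each of q1..q3 within 1 of q0), and it ignores the other masks a real loop filter combines with that test. The names `is_flat8in`, `choose_path` and `path_for_ramp` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

enum lf_path { LF_PATH_NARROW = 0, LF_PATH_WIDE = 1 };

/* px holds 8 rows (p3, p2, p1, p0, q0, q1, q2, q3) of 8 columns each. */
static int is_flat8in(const uint8_t *px, int col)
{
    const int F = 1; /* assumed flatness threshold for 8-bit pixels */
    int p0 = px[3 * 8 + col];
    int q0 = px[4 * 8 + col];

    return abs(px[0 * 8 + col] - p0) <= F &&
           abs(px[1 * 8 + col] - p0) <= F &&
           abs(px[2 * 8 + col] - p0) <= F &&
           abs(px[5 * 8 + col] - q0) <= F &&
           abs(px[6 * 8 + col] - q0) <= F &&
           abs(px[7 * 8 + col] - q0) <= F;
}

/*
 * If no column is flat, the wide filter cannot change any pixel, so both
 * its calculation and the writeback of the outer rows can be skipped.
 */
static enum lf_path choose_path(const uint8_t *px)
{
    int any_flat = 0;

    for (int col = 0; col < 8; col++)
        any_flat |= is_flat8in(px, col);
    return any_flat ? LF_PATH_WIDE : LF_PATH_NARROW;
}

/* Demo input: every column holds the ramp 100, 100 + step, 100 + 2 * step, ... */
static enum lf_path path_for_ramp(int step)
{
    uint8_t px[64];

    for (int row = 0; row < 8; row++)
        for (int col = 0; col < 8; col++)
            px[row * 8 + col] = (uint8_t)(100 + row * step);
    return choose_path(px);
}
```

A constant block (`path_for_ramp(0)`) takes the wide path, while a steep ramp (`path_for_ramp(10)`) takes the narrow one. This also shows why the benefit is hard to measure with a fixed benchmark: which path runs depends entirely on the pixel data.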
  8. 11 Nov, 2016 1 commit
    • arm: vp9: Add NEON loop filters · dd299a2d
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The implementation tries to have smart handling of cases
      where no pixels need the full filtering for the 8/16 width
      filters, skipping both calculation and writeback of the
      unmodified pixels in those cases. The actual effect of this
      is hard to test with checkasm though, since it tests the
      full filtering, and the benefit depends on how many filtered
      blocks use the shortcut.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                Cortex       A7     A8     A9    A53
      vp9_loop_filter_h_4_8_neon:          2.72   2.68   1.78   3.15
      vp9_loop_filter_h_8_8_neon:          2.36   2.38   1.70   2.91
      vp9_loop_filter_h_16_8_neon:         1.80   1.89   1.45   2.01
      vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
      vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
      vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
      vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
      vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
      vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
      vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
      vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
      vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
      vp9_loop_filter_v_4_8_neon:          4.53   4.47   3.31   6.05
      vp9_loop_filter_v_8_8_neon:          3.58   3.99   2.92   5.17
      vp9_loop_filter_v_16_8_neon:         3.40   3.50   2.81   4.68
      vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02
      
      The speedup vs C code is around 2-6x. The numbers are quite
      inconclusive though, since the checkasm test runs multiple filterings
      on top of each other, so later rounds might end up with different
      codepaths (different decisions on which filter to apply, based
      on input pixel differences). Disabling the early-exit in the asm
      doesn't give a fair comparison either though, since the C code
      only does the necessary calculations for each row.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-9x.
      
      This is pretty similar in runtime to the corresponding routines
      in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
      vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
      to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
      and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
      and vertical is flipped between the libraries.)
      
      In order to have stable, comparable numbers, the early exits in both
      asm versions were disabled, forcing the full filtering codepath.
      
                                 Cortex           A7      A8      A9     A53
      vp9_loop_filter_h_16_8_neon:             597.2   472.0   482.4   415.0
      libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
      vp9_loop_filter_v_16_8_neon:             500.2   422.5   429.7   295.0
      libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
      vp9_loop_filter_v_16_16_neon:            905.0   784.7   791.5   546.0
      libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2
      
      Our version is consistently faster on A7 and A53, marginally slower on
      A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
      three tests in this particular test run).
      Signed-off-by: Martin Storsjö <martin@martin.st>