1. 04 Nov, 2016 1 commit
  2. 03 Nov, 2016 1 commit
    • arm: vp9: Add NEON optimizations of VP9 MC functions · ffbd1d2b
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The filter coefficients are signed values, where the product of the
      multiplication with one individual filter coefficient doesn't
      overflow a 16 bit signed value (the largest filter coefficient is
      127). But when the products are accumulated, the resulting sum can
      overflow the 16 bit signed range. Instead of accumulating in 32 bit,
      we accumulate the largest product (either index 3 or 4) last with a
      saturated addition.
      
      (The VP8 MC asm does something similar, but slightly simpler, by
      accumulating each half of the filter separately. In the VP9 MC
      filters, each half of the filter can also overflow though, so the
      largest component has to be handled individually.)
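
The accumulation scheme described above can be modelled in scalar C. This is only a sketch, not the NEON code from the commit: the helper names (`wrap16`, `sat_add16`, `round_shift_clip`, `filter8_ref`, `filter8_sat16`) are hypothetical, and the rounding shift by 7 assumes coefficients that sum to 128. It emulates one 16 bit lane: every product except the largest is added with wrapping arithmetic, and the largest product is added last with saturation, so a sum that exceeds the signed 16 bit range clamps to 32767 and still produces 255 after the final clip.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap a 32 bit value to int16_t, as a plain 16 bit lane addition would. */
static int16_t wrap16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}

/* Saturating signed 16 bit addition. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Add the rounding constant, shift right by 7 and clamp to 0..255. */
static uint8_t round_shift_clip(int32_t sum)
{
    int32_t v = sum + 64;
    if (v < 0) return 0;
    v >>= 7;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Reference: accumulate all eight products in 32 bit. */
static uint8_t filter8_ref(const uint8_t src[8], const int8_t f[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i];
    return round_shift_clip(sum);
}

/* 16 bit model: wrapping adds for seven taps, then one saturated add
 * for the largest tap (index 3 or 4). */
static uint8_t filter8_sat16(const uint8_t src[8], const int8_t f[8],
                             int largest)
{
    int16_t acc = 0;
    assert(largest == 3 || largest == 4);
    for (int i = 0; i < 8; i++) {
        if (i != largest)
            acc = wrap16(acc + f[i] * src[i]);
    }
    /* A single product fits in 16 bit: at most 127 * 255 = 32385. */
    acc = sat_add16(acc, (int16_t)(f[largest] * src[largest]));
    return round_shift_clip(acc);
}
```

The two functions agree as long as the sum of the seven smaller products stays within the signed 16 bit range, which is the property the commit message relies on.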
      
      Examples of relative speedup compared to the C version, from checkasm:
                             Cortex      A7     A8     A9    A53
      vp9_avg4_neon:                   1.71   1.15   1.42   1.49
      vp9_avg8_neon:                   2.51   3.63   3.14   2.58
      vp9_avg16_neon:                  2.95   6.76   3.01   2.84
      vp9_avg32_neon:                  3.29   6.64   2.85   3.00
      vp9_avg64_neon:                  3.47   6.67   3.14   2.80
      vp9_avg_8tap_smooth_4h_neon:     3.22   4.73   2.76   4.67
      vp9_avg_8tap_smooth_4hv_neon:    3.67   4.76   3.28   4.71
      vp9_avg_8tap_smooth_4v_neon:     5.52   7.60   4.60   6.31
      vp9_avg_8tap_smooth_8h_neon:     6.22   9.04   5.12   9.32
      vp9_avg_8tap_smooth_8hv_neon:    6.38   8.21   5.72   8.17
      vp9_avg_8tap_smooth_8v_neon:     9.22  12.66   8.15  11.10
      vp9_avg_8tap_smooth_64h_neon:    7.02  10.23   5.54  11.58
      vp9_avg_8tap_smooth_64hv_neon:   6.76   9.46   5.93   9.40
      vp9_avg_8tap_smooth_64v_neon:   10.76  14.13   9.46  13.37
      vp9_put4_neon:                   1.11   1.47   1.00   1.21
      vp9_put8_neon:                   1.23   2.17   1.94   1.48
      vp9_put16_neon:                  1.63   4.02   1.73   1.97
      vp9_put32_neon:                  1.56   4.92   2.00   1.96
      vp9_put64_neon:                  2.10   5.28   2.03   2.35
      vp9_put_8tap_smooth_4h_neon:     3.11   4.35   2.63   4.35
      vp9_put_8tap_smooth_4hv_neon:    3.67   4.69   3.25   4.71
      vp9_put_8tap_smooth_4v_neon:     5.45   7.27   4.49   6.52
      vp9_put_8tap_smooth_8h_neon:     5.97   8.18   4.81   8.56
      vp9_put_8tap_smooth_8hv_neon:    6.39   7.90   5.64   8.15
      vp9_put_8tap_smooth_8v_neon:     9.03  11.84   8.07  11.51
      vp9_put_8tap_smooth_64h_neon:    6.78   9.48   4.88  10.89
      vp9_put_8tap_smooth_64hv_neon:   6.99   8.87   5.94   9.56
      vp9_put_8tap_smooth_64v_neon:   10.69  13.30   9.43  14.34
      
      For the larger 8tap filters, the speedup vs C code is around 5-14x.
      
      This is significantly faster than libvpx's implementation of the same
      functions, at least when comparing the put_8tap_smooth_64 functions
      (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
      libvpx).
      
      Absolute runtimes from checkasm:
                                Cortex      A7        A8        A9       A53
      vp9_put_8tap_smooth_64h_neon:    20150.3   14489.4   19733.6   10863.7
      libvpx vpx_convolve8_horiz_neon: 52623.3   19736.4   21907.7   25027.7
      
      vp9_put_8tap_smooth_64v_neon:    14455.0   12303.9   13746.4    9628.9
      libvpx vpx_convolve8_vert_neon:  42090.0   17706.2   17659.9   16941.2
      
      Thus, on the A9, the horizontal filter is only marginally faster than
      libvpx, while our version is significantly faster on the other cores,
      and the vertical filter is significantly faster on all cores. The
      difference is especially large on the A7.
      
      The libvpx implementation does the accumulation in 32 bit, which
      probably explains most of the differences.

      Signed-off-by: Martin Storsjö <martin@martin.st>
  3. 29 Sep, 2016 3 commits
  4. 28 Sep, 2016 1 commit
  5. 22 Sep, 2016 2 commits
  6. 14 Sep, 2016 1 commit
  7. 26 Aug, 2016 4 commits
  8. 17 Aug, 2016 1 commit
  9. 10 Jul, 2016 1 commit
    • vp8/armv6: mc: avoid boolean expression in calculation · 5f74bd31
      Janne Grunau authored
      GNU as evaluates true as '-1' while Apple's variant and llvm's internal
      assembler evaluate it as '1'. The best way to avoid this madness is to
      eliminate boolean expressions instead of trying to fix it with
      preprocessor directives. Use a direct formula to calculate the
      required temporary space on the stack in
      ff_put_vp8_{epel,bilin}{4,8,16}_h[246]v[246]_armv6().
      
      Fixes a checkasm segfault in vp8dsp.mc when using llvm's internal
      assembler for a non-Apple target.
  10. 06 Jul, 2016 1 commit
  11. 26 Jun, 2016 1 commit
  12. 13 May, 2016 1 commit
  13. 04 May, 2016 1 commit
  14. 07 Apr, 2016 1 commit
    • build: miscellaneous cosmetics · 01621202
      Diego Biurrun authored
      Restore alphabetical order in lists, break overly long lines, do some
      prettyprinting, add some explanatory section comments, group parts
      together that belong together logically.
  15. 01 Mar, 2016 1 commit
  16. 26 Feb, 2016 2 commits
  17. 19 Feb, 2016 1 commit
  18. 24 Dec, 2015 1 commit
  19. 14 Dec, 2015 2 commits
    • arm: add ff_int32_to_float_fmul_array8_neon · 90b1b935
      Janne Grunau authored
      Quite a bit faster than int32_to_float_fmul_array8_c calling
      ff_int32_to_float_fmul_scalar_neon through FmtConvertContext.
      Number of cycles per int32_to_float_fmul_array8 call while decoding
      padded.dts on exynos5422:
      
                     before  after   change
      cortex-a7:     1270     951    -25%
      cortex-a15:     434     285    -34%
      
      checkasm --bench cycle counts:     cortex-a15   cortex-a7
      int32_to_float_fmul_array8_c:      1730.4       4384.5
      int32_to_float_fmul_array8_neon_c:  571.5       1694.3
      int32_to_float_fmul_array8_neon:    374.0       1448.8
      
      The differences between int32_to_float_fmul_array8_neon_c and
      int32_to_float_fmul_array8_neon are interesting. The former is the
      current behaviour of calling ff_int32_to_float_fmul_scalar_neon
      repeatedly from the C function. The raw numbers differ since checkasm
      uses different lengths than the dca decoder.
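
For orientation, the operation being benchmarked can be sketched in plain C. This is an assumption-laden model, not the project's code: the `_sketch` function names and flat signatures are hypothetical (the real functions are reached through FmtConvertContext), and the semantics assumed here are that each block of 8 integers is converted to float and scaled by its own multiplier. The array8 variant below loops over the scalar helper once per block, which mirrors the "calling the scalar function repeatedly" behaviour the message compares against.

```c
#include <assert.h>
#include <stdint.h>

/* Convert len integers to float and scale them by a single factor. */
static void int32_to_float_fmul_scalar_sketch(float *dst, const int32_t *src,
                                              float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = (float)src[i] * mul;
}

/* Convert len integers in blocks of 8, one scale factor per block.
 * Assumes len is a multiple of 8 and mul holds len / 8 factors. */
static void int32_to_float_fmul_array8_sketch(float *dst, const int32_t *src,
                                              const float *mul, int len)
{
    assert(len % 8 == 0);
    for (int i = 0; i < len; i += 8)
        int32_to_float_fmul_scalar_sketch(&dst[i], &src[i], *mul++, 8);
}
```

A dedicated implementation of the array8 form avoids one indirect call per block of 8 samples, which is a plausible source of the gap between the two NEON rows in the table above.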
    • arm: add a cpu flag for the VFPv2 vector mode · e2710e79
      Janne Grunau authored
      The vector mode was deprecated in ARMv7-A/VFPv3 and various cpu
      implementations do not support it in hardware. Depending on the OS,
      vector mode code will either be emulated in software or result in an
      illegal instruction on cpus which do not support it. This was not
      really a problem in practice since NEON implementations of the same
      functions are preferred. It will however become a problem for checkasm,
      which tests every cpu flag separately.
      
      Since this is a cpu feature that newer cpus no longer support, the
      behaviour of this flag differs from the other flags. It can only be
      activated by runtime cpu feature selection.
  20. 20 Jul, 2015 1 commit
  21. 17 Jul, 2015 5 commits
  22. 28 Feb, 2015 2 commits
  23. 15 Feb, 2015 1 commit
  24. 09 Dec, 2014 3 commits
  25. 08 Dec, 2014 1 commit