1. 28 Mar, 2017 1 commit
  2. 27 Mar, 2017 1 commit
  3. 24 Jan, 2017 2 commits
    • aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC · 638eceed
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This differs from the 8 bit version in mostly the same ways as the
      arm version does. For the horizontal filters, we do 16 pixels
      in parallel as well. For the 8 pixel wide vertical filters, we can
      accumulate 4 rows before storing, just as in the 8 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM   AArch64
      vp9_avg4_10bpp_neon:                      35.7      30.7
      vp9_avg8_10bpp_neon:                      93.5      84.7
      vp9_avg16_10bpp_neon:                    324.4     296.6
      vp9_avg32_10bpp_neon:                   1236.5    1148.2
      vp9_avg64_10bpp_neon:                   4639.6    4571.1
      vp9_avg_8tap_smooth_4h_10bpp_neon:       130.0     128.0
      vp9_avg_8tap_smooth_4hv_10bpp_neon:      440.0     440.5
      vp9_avg_8tap_smooth_4v_10bpp_neon:       114.0     105.5
      vp9_avg_8tap_smooth_8h_10bpp_neon:       327.0     314.0
      vp9_avg_8tap_smooth_8hv_10bpp_neon:      918.7     865.4
      vp9_avg_8tap_smooth_8v_10bpp_neon:       330.0     300.2
      vp9_avg_8tap_smooth_16h_10bpp_neon:     1187.5    1155.5
      vp9_avg_8tap_smooth_16hv_10bpp_neon:    2663.1    2591.0
      vp9_avg_8tap_smooth_16v_10bpp_neon:     1107.4    1078.3
      vp9_avg_8tap_smooth_64h_10bpp_neon:    17754.6   17454.7
      vp9_avg_8tap_smooth_64hv_10bpp_neon:   33285.2   33001.5
      vp9_avg_8tap_smooth_64v_10bpp_neon:    16066.9   16048.6
      vp9_put4_10bpp_neon:                      25.5      21.7
      vp9_put8_10bpp_neon:                      56.0      52.0
      vp9_put16_10bpp_neon/armv8:              183.0     163.1
      vp9_put32_10bpp_neon/armv8:              678.6     563.1
      vp9_put64_10bpp_neon/armv8:             2679.9    2195.8
      vp9_put_8tap_smooth_4h_10bpp_neon:       120.0     118.0
      vp9_put_8tap_smooth_4hv_10bpp_neon:      435.2     435.0
      vp9_put_8tap_smooth_4v_10bpp_neon:       107.0      98.2
      vp9_put_8tap_smooth_8h_10bpp_neon:       303.0     290.0
      vp9_put_8tap_smooth_8hv_10bpp_neon:      893.7     828.7
      vp9_put_8tap_smooth_8v_10bpp_neon:       305.5     263.5
      vp9_put_8tap_smooth_16h_10bpp_neon:     1089.1    1059.2
      vp9_put_8tap_smooth_16hv_10bpp_neon:    2578.8    2452.4
      vp9_put_8tap_smooth_16v_10bpp_neon:     1009.5     933.5
      vp9_put_8tap_smooth_64h_10bpp_neon:    16223.4   15918.6
      vp9_put_8tap_smooth_64hv_10bpp_neon:   32153.0   31016.2
      vp9_put_8tap_smooth_64v_10bpp_neon:    14516.5   13748.1
      
      These are generally about as fast as the corresponding ARM
      routines on the same CPU (at least on the A53), in most cases
      marginally faster.
      
      The speedup vs C code is around 4-9x.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: Add NEON optimizations for 10 and 12 bit vp9 MC · a4d4bad7
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The plain pixel put/copy functions are reused from the 8 bit version
      at double the width (e.g. put16 uses ff_vp9_copy32_neon), and a new
      copy128 is added.
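
The reuse described above can be sketched in plain C, assuming high bit depth pixels are stored as 16 bit values: a row of 16 such pixels is 32 bytes, so a byte-oriented copy of width 32 does the job of put16. The names `copy_bytes` and `put_hbd` are hypothetical and used only for illustration; they are not FFmpeg functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-oriented block copy, standing in for an 8 bit
 * copy routine. Strides are given in bytes. */
static void copy_bytes(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       size_t row_bytes, int h)
{
    assert(h > 0);
    for (int y = 0; y < h; y++) {
        memcpy(dst, src, row_bytes);
        dst += dst_stride;
        src += src_stride;
    }
}

/* "put" for w pixels stored as uint16_t (assumption: 2 bytes per
 * pixel): forward to the byte copy with twice the width. */
static void put_hbd(uint16_t *dst, ptrdiff_t dst_stride,
                    const uint16_t *src, ptrdiff_t src_stride,
                    int w, int h)
{
    copy_bytes((uint8_t *)dst, dst_stride,
               (const uint8_t *)src, src_stride,
               (size_t)w * sizeof(uint16_t), h);
}
```

Under this assumption the widest block (64 pixels) needs a 128 byte copy, which is consistent with the new copy128 mentioned above.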
      
      Compared with the 8 bit version, the filters can no longer use the
      trick to accumulate in 16 bit with only saturation at the end, but now
      the accumulators need to be 32 bit. This avoids the need to keep track
      of which filter index is the largest though, reducing the size of the
      executable code for these filters.
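
To illustrate why the accumulators have to be 32 bit, here is a scalar C sketch of a single 8-tap output sample. It is not the NEON code from this commit. It assumes coefficients that sum to 128 with rounding by `(sum + 64) >> 7`, and the function names are hypothetical. With 10 bit input, a single product such as 1023 * 64 = 65472 already exceeds the signed 16 bit range.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a filtered value to the valid pixel range for the bit depth. */
static uint16_t clip_pixel(int32_t v, int bpp)
{
    int32_t max = (1 << bpp) - 1;
    return (uint16_t)(v < 0 ? 0 : v > max ? max : v);
}

/* One output sample of an 8-tap filter. src points at the current
 * pixel; the taps cover src[-3] .. src[4]. The sum is kept in 32 bit,
 * so no ordering or saturation trick is needed while accumulating. */
static uint16_t filter_8tap(const uint16_t *src, const int16_t coeffs[8],
                            int bpp)
{
    assert(bpp == 10 || bpp == 12);
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)src[i - 3] * coeffs[i];
    /* Assumes an arithmetic right shift for negative sums, as on
     * common compilers. */
    return clip_pixel((sum + 64) >> 7, bpp);
}
```

Since every tap is accumulated the same way, the order in which the taps are applied does not matter, which matches the note above about no longer tracking the largest filter index.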
      
      For the horizontal filters, we only do 4 or 8 pixels wide in parallel
      (while doing two rows at a time), since we don't have enough register
      space to filter 16 pixels wide.
      
      For the vertical filters, we still do 4 and 8 pixels in parallel just
      as in the 8 bit case, but we need to store the output after every 2
      rows instead of after every 4 rows.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
      vp9_avg4_10bpp_neon:                   2.25   2.44   3.05   2.16
      vp9_avg8_10bpp_neon:                   3.66   8.48   3.86   3.50
      vp9_avg16_10bpp_neon:                  3.39   8.26   3.37   2.72
      vp9_avg32_10bpp_neon:                  4.03  10.20   4.07   3.42
      vp9_avg64_10bpp_neon:                  4.15  10.01   4.13   3.70
      vp9_avg_8tap_smooth_4h_10bpp_neon:     3.38   6.22   3.41   4.75
      vp9_avg_8tap_smooth_4hv_10bpp_neon:    3.89   6.39   4.30   5.32
      vp9_avg_8tap_smooth_4v_10bpp_neon:     5.32   9.73   6.34   7.31
      vp9_avg_8tap_smooth_8h_10bpp_neon:     4.45   9.40   4.68   6.87
      vp9_avg_8tap_smooth_8hv_10bpp_neon:    4.64   8.91   5.44   6.47
      vp9_avg_8tap_smooth_8v_10bpp_neon:     6.44  13.42   8.68   8.79
      vp9_avg_8tap_smooth_64h_10bpp_neon:    4.66   9.02   4.84   7.71
      vp9_avg_8tap_smooth_64hv_10bpp_neon:   4.61   9.14   4.92   7.10
      vp9_avg_8tap_smooth_64v_10bpp_neon:    6.90  14.13   9.57  10.41
      vp9_put4_10bpp_neon:                   1.33   1.46   2.09   1.33
      vp9_put8_10bpp_neon:                   1.57   3.42   1.83   1.84
      vp9_put16_10bpp_neon:                  1.55   4.78   2.17   1.89
      vp9_put32_10bpp_neon:                  2.06   5.35   2.14   2.30
      vp9_put64_10bpp_neon:                  3.00   2.41   1.95   1.66
      vp9_put_8tap_smooth_4h_10bpp_neon:     3.19   5.81   3.31   4.63
      vp9_put_8tap_smooth_4hv_10bpp_neon:    3.86   6.22   4.32   5.21
      vp9_put_8tap_smooth_4v_10bpp_neon:     5.40   9.77   6.08   7.21
      vp9_put_8tap_smooth_8h_10bpp_neon:     4.22   8.41   4.46   6.63
      vp9_put_8tap_smooth_8hv_10bpp_neon:    4.56   8.51   5.39   6.25
      vp9_put_8tap_smooth_8v_10bpp_neon:     6.60  12.43   8.17   8.89
      vp9_put_8tap_smooth_64h_10bpp_neon:    4.41   8.59   4.54   7.49
      vp9_put_8tap_smooth_64hv_10bpp_neon:   4.43   8.58   5.34   6.63
      vp9_put_8tap_smooth_64v_10bpp_neon:    7.26  13.92   9.27  10.92
      
      For the larger 8tap filters, the speedup vs C code is around 4-14x.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  4. 05 May, 2013 1 commit
  5. 19 Mar, 2011 1 commit
  6. 21 Mar, 2009 1 commit
  7. 26 Feb, 2009 1 commit
  8. 17 Feb, 2009 1 commit
  9. 31 Aug, 2008 1 commit
  10. 23 Aug, 2008 2 commits
  11. 17 Aug, 2008 1 commit
  12. 30 Oct, 2007 1 commit
  13. 17 Oct, 2007 1 commit
  14. 17 Jun, 2007 2 commits
  15. 16 Jun, 2007 1 commit
  16. 19 Mar, 2007 1 commit
  17. 28 Feb, 2007 1 commit
  18. 07 Oct, 2006 1 commit
  19. 12 Jan, 2006 1 commit
  20. 17 Dec, 2005 1 commit
  21. 25 Oct, 2003 1 commit
  22. 23 Oct, 2003 1 commit
  23. 22 Oct, 2003 1 commit
  24. 03 Mar, 2003 1 commit
  25. 11 Feb, 2003 1 commit
  26. 20 Nov, 2002 1 commit
  27. 19 Nov, 2002 2 commits
  28. 25 Oct, 2002 1 commit
  29. 06 Oct, 2002 1 commit
  30. 25 May, 2002 1 commit
  31. 13 Aug, 2001 1 commit