1. 27 Feb, 2019 1 commit
  2. 20 Feb, 2019 2 commits
  3. 19 Feb, 2019 19 commits
    • aarch64: vp8: Move the vp8dsp makefile entries to the right places · c8bc9d13
      Martin Storsjö authored
      Even if NEON would be disabled, the init functions should be built
      as they are called as long as ARCH_AARCH64 is set.
      
      These functions are part of a generic DSP subsystem, not tied directly
      to one decoder. (They should be built if the vp7 decoder is enabled,
      even if the vp8 decoder is disabled.)
      Signed-off-by: Martin Storsjö <martin@martin.st>
      (cherry picked from commit b4b27dce)
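The build dependency described in this commit can be illustrated with a small C sketch. It is not the FFmpeg code: `DSPContext`, `sum4_c`, `dsp_init` and `dsp_init_aarch64` are hypothetical names, and the two macros stand in for configure results. The point is only that the generic init references the architecture init whenever `ARCH_AARCH64` is set, so that function has to be compiled even when NEON is disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical configure results: AArch64 target, NEON disabled. */
#define ARCH_AARCH64 1
#define HAVE_NEON    0

typedef struct DSPContext {
    int (*sum4)(const int *v);   /* placeholder DSP entry point */
} DSPContext;

/* Portable C fallback. */
static int sum4_c(const int *v)
{
    return v[0] + v[1] + v[2] + v[3];
}

#if HAVE_NEON
/* Stand-in for an assembly routine; only exists in NEON builds. */
static int sum4_neon(const int *v)
{
    return v[0] + v[1] + v[2] + v[3];
}
#endif

/* Must be compiled whenever ARCH_AARCH64 is set, because dsp_init()
 * below references it on that architecture regardless of HAVE_NEON. */
static void dsp_init_aarch64(DSPContext *c)
{
    assert(c != NULL);
#if HAVE_NEON
    c->sum4 = sum4_neon;
#endif
}

/* Generic init: install the C version, then let the arch layer override. */
static void dsp_init(DSPContext *c)
{
    assert(c != NULL);
    c->sum4 = sum4_c;
    if (ARCH_AARCH64)
        dsp_init_aarch64(c);
}
```

With `HAVE_NEON` set to 0 the context keeps the C fallback, yet the call to `dsp_init_aarch64` still has to link, which is why the makefile entry for the init file cannot be gated on NEON.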
    • aarch64: vp8: Remove superfluous includes · fecf75a5
      Martin Storsjö authored
      This fixes building with MSVC, which lacks unistd.h.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      (cherry picked from commit ad32f7b1)
    • aarch64: vp8: Fix assembling with armasm64 · 7ddfa5e9
      Martin Storsjö authored
      Signed-off-by: Martin Storsjö <martin@martin.st>
      (cherry picked from commit 2eeac799)
    • aarch64: vp8: Fix assembling with clang · c950beb6
      Martin Storsjö authored
      This also partially fixes assembling with MS armasm64 (via
      gas-preprocessor).
      
      The movrel macro invocations need to pass the offset via a separate
      parameter. Mach-O and COFF relocations don't allow a negative
      offset to a symbol, which is handled properly if the offset is passed
      via the parameter. If no offset parameter is given, the macro
      evaluates to something like "adrp x17, subpel_filters-16+(0)", which
      older clang versions also fail to parse (the older clang versions
      only support one single offset term, although it can be a parenthesis).
      Signed-off-by: Martin Storsjö <martin@martin.st>
      (cherry picked from commit 26d7af4c)
    • aarch64: vp8: Optimize vp8_idct_add_neon for aarch64 · 7e42d5f0
      Martin Storsjö authored
      The previous version was a pretty exact translation of the arm
      version. This version does some unnecessary arithmetic (it does
      more operations on vectors that are only half filled; it does 4
      uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
      of packing data together (which could be done for free in the arm
      version).
      
      This gives a decent speedup on Cortex A53, a minor speedup on
      A72 and a very minor slowdown on Cortex A73.
      
      Before:        Cortex A53    A72    A73
      vp8_idct_add_neon:   79.7   67.5   65.0
      After:
      vp8_idct_add_neon:   67.7   64.8   66.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon · 49f9c427
      Martin Storsjö authored
      The original arm version didn't do saturation here. This probably
      doesn't make any difference for performance, but reduces the
      differences.
      Signed-off-by: Martin Storsjö <martin@martin.st>
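To make the difference between a saturating and a plain narrowing shift concrete, here is a scalar C model of the two operations on a single 32-bit lane. It is a sketch, not the assembly from the commit: `asr32`, `shrn_s32` and `sqshrn_s32` are hypothetical helper names, and the shift amounts used to exercise them are assumptions. One fact it does show is that when a 32-bit value is shifted right by 16 before narrowing to 16 bits, the result always fits, so saturation can never trigger and both variants agree.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right (floor division by 2^shift). */
static int32_t asr32(int32_t x, unsigned shift)
{
    assert(shift < 32);
    return x < 0 ? ~(~x >> shift) : x >> shift;
}

/* Truncating narrow: keep only the low 16 bits of the shifted value. */
static int16_t shrn_s32(int32_t x, unsigned shift)
{
    int32_t low = asr32(x, shift) & 0xFFFF;
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}

/* Saturating narrow: clamp the shifted value to the int16_t range. */
static int16_t sqshrn_s32(int32_t x, unsigned shift)
{
    int32_t v = asr32(x, shift);
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}
```

For a shift of 16 the two functions return the same value for every input, including `INT32_MAX` and `INT32_MIN`. With a smaller shift such as 8, an input of `0x00800000` wraps to -32768 in the truncating form and clamps to 32767 in the saturating form.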
    • aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 · 37394ef0
      Martin Storsjö authored
      This makes it similar to put_epel16_v6, and gives a large speedup
      on Cortex A53, a minor speedup on A72 and a very minor slowdown on
      A73.
      
      Before:                 Cortex A53     A72     A73
      vp8_put_epel16_h6v6_neon:   2211.4  1586.5  1431.7
      After:
      vp8_put_epel16_h6v6_neon:   1736.9  1522.0  1448.1
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Port bilin functions from arm version · e39a9212
      Martin Storsjö authored
                            Cortex A53     A72     A73
      vp8_put_bilin4_h_c:        303.8   102.2   161.8
      vp8_put_bilin4_h_neon:     100.0    40.9    41.2
      vp8_put_bilin4_hv_c:       322.8   201.0   305.9
      vp8_put_bilin4_hv_neon:    156.8    72.6    77.0
      vp8_put_bilin4_v_c:        304.7   101.7   166.5
      vp8_put_bilin4_v_neon:      82.7    41.2    33.0
      vp8_put_bilin8_h_c:       1192.7   352.5   623.8
      vp8_put_bilin8_h_neon:     213.5    70.2    87.8
      vp8_put_bilin8_hv_c:      1098.6   769.2  1041.9
      vp8_put_bilin8_hv_neon:    324.0   123.5   146.0
      vp8_put_bilin8_v_c:       1193.9   350.4   617.7
      vp8_put_bilin8_v_neon:     183.9    60.7    64.7
      vp8_put_bilin16_h_c:      2353.1   671.2  1223.3
      vp8_put_bilin16_h_neon:    261.9   140.7   145.0
      vp8_put_bilin16_hv_c:     2453.2  1470.9  2355.2
      vp8_put_bilin16_hv_neon:   383.9   196.0   217.0
      vp8_put_bilin16_v_c:      2349.3   669.8  1251.2
      vp8_put_bilin16_v_neon:    202.9   110.7    96.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Port epel4 functions from arm version · 58d15492
      Martin Storsjö authored
                            Cortex A53    A72    A73
      vp8_put_epel4_h4_c:        631.4  291.7  367.8
      vp8_put_epel4_h4_neon:     241.0  131.0  155.7
      vp8_put_epel4_h4v4_c:      967.5  529.3  667.7
      vp8_put_epel4_h4v4_neon:   429.3  241.8  279.7
      vp8_put_epel4_h4v6_c:     1374.7  657.5  864.5
      vp8_put_epel4_h4v6_neon:   515.5  295.5  334.7
      vp8_put_epel4_h6_c:        851.0  421.0  486.0
      vp8_put_epel4_h6_neon:     321.5  195.0  217.7
      vp8_put_epel4_h6v4_c:     1111.3  621.1  781.2
      vp8_put_epel4_h6v4_neon:   539.2  328.0  365.3
      vp8_put_epel4_h6v6_c:     1561.3  763.3  999.7
      vp8_put_epel4_h6v6_neon:   645.5  401.0  434.7
      vp8_put_epel4_v4_c:        663.8  298.3  357.0
      vp8_put_epel4_v4_neon:     116.0   81.5   72.5
      vp8_put_epel4_v6_c:        870.5  437.0  507.4
      vp8_put_epel4_v6_neon:     147.7  108.8   92.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Port missing epel8 functions from arm version · cc7ba00c
      Martin Storsjö authored
                            Cortex A53     A72     A73
      vp8_put_epel8_h4_c:       2594.8  1159.6  1374.8
      vp8_put_epel8_h4_neon:     506.4   244.2   314.0
      vp8_put_epel8_h6_c:       3445.8  1677.1  1811.3
      vp8_put_epel8_h6_neon:     634.4   371.7   433.0
      vp8_put_epel8_v4_c:       2614.0  1174.8  1378.0
      vp8_put_epel8_v4_neon:     321.0   221.7   235.8
      vp8_put_epel8_v6_c:       3635.5  1703.0  2079.2
      vp8_put_epel8_v6_neon:     416.9   317.0   295.5
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version · 52c9b0a6
      Martin Storsjö authored
                           Cortex A53    A72    A73
      vp8_luma_dc_wht_c:        115.7   75.7   90.7
      vp8_luma_dc_wht_neon:      60.7   41.2   45.7
      vp8_idct_dc_add4uv_c:     376.1  262.9  282.5
      vp8_idct_dc_add4uv_neon:   52.0   29.0   37.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • Martin Storsjö's avatar
      c513fcd7
    • Martin Storsjö's avatar
    • aarch64: vp8: Move the vp8dsp makefile entries to the right places · b4b27dce
      Martin Storsjö authored
      Even if NEON would be disabled, the init functions should be built
      as they are called as long as ARCH_AARCH64 is set.
      
      These functions are part of a generic DSP subsystem, not tied directly
      to one decoder. (They should be built if the vp7 decoder is enabled,
      even if the vp8 decoder is disabled.)
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Remove superfluous includes · ad32f7b1
      Martin Storsjö authored
      This fixes building with MSVC, which lacks unistd.h.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp8: Use the proper aarch64 form for conditional branches · 85bfaa49
      Martin Storsjö authored
      The previous form also does seem to assemble on current tools,
      but I think it might fail on some older aarch64 tools.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • Martin Storsjö's avatar
      2eeac799
    • aarch64: vp8: Fix assembling with clang · 26d7af4c
      Martin Storsjö authored
      This also partially fixes assembling with MS armasm64 (via
      gas-preprocessor).
      
      The movrel macro invocations need to pass the offset via a separate
      parameter. Mach-O and COFF relocations don't allow a negative
      offset to a symbol, which is handled properly if the offset is passed
      via the parameter. If no offset parameter is given, the macro
      evaluates to something like "adrp x17, subpel_filters-16+(0)", which
      older clang versions also fail to parse (the older clang versions
      only support one single offset term, although it can be a parenthesis).
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • libavcodec: vp8 neon optimizations for aarch64 · 0801853e
      Magnus Röös authored
      Partial port of the ARM Neon for aarch64.
      
      Benchmarks from fate:
      
      benchmarking with Linux Perf Monitoring API
      nop: 58.6
      checkasm: using random seed 1760970128
      NEON:
       - vp8dsp.idct       [OK]
       - vp8dsp.mc         [OK]
       - vp8dsp.loopfilter [OK]
      checkasm: all 21 tests passed
      vp8_idct_add_c: 201.6
      vp8_idct_add_neon: 83.1
      vp8_idct_dc_add_c: 107.6
      vp8_idct_dc_add_neon: 33.8
      vp8_idct_dc_add4y_c: 426.4
      vp8_idct_dc_add4y_neon: 59.4
      vp8_loop_filter8uv_h_c: 688.1
      vp8_loop_filter8uv_h_neon: 216.3
      vp8_loop_filter8uv_inner_h_c: 649.3
      vp8_loop_filter8uv_inner_h_neon: 195.3
      vp8_loop_filter8uv_inner_v_c: 544.8
      vp8_loop_filter8uv_inner_v_neon: 131.3
      vp8_loop_filter8uv_v_c: 706.1
      vp8_loop_filter8uv_v_neon: 141.1
      vp8_loop_filter16y_h_c: 668.8
      vp8_loop_filter16y_h_neon: 242.8
      vp8_loop_filter16y_inner_h_c: 647.3
      vp8_loop_filter16y_inner_h_neon: 224.6
      vp8_loop_filter16y_inner_v_c: 647.8
      vp8_loop_filter16y_inner_v_neon: 128.8
      vp8_loop_filter16y_v_c: 721.8
      vp8_loop_filter16y_v_neon: 154.3
      vp8_loop_filter_simple_h_c: 387.8
      vp8_loop_filter_simple_h_neon: 187.6
      vp8_loop_filter_simple_v_c: 384.1
      vp8_loop_filter_simple_v_neon: 78.6
      vp8_put_epel8_h4v4_c: 3971.1
      vp8_put_epel8_h4v4_neon: 855.1
      vp8_put_epel8_h4v6_c: 5060.1
      vp8_put_epel8_h4v6_neon: 989.6
      vp8_put_epel8_h6v4_c: 4320.8
      vp8_put_epel8_h6v4_neon: 1007.3
      vp8_put_epel8_h6v6_c: 5449.3
      vp8_put_epel8_h6v6_neon: 1158.1
      vp8_put_epel16_h6_c: 6683.8
      vp8_put_epel16_h6_neon: 831.8
      vp8_put_epel16_h6v6_c: 11110.8
      vp8_put_epel16_h6v6_neon: 2214.8
      vp8_put_epel16_v6_c: 7024.8
      vp8_put_epel16_v6_neon: 799.6
      vp8_put_pixels8_c: 112.8
      vp8_put_pixels8_neon: 78.1
      vp8_put_pixels16_c: 131.3
      vp8_put_pixels16_neon: 129.8
      
      This contains a fix to include guards by Carl Eugen Hoyos.
      Signed-off-by: Martin Storsjö <martin@martin.st>
  4. 31 Jan, 2019 2 commits
    • lavc/aarch64/vp8dsp: Fix the include guard. · ed20fbcd
      Carl Eugen Hoyos authored
      Fixes fate-source.
    • libavcodec: vp8 neon optimizations for aarch64 · 833fed52
      Magnus Röös authored
      Partial port of the ARM Neon for aarch64.
      
      Benchmarks from fate:
      
      benchmarking with Linux Perf Monitoring API
      nop: 58.6
      checkasm: using random seed 1760970128
      NEON:
       - vp8dsp.idct       [OK]
       - vp8dsp.mc         [OK]
       - vp8dsp.loopfilter [OK]
      checkasm: all 21 tests passed
      vp8_idct_add_c: 201.6
      vp8_idct_add_neon: 83.1
      vp8_idct_dc_add_c: 107.6
      vp8_idct_dc_add_neon: 33.8
      vp8_idct_dc_add4y_c: 426.4
      vp8_idct_dc_add4y_neon: 59.4
      vp8_loop_filter8uv_h_c: 688.1
      vp8_loop_filter8uv_h_neon: 216.3
      vp8_loop_filter8uv_inner_h_c: 649.3
      vp8_loop_filter8uv_inner_h_neon: 195.3
      vp8_loop_filter8uv_inner_v_c: 544.8
      vp8_loop_filter8uv_inner_v_neon: 131.3
      vp8_loop_filter8uv_v_c: 706.1
      vp8_loop_filter8uv_v_neon: 141.1
      vp8_loop_filter16y_h_c: 668.8
      vp8_loop_filter16y_h_neon: 242.8
      vp8_loop_filter16y_inner_h_c: 647.3
      vp8_loop_filter16y_inner_h_neon: 224.6
      vp8_loop_filter16y_inner_v_c: 647.8
      vp8_loop_filter16y_inner_v_neon: 128.8
      vp8_loop_filter16y_v_c: 721.8
      vp8_loop_filter16y_v_neon: 154.3
      vp8_loop_filter_simple_h_c: 387.8
      vp8_loop_filter_simple_h_neon: 187.6
      vp8_loop_filter_simple_v_c: 384.1
      vp8_loop_filter_simple_v_neon: 78.6
      vp8_put_epel8_h4v4_c: 3971.1
      vp8_put_epel8_h4v4_neon: 855.1
      vp8_put_epel8_h4v6_c: 5060.1
      vp8_put_epel8_h4v6_neon: 989.6
      vp8_put_epel8_h6v4_c: 4320.8
      vp8_put_epel8_h6v4_neon: 1007.3
      vp8_put_epel8_h6v6_c: 5449.3
      vp8_put_epel8_h6v6_neon: 1158.1
      vp8_put_epel16_h6_c: 6683.8
      vp8_put_epel16_h6_neon: 831.8
      vp8_put_epel16_h6v6_c: 11110.8
      vp8_put_epel16_h6v6_neon: 2214.8
      vp8_put_epel16_v6_c: 7024.8
      vp8_put_epel16_v6_neon: 799.6
      vp8_put_pixels8_c: 112.8
      vp8_put_pixels8_neon: 78.1
      vp8_put_pixels16_c: 131.3
      vp8_put_pixels16_neon: 129.8
      Signed-off-by: Magnus Röös <mla2.roos@gmail.com>
  5. 26 Jan, 2019 3 commits
    • h264/aarch64: add intra loop filter neon asm · 28a8b541
      Janne Grunau authored
      Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported
      (x264 uses nv12 chroma) and optimized.
      
      Cycle count for checkasm --bench on a Snapdragon 820e:
      h264_h_loop_filter_luma_intra_8bpp_c: 60.0
      h264_h_loop_filter_luma_intra_8bpp_neon: 54.2
      h264_v_loop_filter_luma_intra_8bpp_c: 148.3
      h264_v_loop_filter_luma_intra_8bpp_neon: 73.8
      h264_h_loop_filter_chroma_intra_8bpp_c: 27.8
      h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4
      h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8
      h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7
      h264_v_loop_filter_chroma_intra_8bpp_c: 45.8
      h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3
    • h264/aarch64: optimize neon loop filter · 846c3d6a
      Janne Grunau authored
      Exit as soon as possible if no filtering will be done.
      
      Improves the checkasm --bench cycle count on a Snapdragon 820e:
      h264_h_loop_filter_luma_8bpp_c:      72.4 ->  72.5
      h264_h_loop_filter_luma_8bpp_neon:   97.1 ->  56.3
      h264_v_loop_filter_luma_8bpp_c:     174.0 -> 173.5
      h264_v_loop_filter_luma_8bpp_neon:   62.9 ->  60.9
      h264_h_loop_filter_chroma_8bpp_c:    30.2 ->  30.3
      h264_h_loop_filter_chroma_8bpp_neon: 51.6 ->  25.7
      h264_v_loop_filter_chroma_8bpp_c:    57.3 ->  57.3
      h264_v_loop_filter_chroma_8bpp_neon: 28.0 ->  24.0
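The "exit as soon as possible" idea can be sketched in scalar C. This is an illustration under assumptions, not the NEON code from the commit: `edge_needs_filtering` and `filter_edge` are hypothetical names, the per-pixel filter itself is omitted, and the skip conditions (a zero `alpha` or `beta` threshold, or a negative `tc0` for all four segments of the edge) reflect how H.264 deblocking parameters are commonly passed rather than anything stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if at least one of the four edge segments may be modified. */
static int edge_needs_filtering(int alpha, int beta, const int8_t tc0[4])
{
    assert(tc0 != NULL);
    if (alpha == 0 || beta == 0)
        return 0;                 /* thresholds can never be satisfied */
    for (int i = 0; i < 4; i++)
        if (tc0[i] >= 0)
            return 1;             /* this segment is enabled */
    return 0;                     /* every segment is disabled */
}

/* Hypothetical wrapper: returns how many segments were handed to the
 * (omitted) filter. The early return avoids touching any pixel data. */
static int filter_edge(uint8_t *pix, int alpha, int beta, const int8_t tc0[4])
{
    int filtered = 0;

    if (!edge_needs_filtering(alpha, beta, tc0))
        return 0;

    for (int i = 0; i < 4; i++) {
        if (tc0[i] < 0)
            continue;             /* skip disabled segments */
        (void)pix;                /* the real per-pixel filter would run here */
        filtered++;
    }
    return filtered;
}
```

Doing the cheap parameter check before any loads means an edge that needs no work costs only a few instructions, which would be consistent with the largest gains in the table above appearing in the horizontal filters.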
    • Janne Grunau's avatar
      bb515e3a
  6. 03 Jan, 2019 1 commit
  7. 13 Jul, 2018 1 commit
    • lavc/aarch64/h264dsp_init_aarch64: Fix weight function prototypes. · 0576ef46
      Carl Eugen Hoyos authored
      Fixes the following warnings:
      libavcodec/aarch64/h264dsp_init_aarch64.c: In function ‘ff_h264dsp_init_aarch64’:
      libavcodec/aarch64/h264dsp_init_aarch64.c:84:38: warning: assignment from incompatible pointer type [enabled by default]
               c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels_16_neon;
                                            ^
      libavcodec/aarch64/h264dsp_init_aarch64.c:85:38: warning: assignment from incompatible pointer type [enabled by default]
               c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels_8_neon;
                                            ^
      libavcodec/aarch64/h264dsp_init_aarch64.c:86:38: warning: assignment from incompatible pointer type [enabled by default]
               c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels_4_neon;
                                            ^
      libavcodec/aarch64/h264dsp_init_aarch64.c:88:40: warning: assignment from incompatible pointer type [enabled by default]
               c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels_16_neon;
                                              ^
      libavcodec/aarch64/h264dsp_init_aarch64.c:89:40: warning: assignment from incompatible pointer type [enabled by default]
               c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
                                              ^
      libavcodec/aarch64/h264dsp_init_aarch64.c:90:40: warning: assignment from incompatible pointer type [enabled by default]
               c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
                                              ^
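The warnings above arise when a function's declared prototype does not match the function pointer type of the table slot it is assigned to. The commit message does not show the corrected prototypes, so the following is only a sketch of the general remedy: `weight_fn`, `WeightContext`, `weight_pixels_16`, `weight_init` and the parameter list are all assumptions, and the function body is a trivial stand-in rather than real weighted prediction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pointer type of a table slot (parameter list assumed). */
typedef void (*weight_fn)(uint8_t *block, ptrdiff_t stride, int height,
                          int log2_denom, int weight, int offset);

typedef struct WeightContext {
    weight_fn weight_tab[3];
} WeightContext;

/* The definition matches weight_fn parameter for parameter, so the
 * assignment below is warning-free. Declaring e.g. `int stride` instead
 * of `ptrdiff_t stride` would make the types incompatible. */
static void weight_pixels_16(uint8_t *block, ptrdiff_t stride, int height,
                             int log2_denom, int weight, int offset)
{
    (void)log2_denom;   /* unused in this stand-in body */
    (void)weight;
    for (int y = 0; y < height; y++, block += stride) {
        for (int x = 0; x < 16; x++) {
            int v = block[x] + offset;          /* stand-in arithmetic */
            block[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}

static void weight_init(WeightContext *c)
{
    assert(c != NULL);
    c->weight_tab[0] = weight_pixels_16;  /* types agree exactly */
    c->weight_tab[1] = NULL;
    c->weight_tab[2] = NULL;
}
```

Keeping the declaration and the table type in sync matters beyond silencing the compiler: calling through a pointer of an incompatible type is undefined behavior in C, and a mismatch in parameter width can change how arguments are passed to hand-written assembly.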
  8. 26 Jan, 2018 1 commit
  9. 18 Oct, 2017 1 commit
  10. 03 Jul, 2017 1 commit
    • lavc/aarch64: add sbrdsp neon implementation · 0a24d7ca
      Matthieu Bouron authored
      autocorrelate_c: 644.0
      autocorrelate_neon: 420.0
      hf_apply_noise_0_c: 1688.5
      hf_apply_noise_0_neon: 1498.6
      hf_apply_noise_1_c: 1691.2
      hf_apply_noise_1_neon: 1500.6
      hf_apply_noise_2_c: 1688.1
      hf_apply_noise_2_neon: 1500.3
      hf_apply_noise_3_c: 1696.6
      hf_apply_noise_3_neon: 1502.2
      hf_g_filt_c: 2117.8
      hf_g_filt_neon: 1218.7
      hf_gen_c: 4573.4
      hf_gen_neon: 2461.0
      neg_odd_64_c: 72.0
      neg_odd_64_neon: 64.7
      qmf_deint_bfly_c: 1107.6
      qmf_deint_bfly_neon: 291.6
      qmf_deint_neg_c: 210.4
      qmf_deint_neg_neon: 107.4
      qmf_post_shuffle_c: 163.0
      qmf_post_shuffle_neon: 107.7
      qmf_pre_shuffle_c: 120.5
      qmf_pre_shuffle_neon: 110.7
      sum64x5_c: 1361.6
      sum64x5_neon: 435.4
      sum_square_c: 1686.4
      sum_square_neon: 787.2
  11. 28 Jun, 2017 2 commits
    • Clément Bœsch's avatar
    • lavc/aarch64: add a few SIMD functions for AAC PS · ff0ecef6
      Clément Bœsch authored
      ☭ tests/checkasm/checkasm --bench --test=aacpsdsp
      checkasm: using random seed 3318985180
      MMX implied by specified flags
      MMX implied by specified flags
      NEON:
       - aacpsdsp.add_squares        [OK]
       - aacpsdsp.mul_pair_single    [OK]
       - aacpsdsp.hybrid_analysis    [OK]
       - aacpsdsp.stereo_interpolate [OK]
      checkasm: all 5 tests passed
      nop: 10.0
      ps_add_squares_c: 63221.2
      ps_add_squares_neon: 22311.7
      ps_hybrid_analysis_c: 2466.6
      ps_hybrid_analysis_neon: 1521.9
      ps_mul_pair_single_c: 68592.0
      ps_mul_pair_single_neon: 17426.6
      ps_stereo_interpolate_c: 72344.3
      ps_stereo_interpolate_neon: 72308.8
      ps_stereo_interpolate_ipdopd_c: 117415.2
      ps_stereo_interpolate_ipdopd_neon: 113386.3
  12. 21 Jun, 2017 2 commits
  13. 20 Jun, 2017 1 commit
  14. 14 Jun, 2017 1 commit
  15. 13 Jun, 2017 1 commit
  16. 11 May, 2017 1 commit