1. 21 Mar, 2019 1 commit
  2. 27 Feb, 2019 2 commits
  3. 19 Feb, 2019 17 commits
    • aarch64: vp8: Optimize vp8_idct_add_neon for aarch64 · 7e42d5f0
      Martin Storsjö authored
      The previous version was a pretty exact translation of the arm
      version. This version does do some unnecessary arithmetic (it does
      more operations on vectors that are only half filled; it does 4
      uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
      of packing data together (which could be done for free in the arm
      version).
      
      This gives a decent speedup on Cortex A53, a minor speedup on
      A72 and a very minor slowdown on Cortex A73.
      
      Before:        Cortex A53    A72    A73
      vp8_idct_add_neon:   79.7   67.5   65.0
      After:
      vp8_idct_add_neon:   67.7   64.8   66.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
      7e42d5f0
    • aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon · 49f9c427
      Martin Storsjö authored
      The original arm version didn't do saturation here. This probably
      doesn't make any difference for performance, but reduces the
      differences.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      49f9c427
    • aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 · 37394ef0
      Martin Storsjö authored
      This makes it similar to put_epel16_v6, and gives a large speedup
      on Cortex A53, a minor speedup on A72 and a very minor slowdown on
      A73.
      
      Before:                 Cortex A53     A72     A73
      vp8_put_epel16_h6v6_neon:   2211.4  1586.5  1431.7
      After:
      vp8_put_epel16_h6v6_neon:   1736.9  1522.0  1448.1
      Signed-off-by: Martin Storsjö <martin@martin.st>
      37394ef0
    • arm: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 · cef914e0
      Martin Storsjö authored
      This makes it similar to put_epel16_v6, and gives a 10-25%
      speedup of this function.
      
      Before:                   Cortex A7       A8       A9      A53     A72
      vp8_put_epel16_h6v6_neon:    3058.0   2218.5   2459.8   2183.0  1572.2
      After:
      vp8_put_epel16_h6v6_neon:    2670.8   1934.2   2244.4   1729.4  1503.9
      Signed-off-by: Martin Storsjö <martin@martin.st>
      cef914e0
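How the before/after cycle counts map to a percentage depends on the definition of "speedup". The following is a minimal sketch that uses only the numbers quoted in the commit message above; the two definitions and the function names are assumptions made for illustration, not something the commit specifies.

```python
# Cycle counts copied from the commit message above (lower is better).
CORES = ["Cortex A7", "A8", "A9", "A53", "A72"]
BEFORE = [3058.0, 2218.5, 2459.8, 2183.0, 1572.2]
AFTER = [2670.8, 1934.2, 2244.4, 1729.4, 1503.9]


def speedup_percent(before, after):
    """Speedup as a percentage, defined here as before/after - 1."""
    return (before / after - 1.0) * 100.0


def cycles_saved_percent(before, after):
    """Share of cycles removed, defined here as 1 - after/before."""
    return (1.0 - after / before) * 100.0


for core, b, a in zip(CORES, BEFORE, AFTER):
    print(f"{core:>9}: speedup {speedup_percent(b, a):5.1f}%, "
          f"cycles saved {cycles_saved_percent(b, a):5.1f}%")
```

The two definitions give different percentages for the same pair of measurements, so it helps to state which one a quoted range refers to.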
    • aarch64: vp8: Port bilin functions from arm version · e39a9212
      Martin Storsjö authored
                            Cortex A53     A72     A73
      vp8_put_bilin4_h_c:        303.8   102.2   161.8
      vp8_put_bilin4_h_neon:     100.0    40.9    41.2
      vp8_put_bilin4_hv_c:       322.8   201.0   305.9
      vp8_put_bilin4_hv_neon:    156.8    72.6    77.0
      vp8_put_bilin4_v_c:        304.7   101.7   166.5
      vp8_put_bilin4_v_neon:      82.7    41.2    33.0
      vp8_put_bilin8_h_c:       1192.7   352.5   623.8
      vp8_put_bilin8_h_neon:     213.5    70.2    87.8
      vp8_put_bilin8_hv_c:      1098.6   769.2  1041.9
      vp8_put_bilin8_hv_neon:    324.0   123.5   146.0
      vp8_put_bilin8_v_c:       1193.9   350.4   617.7
      vp8_put_bilin8_v_neon:     183.9    60.7    64.7
      vp8_put_bilin16_h_c:      2353.1   671.2  1223.3
      vp8_put_bilin16_h_neon:    261.9   140.7   145.0
      vp8_put_bilin16_hv_c:     2453.2  1470.9  2355.2
      vp8_put_bilin16_hv_neon:   383.9   196.0   217.0
      vp8_put_bilin16_v_c:      2349.3   669.8  1251.2
      vp8_put_bilin16_v_neon:    202.9   110.7    96.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
      e39a9212
    • aarch64: vp8: Port epel4 functions from arm version · 58d15492
      Martin Storsjö authored
                            Cortex A53    A72    A73
      vp8_put_epel4_h4_c:        631.4  291.7  367.8
      vp8_put_epel4_h4_neon:     241.0  131.0  155.7
      vp8_put_epel4_h4v4_c:      967.5  529.3  667.7
      vp8_put_epel4_h4v4_neon:   429.3  241.8  279.7
      vp8_put_epel4_h4v6_c:     1374.7  657.5  864.5
      vp8_put_epel4_h4v6_neon:   515.5  295.5  334.7
      vp8_put_epel4_h6_c:        851.0  421.0  486.0
      vp8_put_epel4_h6_neon:     321.5  195.0  217.7
      vp8_put_epel4_h6v4_c:     1111.3  621.1  781.2
      vp8_put_epel4_h6v4_neon:   539.2  328.0  365.3
      vp8_put_epel4_h6v6_c:     1561.3  763.3  999.7
      vp8_put_epel4_h6v6_neon:   645.5  401.0  434.7
      vp8_put_epel4_v4_c:        663.8  298.3  357.0
      vp8_put_epel4_v4_neon:     116.0   81.5   72.5
      vp8_put_epel4_v6_c:        870.5  437.0  507.4
      vp8_put_epel4_v6_neon:     147.7  108.8   92.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
      58d15492
    • aarch64: vp8: Port missing epel8 functions from arm version · cc7ba00c
      Martin Storsjö authored
                            Cortex A53     A72     A73
      vp8_put_epel8_h4_c:       2594.8  1159.6  1374.8
      vp8_put_epel8_h4_neon:     506.4   244.2   314.0
      vp8_put_epel8_h6_c:       3445.8  1677.1  1811.3
      vp8_put_epel8_h6_neon:     634.4   371.7   433.0
      vp8_put_epel8_v4_c:       2614.0  1174.8  1378.0
      vp8_put_epel8_v4_neon:     321.0   221.7   235.8
      vp8_put_epel8_v6_c:       3635.5  1703.0  2079.2
      vp8_put_epel8_v6_neon:     416.9   317.0   295.5
      Signed-off-by: Martin Storsjö <martin@martin.st>
      cc7ba00c
    • aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version · 52c9b0a6
      Martin Storsjö authored
                           Cortex A53    A72    A73
      vp8_luma_dc_wht_c:        115.7   75.7   90.7
      vp8_luma_dc_wht_neon:      60.7   41.2   45.7
      vp8_idct_dc_add4uv_c:     376.1  262.9  282.5
      vp8_idct_dc_add4uv_neon:   52.0   29.0   37.0
      Signed-off-by: Martin Storsjö <martin@martin.st>
      52c9b0a6
    • aarch64: vp8: Fix a typo in a comment · c513fcd7
      Martin Storsjö authored
      Signed-off-by: Martin Storsjö <martin@martin.st>
      c513fcd7
    • aarch64: vp8: Move the vp8dsp makefile entries to the right places · b4b27dce
      Martin Storsjö authored
      Even if NEON is disabled, the init functions should be built,
      as they are called as long as ARCH_AARCH64 is set.
      
      These functions are part of a generic DSP subsystem, not tied directly
      to one decoder. (They should be built if the vp7 decoder is enabled,
      even if the vp8 decoder is disabled.)
      Signed-off-by: Martin Storsjö <martin@martin.st>
      b4b27dce
    • aarch64: vp8: Remove superfluous includes · ad32f7b1
      Martin Storsjö authored
      This fixes building with MSVC, which lacks unistd.h.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      ad32f7b1
    • aarch64: vp8: Use the proper aarch64 form for conditional branches · 85bfaa49
      Martin Storsjö authored
      The previous form also does seem to assemble on current tools,
      but I think it might fail on some older aarch64 tools.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      85bfaa49
    • Martin Storsjö's avatar
      2eeac799
    • aarch64: vp8: Fix assembling with clang · 26d7af4c
      Martin Storsjö authored
      This also partially fixes assembling with MS armasm64 (via
      gas-preprocessor).
      
      The movrel macro invocations need to pass the offset via a separate
      parameter. Mach-o and COFF relocations don't allow a negative
      offset to a symbol, which is handled properly if the offset is passed
      via the parameter. If no offset parameter is given, the macro
      evaluates to something like "adrp x17, subpel_filters-16+(0)", which
      older clang versions also fail to parse (the older clang versions
      only support one single offset term, although it can be a parenthesis).
      Signed-off-by: Martin Storsjö <martin@martin.st>
      26d7af4c
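The difference between folding the offset into the symbol argument and passing it separately can be sketched as a small text-expansion model. This is a simplified, hypothetical model of a movrel-style macro for Mach-O/COFF targets, not the project's actual macro: the function name `expand_movrel` is invented here, and relocation modifiers (page and page-offset suffixes) are left out on purpose.

```python
def expand_movrel(rd, symbol, offset=0):
    """Hypothetical model of a movrel-style macro expansion.

    Returns the instruction strings such a macro could emit. This is an
    illustration only; relocation modifiers are deliberately omitted.
    """
    if offset < 0:
        # Relocate against the bare symbol, then subtract at run time,
        # so no negative offset ends up in the relocation itself.
        return [
            f"adrp {rd}, {symbol}",
            f"add {rd}, {rd}, {symbol}",
            f"sub {rd}, {rd}, #{-offset}",
        ]
    # Non-negative offsets can stay in the relocation expression.
    return [
        f"adrp {rd}, {symbol}+({offset})",
        f"add {rd}, {rd}, {symbol}+({offset})",
    ]


# Offset folded into the symbol argument: two offset terms in one operand.
print(expand_movrel("x17", "subpel_filters-16")[0])
# Offset passed separately: the negative part never reaches the relocation.
print(expand_movrel("x17", "subpel_filters", -16))
```

In the first call the operand carries both "-16" and "+(0)", which is the two-term form the commit message describes older clang versions as failing to parse. In the second call the symbol reference stays offset-free.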
    • libavcodec: vp8 neon optimizations for aarch64 · 0801853e
      Magnus Röös authored
      Partial port of the ARM NEON code to aarch64.
      
      Benchmarks from fate:
      
      benchmarking with Linux Perf Monitoring API
      nop: 58.6
      checkasm: using random seed 1760970128
      NEON:
       - vp8dsp.idct       [OK]
       - vp8dsp.mc         [OK]
       - vp8dsp.loopfilter [OK]
      checkasm: all 21 tests passed
      vp8_idct_add_c: 201.6
      vp8_idct_add_neon: 83.1
      vp8_idct_dc_add_c: 107.6
      vp8_idct_dc_add_neon: 33.8
      vp8_idct_dc_add4y_c: 426.4
      vp8_idct_dc_add4y_neon: 59.4
      vp8_loop_filter8uv_h_c: 688.1
      vp8_loop_filter8uv_h_neon: 216.3
      vp8_loop_filter8uv_inner_h_c: 649.3
      vp8_loop_filter8uv_inner_h_neon: 195.3
      vp8_loop_filter8uv_inner_v_c: 544.8
      vp8_loop_filter8uv_inner_v_neon: 131.3
      vp8_loop_filter8uv_v_c: 706.1
      vp8_loop_filter8uv_v_neon: 141.1
      vp8_loop_filter16y_h_c: 668.8
      vp8_loop_filter16y_h_neon: 242.8
      vp8_loop_filter16y_inner_h_c: 647.3
      vp8_loop_filter16y_inner_h_neon: 224.6
      vp8_loop_filter16y_inner_v_c: 647.8
      vp8_loop_filter16y_inner_v_neon: 128.8
      vp8_loop_filter16y_v_c: 721.8
      vp8_loop_filter16y_v_neon: 154.3
      vp8_loop_filter_simple_h_c: 387.8
      vp8_loop_filter_simple_h_neon: 187.6
      vp8_loop_filter_simple_v_c: 384.1
      vp8_loop_filter_simple_v_neon: 78.6
      vp8_put_epel8_h4v4_c: 3971.1
      vp8_put_epel8_h4v4_neon: 855.1
      vp8_put_epel8_h4v6_c: 5060.1
      vp8_put_epel8_h4v6_neon: 989.6
      vp8_put_epel8_h6v4_c: 4320.8
      vp8_put_epel8_h6v4_neon: 1007.3
      vp8_put_epel8_h6v6_c: 5449.3
      vp8_put_epel8_h6v6_neon: 1158.1
      vp8_put_epel16_h6_c: 6683.8
      vp8_put_epel16_h6_neon: 831.8
      vp8_put_epel16_h6v6_c: 11110.8
      vp8_put_epel16_h6v6_neon: 2214.8
      vp8_put_epel16_v6_c: 7024.8
      vp8_put_epel16_v6_neon: 799.6
      vp8_put_pixels8_c: 112.8
      vp8_put_pixels8_neon: 78.1
      vp8_put_pixels16_c: 131.3
      vp8_put_pixels16_neon: 129.8
      
      This contains a fix to include guards by Carl Eugen Hoyos.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      0801853e
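The checkasm listing above pairs each `_c` result with a `_neon` result. A minimal sketch of turning such output into C/NEON cycle ratios follows; the sample lines are copied from the listing, while the function name and the parsing rules are assumptions for illustration and not part of checkasm.

```python
# A few result lines copied from the benchmark listing above.
SAMPLE = """\
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8
"""


def c_to_neon_ratios(text):
    """Map each function name to its C/NEON cycle ratio (higher means faster NEON)."""
    counts = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # status lines without a colon
        try:
            counts[name.strip()] = float(value)
        except ValueError:
            continue  # header lines such as "NEON:" or the seed line
    ratios = {}
    for name, c_cycles in counts.items():
        if not name.endswith("_c"):
            continue
        base = name[:-2]
        neon_cycles = counts.get(base + "_neon")
        if neon_cycles:
            ratios[base] = c_cycles / neon_cycles
    return ratios


for func, ratio in c_to_neon_ratios(SAMPLE).items():
    print(f"{func}: {ratio:.2f}x")
```

Lines that are not `name: number` pairs are skipped, so the full listing, including its header lines, can be passed in unchanged.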
    • Unbreak travis on macos · 899ee030
      Luca Barbato authored
      899ee030
  4. 16 Feb, 2019 11 commits
  5. 12 Feb, 2019 1 commit
    • srt: Set srto_sender flag to sender srt socket · 90b15f60
      Sven Dueking authored
      SRT API Documentation:
      This flag is superfluous if both parties are at least version 1.3.0
      (this shall be enforced by setting this value to SRTO_MINVERSION if
      you expect that it be true) and therefore support HSv5 handshake,
      where the SRT extended handshake is done with the overall handshake
      process.
      
      This flag is however obligatory if at least one party may be using
      SRT below version 1.3.0 and does not support HSv5.
      90b15f60
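The rule in the quoted documentation can be restated as a small version check. This is a sketch of the stated rule only, not libsrt API: the helper names are hypothetical, and version strings are assumed to be dotted integers such as "1.2.3".

```python
HSV5_MIN_VERSION = (1, 3, 0)  # first version with HSv5, per the quoted text


def parse_version(text):
    """Turn a dotted version string into a 3-tuple of ints, padding with zeros."""
    parts = tuple(int(part) for part in text.split("."))
    return parts + (0,) * (3 - len(parts))


def sender_flag_required(local_version, peer_version):
    """Return True if either side may be older than 1.3.0 (no HSv5 support)."""
    return any(parse_version(v) < HSV5_MIN_VERSION
               for v in (local_version, peer_version))


print(sender_flag_required("1.2.2", "1.3.0"))
print(sender_flag_required("1.3.0", "1.4.1"))
```

When the peer's version cannot be known in advance, the conservative reading of the rule is to treat the flag as required.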
  6. 27 Jan, 2019 1 commit
  7. 26 Jan, 2019 5 commits
    • libopenh264dec: Use a newer decoding entry point function · eec93e57
      Martin Storsjö authored
      The "new" entry point actually has existed since OpenH264 1.4 in
      2015 and is the recommended decoding entry point.
      
      The name of this function, DecodeFrameNoDelay, is rather backwards
      considering that it doesn't return the latest decoded frame immediately,
      but actually does proper delaying and reordering of frames.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      eec93e57
    • h264/aarch64: add intra loop filter neon asm · 28a8b541
      Janne Grunau authored
      Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported
      (x264 uses nv12 chroma) and optimized.
      
      Cycle count for checkasm --bench on a Snapdragon 820e:
      h264_h_loop_filter_luma_intra_8bpp_c: 60.0
      h264_h_loop_filter_luma_intra_8bpp_neon: 54.2
      h264_v_loop_filter_luma_intra_8bpp_c: 148.3
      h264_v_loop_filter_luma_intra_8bpp_neon: 73.8
      h264_h_loop_filter_chroma_intra_8bpp_c: 27.8
      h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4
      h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8
      h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7
      h264_v_loop_filter_chroma_intra_8bpp_c: 45.8
      h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3
      28a8b541
    • h264/aarch64: optimize neon loop filter · 846c3d6a
      Janne Grunau authored
      Exit as soon as possible if no filtering will be done.
      
      Improves the checkasm --bench cycle count on a Snapdragon 820e:
      h264_h_loop_filter_luma_8bpp_c:      72.4 ->  72.5
      h264_h_loop_filter_luma_8bpp_neon:   97.1 ->  56.3
      h264_v_loop_filter_luma_8bpp_c:     174.0 -> 173.5
      h264_v_loop_filter_luma_8bpp_neon:   62.9 ->  60.9
      h264_h_loop_filter_chroma_8bpp_c:    30.2 ->  30.3
      h264_h_loop_filter_chroma_8bpp_neon: 51.6 ->  25.7
      h264_v_loop_filter_chroma_8bpp_c:    57.3 ->  57.3
      h264_v_loop_filter_chroma_8bpp_neon: 28.0 ->  24.0
      846c3d6a
    • checkasm/h264: add loop filter tests · d7f4f5c4
      Janne Grunau authored
      d7f4f5c4
    • Janne Grunau's avatar
      bb515e3a
  8. 25 Jan, 2019 1 commit
    • arm: Create proper .rdata sections for COFF · 41cf3e3b
      Martin Storsjö authored
      As .rodata isn't one of the default created sections for COFF, it was
      created as a read-write data section. By using the default .rdata
      section name for COFF, it automatically becomes a read-only data section.
      The existing ".section .rodata" works as intended for ELF though.
      
      This is based on an original patch and diagnosis by Tom Tan
      <Tom.Tan@microsoft.com>.
      Signed-off-by: Martin Storsjö <martin@martin.st>
      41cf3e3b
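The section-name choice described above amounts to a lookup keyed on the object format. A minimal sketch, assuming only the two formats the commit message mentions; the helper name is hypothetical and other formats are deliberately not covered.

```python
def const_section_directive(object_format):
    """Return the assembler directive for read-only data (illustrative helper)."""
    directives = {
        "elf": ".section .rodata",  # works as intended for ELF
        "coff": ".section .rdata",  # default read-only data section for COFF
    }
    try:
        return directives[object_format.lower()]
    except KeyError:
        raise ValueError(
            f"object format not covered by this sketch: {object_format!r}"
        ) from None


print(const_section_directive("ELF"))
print(const_section_directive("COFF"))
```

Using `.rodata` on COFF would still assemble, but, as the commit message notes, the section would be created as read-write because it is not one of the default sections there.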
  9. 23 Jan, 2019 1 commit