1. 06 Jan, 2020 1 commit
  2. 04 Jan, 2020 1 commit
    • swscale/aarch64: use multiply accumulate and shift-right narrow · c3a17fff
      Sebastian Pop authored
      This patch rewrites the innermost loop of ff_yuv2planeX_8_neon to avoid zips and
      horizontal adds by using fused multiply adds. It also uses ld1r to load
      one element and replicate it across all lanes of the vector, and it
      improves the clipping code by removing the separate shift-right instructions
      and performing the shift with the shift-right narrow instructions.
      
      I measured an 8% speedup on an m6g instance with Neoverse-N1 CPUs:
      $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
      before: t:0.014015 avg:0.014096 max:0.015018 min:0.013971
      after:  t:0.012985 avg:0.013013 max:0.013996 min:0.012818
      
      Tested with `make check` on aarch64-linux.
      Signed-off-by: Sebastian Pop <spop@amazon.com>
      Reviewed-by: Clément Bœsch <u@pkh.me>
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
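As an illustration of the arithmetic described in the commit above, the following scalar C sketch models a multiply-accumulate inner loop followed by a saturating shift-right narrow. It is not the NEON code from the patch: the function names, the bias parameter, and the shift amount are assumptions chosen for the example, and the right shift of a negative value is assumed to be arithmetic, as it is on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for a signed saturating shift-right narrow to unsigned
 * 8 bits: shift first, then clamp the result to 0..255. */
static uint8_t shift_right_narrow_u8(int32_t acc, int shift)
{
    assert(shift >= 0 && shift < 32);
    int32_t v = acc >> shift;   /* assumed arithmetic shift for negative acc */
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Hypothetical vertical filter:
 * dst[i] = clip((bias + sum_j coeff[j] * src[j][i]) >> shift). */
static void vertical_filter_u8(const int16_t *coeff, int taps,
                               const int16_t *const *src,
                               int32_t bias, int shift,
                               uint8_t *dst, int width)
{
    for (int i = 0; i < width; i++) {
        int32_t acc = bias;
        for (int j = 0; j < taps; j++)
            acc += (int32_t)coeff[j] * src[j][i];   /* one multiply-accumulate per tap */
        dst[i] = shift_right_narrow_u8(acc, shift); /* shift and clip in one step */
    }
}
```

With two taps of 2048 and a shift of 12, the input rows {100, -2000, 32000} and {300, -1000, 32000} give 200, 0, and 255: the first value passes through, the second clamps at the lower bound, and the third clamps at the upper bound.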
  3. 31 Dec, 2019 1 commit
  4. 17 Dec, 2019 1 commit
    • swscale/aarch64: use multiply accumulate and increase vector factor to 4 · bd831912
      Sebastian Pop authored
      This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
      and bumps the vectorization factor from 2 to 4.
      The speedup is 25% on Graviton1 A1 instances based on Cortex-A72 CPUs:
      
      $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
      before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
      after:  t:0.032168 avg:0.032215 max:0.033081 min:0.032146
      
      The speedup is 39% on Graviton2 m6g instances based on Neoverse-N1 CPUs:
      $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
      before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181
      after:  t:0.014015 avg:0.014096 max:0.015018 min:0.013971
      
      Tested with `make check` on aarch64-linux.
      Signed-off-by: Sebastian Pop <spop@amazon.com>
      Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org>
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
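To make "vectorization factor 4" in the commit above concrete, here is a scalar C sketch of a horizontal scaling loop written once with one output per iteration and once with four independent accumulators per iteration. It is a model, not the NEON implementation: the function names are hypothetical, and the 7-bit shift and 15-bit clamp are assumptions about how an 8-bit to 15-bit filter normalizes its result.

```c
#include <assert.h>
#include <stdint.h>

/* Shift the accumulator down and clamp to the 15-bit maximum (assumed). */
static int16_t narrow_to_15(int32_t acc)
{
    int32_t v = acc >> 7;
    return (int16_t)(v > 32767 ? 32767 : v);
}

/* Reference: one output sample per outer iteration. */
static void hscale_ref(int16_t *dst, int dst_w, const uint8_t *src,
                       const int16_t *filter, const int32_t *filter_pos,
                       int filter_size)
{
    for (int i = 0; i < dst_w; i++) {
        int32_t acc = 0;
        for (int j = 0; j < filter_size; j++)
            acc += (int32_t)src[filter_pos[i] + j] * filter[filter_size * i + j];
        dst[i] = narrow_to_15(acc);
    }
}

/* Same computation with four outputs in flight, mirroring a factor of 4. */
static void hscale_x4(int16_t *dst, int dst_w, const uint8_t *src,
                      const int16_t *filter, const int32_t *filter_pos,
                      int filter_size)
{
    assert(dst_w % 4 == 0);   /* the sketch handles no tail iterations */
    for (int i = 0; i < dst_w; i += 4) {
        int32_t acc[4] = {0, 0, 0, 0};
        for (int j = 0; j < filter_size; j++)
            for (int k = 0; k < 4; k++)   /* four multiply-accumulates per tap */
                acc[k] += (int32_t)src[filter_pos[i + k] + j] *
                          filter[filter_size * (i + k) + j];
        for (int k = 0; k < 4; k++)
            dst[i + k] = narrow_to_15(acc[k]);
    }
}
```

Both functions produce identical output; the second only exposes four independent accumulation chains per iteration, which is the kind of parallelism a wider vector loop can exploit.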
  5. 10 Dec, 2019 1 commit
  6. 06 Dec, 2019 1 commit
  7. 01 Nov, 2019 1 commit
  8. 16 Oct, 2019 3 commits
  9. 04 Oct, 2019 2 commits
    • swscale: Fix AltiVec/VSX build with recent GCC · e6625ca4
      Daniel Kolesa authored
      The argument to vec_splat_u16 must be a literal. By making the
      function always inline and marking the arguments const, gcc can
      turn those into literals, and avoid build errors like:
      
      swscale_vsx.c:165:53: error: argument 1 must be a 5-bit signed literal
      
      Fixes #7861.
      Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
      Signed-off-by: Lauri Kasanen <cand@gmx.com>
    • swscale: Replace illegal vector keyword usage in altivec code · 1bdb47b7
      Daniel Kolesa authored
      While this technically compiles in current ffmpeg, this is only
      because ffmpeg is compiled in strict ISO C mode, which disables
      the builtin 'vector' keyword for AltiVec/VSX. Instead, the keyword
      is replaced by a macro inside altivec.h, which defines vector as
      __vector, and __vector accepts arbitrary types.
      
      Normally, the vector keyword should be used only with plain
      scalar non-typedef types, such as unsigned int. But we have the
      vec_(s|u)(8|16|32) macros, which can be used in a portable manner,
      in util_altivec.h in libavutil.
      
      This is also consistent with other AltiVec/VSX code elsewhere in
      the tree.
      
      Fixes #7861.
      Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
      Signed-off-by: Lauri Kasanen <cand@gmx.com>
  10. 28 Sep, 2019 2 commits
  11. 27 Sep, 2019 1 commit
  12. 26 Sep, 2019 1 commit
  13. 09 Sep, 2019 1 commit
  14. 06 Sep, 2019 1 commit
  15. 13 Aug, 2019 1 commit
    • lsws/ppc/yuv2rgb_altivec: Replace vec_lvsl/vec_perm with vec_xl · 3a557c5d
      Chip Kerchner authored
      gcc 6.x and 7.x generate wrong code on little-endian machines
      for the vec_lvsl/vec_perm instruction combination in some cases.
      The bug was fixed in gcc 8.x.
      Replacing these instructions with vec_xl makes the problem go
      away for all compiler versions.
      
      Fixes ticket #7124.
  16. 21 Jul, 2019 2 commits
  17. 13 May, 2019 2 commits
  18. 12 May, 2019 2 commits
    • swscale: Add test for isSemiPlanarYUV to pixdesc_query · 4fa4f1d7
      Philip Langdale authored
      Lauri had asked me what the semi-planar formats were, and that reminded
      me that we could add the check to pixdesc_query so we know exactly what
      the list is.
    • swscale: Add support for NV24 and NV42 · cd483180
      Philip Langdale authored
      The implementation is pretty straightforward. Most of the existing
      NV12 codepaths work regardless of subsampling and are reused as is.
      Where necessary I wrote the slightly different NV24 versions.
      
      Finally, the one thing that confused me for a long time was the
      asm-specific x86 path that did an explicit exclusion check for NV12.
      I replaced that with a semi-planar check and also updated the
      equivalent PPC code, which Lauri kindly checked.
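The point that most NV12 code works "regardless of subsampling" can be sketched in C as follows. NV24 and NV42 store a luma plane followed by a single full-resolution plane of interleaved chroma, in U,V order for NV24 and V,U order for NV42. The helper names and the log2_chroma_w parameter are hypothetical and are not taken from the swscale sources.

```c
#include <assert.h>
#include <stdint.h>

/* Chroma samples per row for a given horizontal subsampling shift:
 * shift 1 models NV12-style 4:2:0 (half width, rounded up),
 * shift 0 models NV24-style 4:4:4 (full width). */
static int chroma_width(int luma_w, int log2_chroma_w)
{
    assert(luma_w >= 0 && log2_chroma_w >= 0 && log2_chroma_w < 16);
    return (luma_w + (1 << log2_chroma_w) - 1) >> log2_chroma_w;
}

/* Interleave one row of planar U and V into a semi-planar chroma row.
 * swap_uv == 0 writes U,V pairs (NV12/NV24 order);
 * swap_uv != 0 writes V,U pairs (NV21/NV42 order). */
static void interleave_chroma_row(uint8_t *dst, const uint8_t *u,
                                  const uint8_t *v, int chroma_w, int swap_uv)
{
    for (int x = 0; x < chroma_w; x++) {
        dst[2 * x]     = swap_uv ? v[x] : u[x];
        dst[2 * x + 1] = swap_uv ? u[x] : v[x];
    }
}
```

The interleaving loop itself never looks at the subsampling; only the width passed in changes, which is why a single routine can serve both the subsampled and the full-resolution semi-planar formats.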
  19. 07 May, 2019 4 commits
    • Lauri Kasanen's avatar
      e25bddf5
    • swscale/ppc: VSX-optimize hScale16To* · a2a16206
      Lauri Kasanen authored
      ./ffmpeg -loop 1 -s 1200x1440 -i tux16.png \
          -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p16le -nostats test.raw
      
      ./ffmpeg -loop 1 -s 1200x1440 -i tux16.png \
          -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p -nostats test.raw
      
      32-bit mul, power8 only
      
      2x speedup for hScale8To19_vsx (x86 SSE2 is 2.37):
        30896 UNITS in hscale,    8192 runs,      0 skips
        63956 UNITS in hscale,    8192 runs,      0 skips
      
      2.06 for hScale16To15_vsx:
        30531 UNITS in hscale,    8192 runs,      0 skips
        63161 UNITS in hscale,    8192 runs,      0 skips
    • swscale/ppc: Indent · 3437111f
      Lauri Kasanen authored
    • swscale/ppc: VSX-optimize hScale8To19 · 9456adc2
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
          -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p16le -nostats test.raw
      
      2.26 speedup (x86 SSE2 is 2.32):
        23772 UNITS in hscale,    4096 runs,      0 skips
        53862 UNITS in hscale,    4096 runs,      0 skips
  20. 30 Apr, 2019 1 commit
    • swscale/ppc: VSX-optimize hscale_fast · d0e4d042
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \
              -s 2400x720 -f rawvideo -vframes 5 -pix_fmt abgr -nostats test.raw
      
      4.27 speedup for hyscale_fast:
        24796 UNITS in hyscale_fast,    4096 runs,      0 skips
         5797 UNITS in hyscale_fast,    4096 runs,      0 skips
      
      4.48 speedup for hcscale_fast:
        19911 UNITS in hcscale_fast,    4095 runs,      1 skips
         4437 UNITS in hcscale_fast,    4096 runs,      0 skips
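The speedup figures quoted in these benchmark listings appear to be the ratio of the two UNITS counts reported for the same function (baseline divided by optimized). A minimal C helper, with a hypothetical name, makes that arithmetic explicit:

```c
#include <assert.h>

/* Speedup as the ratio of the baseline cost to the optimized cost.
 * Both inputs are the UNITS figures from the timer output and must
 * be positive. */
static double speedup(double before_units, double after_units)
{
    assert(before_units > 0.0 && after_units > 0.0);
    return before_units / after_units;
}
```

For the commit above, speedup(24796, 5797) is roughly 4.28 and speedup(19911, 4437) is roughly 4.49, which agrees with the quoted 4.27 and 4.48 up to rounding.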
  21. 11 Apr, 2019 1 commit
    • swscale/ppc: VSX-optimize non-full-chroma yuv2rgb_2 · ce92ee4b
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \
              -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
              -cpuflags 0 -v error -
      
      32-bit mul, power8 only.
      
      ~2x speedup:
      
      rgb24
        24431 UNITS in yuv2packed2,   16384 runs,      0 skips
        13783 UNITS in yuv2packed2,   16383 runs,      1 skips
      bgr24
        24396 UNITS in yuv2packed2,   16384 runs,      0 skips
        14059 UNITS in yuv2packed2,   16384 runs,      0 skips
      rgba
        26815 UNITS in yuv2packed2,   16383 runs,      1 skips
        12797 UNITS in yuv2packed2,   16383 runs,      1 skips
      bgra
        27060 UNITS in yuv2packed2,   16384 runs,      0 skips
        13138 UNITS in yuv2packed2,   16384 runs,      0 skips
      argb
        26998 UNITS in yuv2packed2,   16384 runs,      0 skips
        12728 UNITS in yuv2packed2,   16381 runs,      3 skips
      bgra
        26651 UNITS in yuv2packed2,   16384 runs,      0 skips
        13124 UNITS in yuv2packed2,   16384 runs,      0 skips
      
      This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version
      is also heavily inaccurate, while the vsx version has high accuracy.
  22. 07 Apr, 2019 3 commits
    • swscale/ppc: VSX-optimize yuv2rgb_full_X · 8607e29f
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
                      -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
                      -cpuflags 0 -v error -
      
      32-bit mul, power8 only.
      
      ~6.4x speedup:
      
      rgb24
       214278 UNITS in yuv2packedX,   16384 runs,      0 skips
        33249 UNITS in yuv2packedX,   16384 runs,      0 skips
      bgr24
       214616 UNITS in yuv2packedX,   16384 runs,      0 skips
        33233 UNITS in yuv2packedX,   16384 runs,      0 skips
      rgba
       214517 UNITS in yuv2packedX,   16384 runs,      0 skips
        33271 UNITS in yuv2packedX,   16384 runs,      0 skips
      bgra
       214973 UNITS in yuv2packedX,   16384 runs,      0 skips
        33397 UNITS in yuv2packedX,   16384 runs,      0 skips
      argb
       214613 UNITS in yuv2packedX,   16384 runs,      0 skips
        33310 UNITS in yuv2packedX,   16384 runs,      0 skips
      bgra
       214637 UNITS in yuv2packedX,   16384 runs,      0 skips
        33330 UNITS in yuv2packedX,   16384 runs,      0 skips
    • swscale/ppc: VSX-optimize yuv2rgb_full_2 · 3256e949
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags area \
                  -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
                  -cpuflags 0 -v error -
      
      32-bit mul, power8 only.
      
      ~4x speedup:
      
      rgb24
        52763 UNITS in yuv2packed2,   16384 runs,      0 skips
        13453 UNITS in yuv2packed2,   16384 runs,      0 skips
      bgr24
        53144 UNITS in yuv2packed2,   16384 runs,      0 skips
        13616 UNITS in yuv2packed2,   16384 runs,      0 skips
      rgba
        52796 UNITS in yuv2packed2,   16384 runs,      0 skips
        12904 UNITS in yuv2packed2,   16384 runs,      0 skips
      bgra
        52732 UNITS in yuv2packed2,   16384 runs,      0 skips
        13262 UNITS in yuv2packed2,   16384 runs,      0 skips
      argb
        52661 UNITS in yuv2packed2,   16384 runs,      0 skips
        12879 UNITS in yuv2packed2,   16384 runs,      0 skips
      bgra
        52662 UNITS in yuv2packed2,   16384 runs,      0 skips
        12932 UNITS in yuv2packed2,   16384 runs,      0 skips
    • swscale/ppc: VSX-optimize non-full-chroma yuv2rgb_1 · 50e672bc
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \
              -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \
              -cpuflags 0 -v error -
      
      32-bit mul, power8 only.
      
      1.8-2.3x speedup:
      
      rgb24
        18192 UNITS in yuv2packed1,   32767 runs,      1 skips
         9983 UNITS in yuv2packed1,   32760 runs,      8 skips
      bgr24
        18665 UNITS in yuv2packed1,   32766 runs,      2 skips
         9925 UNITS in yuv2packed1,   32763 runs,      5 skips
      rgba
        20239 UNITS in yuv2packed1,   32767 runs,      1 skips
         8794 UNITS in yuv2packed1,   32759 runs,      9 skips
      bgra
        20354 UNITS in yuv2packed1,   32768 runs,      0 skips
         8770 UNITS in yuv2packed1,   32761 runs,      7 skips
      argb
        20185 UNITS in yuv2packed1,   32768 runs,      0 skips
         8761 UNITS in yuv2packed1,   32761 runs,      7 skips
      bgra
        20360 UNITS in yuv2packed1,   32766 runs,      2 skips
         8759 UNITS in yuv2packed1,   32764 runs,      4 skips
      
      This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version
      is also heavily inaccurate, while the vsx version has high accuracy.
  23. 31 Mar, 2019 3 commits
    • swscale/ppc: VSX-optimize yuv2422_X · 7adce3e6
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
                -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
                -cpuflags 0 -v error -
      
      7.2x speedup:
      
      yuyv422
       126354 UNITS in yuv2packedX,   16384 runs,      0 skips
        16383 UNITS in yuv2packedX,   16382 runs,      2 skips
      yvyu422
       117669 UNITS in yuv2packedX,   16384 runs,      0 skips
        16271 UNITS in yuv2packedX,   16379 runs,      5 skips
      uyvy422
       117310 UNITS in yuv2packedX,   16384 runs,      0 skips
        16226 UNITS in yuv2packedX,   16382 runs,      2 skips
    • swscale/ppc: VSX-optimize yuv2422_2 · 9a2db4dc
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags area \
                      -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
                      -cpuflags 0 -v error -
      
      5.1x speedup:
      
      yuyv422
        19339 UNITS in yuv2packed2,   16384 runs,      0 skips
         3718 UNITS in yuv2packed2,   16383 runs,      1 skips
      yvyu422
        19438 UNITS in yuv2packed2,   16384 runs,      0 skips
         3800 UNITS in yuv2packed2,   16380 runs,      4 skips
      uyvy422
        19128 UNITS in yuv2packed2,   16384 runs,      0 skips
         3721 UNITS in yuv2packed2,   16380 runs,      4 skips
    • swscale/ppc: VSX-optimize yuv2422_1 · a6a31ca3
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
                  -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \
                  -cpuflags 0 -v error -
      
      15.3x speedup:
      
      yuyv422
        14513 UNITS in yuv2packed1,   32768 runs,      0 skips
          949 UNITS in yuv2packed1,   32767 runs,      1 skips
      yvyu422
        14516 UNITS in yuv2packed1,   32767 runs,      1 skips
          943 UNITS in yuv2packed1,   32767 runs,      1 skips
      uyvy422
        14530 UNITS in yuv2packed1,   32767 runs,      1 skips
          941 UNITS in yuv2packed1,   32766 runs,      2 skips
  24. 28 Mar, 2019 2 commits
  25. 27 Mar, 2019 1 commit
    • swscale/ppc: VSX-optimize yuv2rgb_full · 681957b8
      Lauri Kasanen authored
      ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
              -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \
              -cpuflags 0 -v error -
      
      This uses 32-bit mul, so POWER8 only.
      
      The following output formats get about 4.5x speedup:
      
      rgb24
        39980 UNITS in yuv2packed1,   32768 runs,      0 skips
         8774 UNITS in yuv2packed1,   32768 runs,      0 skips
      bgr24
        40069 UNITS in yuv2packed1,   32768 runs,      0 skips
         8772 UNITS in yuv2packed1,   32766 runs,      2 skips
      rgba
        39759 UNITS in yuv2packed1,   32768 runs,      0 skips
         8681 UNITS in yuv2packed1,   32767 runs,      1 skips
      bgra
        39729 UNITS in yuv2packed1,   32768 runs,      0 skips
         8696 UNITS in yuv2packed1,   32766 runs,      2 skips
      argb
        39766 UNITS in yuv2packed1,   32768 runs,      0 skips
         8672 UNITS in yuv2packed1,   32766 runs,      2 skips
      bgra
        39784 UNITS in yuv2packed1,   32768 runs,      0 skips
         8659 UNITS in yuv2packed1,   32767 runs,      1 skips