1. 24 Jan, 2017 22 commits
    • aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC · 638eceed
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The differences from the 8 bit version are mostly the same as in the
      arm version. For the horizontal filters, we do 16 pixels
      in parallel as well. For the 8 pixel wide vertical filters, we can
      accumulate 4 rows before storing, just as in the 8 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM   AArch64
      vp9_avg4_10bpp_neon:                      35.7      30.7
      vp9_avg8_10bpp_neon:                      93.5      84.7
      vp9_avg16_10bpp_neon:                    324.4     296.6
      vp9_avg32_10bpp_neon:                   1236.5    1148.2
      vp9_avg64_10bpp_neon:                   4639.6    4571.1
      vp9_avg_8tap_smooth_4h_10bpp_neon:       130.0     128.0
      vp9_avg_8tap_smooth_4hv_10bpp_neon:      440.0     440.5
      vp9_avg_8tap_smooth_4v_10bpp_neon:       114.0     105.5
      vp9_avg_8tap_smooth_8h_10bpp_neon:       327.0     314.0
      vp9_avg_8tap_smooth_8hv_10bpp_neon:      918.7     865.4
      vp9_avg_8tap_smooth_8v_10bpp_neon:       330.0     300.2
      vp9_avg_8tap_smooth_16h_10bpp_neon:     1187.5    1155.5
      vp9_avg_8tap_smooth_16hv_10bpp_neon:    2663.1    2591.0
      vp9_avg_8tap_smooth_16v_10bpp_neon:     1107.4    1078.3
      vp9_avg_8tap_smooth_64h_10bpp_neon:    17754.6   17454.7
      vp9_avg_8tap_smooth_64hv_10bpp_neon:   33285.2   33001.5
      vp9_avg_8tap_smooth_64v_10bpp_neon:    16066.9   16048.6
      vp9_put4_10bpp_neon:                      25.5      21.7
      vp9_put8_10bpp_neon:                      56.0      52.0
      vp9_put16_10bpp_neon/armv8:              183.0     163.1
      vp9_put32_10bpp_neon/armv8:              678.6     563.1
      vp9_put64_10bpp_neon/armv8:             2679.9    2195.8
      vp9_put_8tap_smooth_4h_10bpp_neon:       120.0     118.0
      vp9_put_8tap_smooth_4hv_10bpp_neon:      435.2     435.0
      vp9_put_8tap_smooth_4v_10bpp_neon:       107.0      98.2
      vp9_put_8tap_smooth_8h_10bpp_neon:       303.0     290.0
      vp9_put_8tap_smooth_8hv_10bpp_neon:      893.7     828.7
      vp9_put_8tap_smooth_8v_10bpp_neon:       305.5     263.5
      vp9_put_8tap_smooth_16h_10bpp_neon:     1089.1    1059.2
      vp9_put_8tap_smooth_16hv_10bpp_neon:    2578.8    2452.4
      vp9_put_8tap_smooth_16v_10bpp_neon:     1009.5     933.5
      vp9_put_8tap_smooth_64h_10bpp_neon:    16223.4   15918.6
      vp9_put_8tap_smooth_64hv_10bpp_neon:   32153.0   31016.2
      vp9_put_8tap_smooth_64v_10bpp_neon:    14516.5   13748.1
      
      These are generally about as fast as the corresponding ARM
      routines on the same CPU (at least on the A53), in most cases
      marginally faster.
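
      As a rough illustration of "marginally faster", the ratio between the
      two columns can be worked out for a few rows. The three rows below are
      copied from the table above; the helper name is made up for this sketch.

```python
# Runtimes copied from the table above as (ARM, AArch64), measured on a Cortex A53.
timings = {
    "vp9_put64_10bpp": (2679.9, 2195.8),
    "vp9_put_8tap_smooth_8v_10bpp": (305.5, 263.5),
    "vp9_avg_8tap_smooth_4hv_10bpp": (440.0, 440.5),
}


def arm_over_aarch64(arm, aarch64):
    """Ratio of the ARM runtime to the AArch64 runtime (>1 means AArch64 is faster)."""
    return arm / aarch64


ratios = {name: arm_over_aarch64(*pair) for name, pair in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x")
```

      A ratio just above 1 means the AArch64 routine is slightly faster; the
      4hv row shows one of the few cases where it is not.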
      
      The speedup vs C code is around 4-9x.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9dsp: Restructure the bpp checks · 48ad3fe1
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is more in line with how it will be extended for more bitdepths.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter · 1e5d87ee
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is very similar to the 8 bpp version, but in some respects
      simpler. All input pixels are 16 bits, and all intermediates also fit
      in 16 bits, so there's no lengthening/narrowing in the filter at all.
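
      A quick bound check illustrates why 16 bits can be enough. This is a
      sketch that assumes the widest smoothing filters are rounded weighted
      averages whose weights sum to 8 (shift by 3) or 16 (shift by 4); the
      function name is hypothetical and the code is not taken from FFmpeg.

```python
def max_filter_sum(bit_depth, weight_sum):
    """Largest possible pre-shift sum of a rounded weighted average.

    Assumes non-negative weights adding up to weight_sum and a rounding
    constant of weight_sum / 2 (an assumption of this sketch).
    """
    max_pixel = (1 << bit_depth) - 1
    rounding = weight_sum // 2
    return max_pixel * weight_sum + rounding


for bit_depth in (10, 12):
    for weight_sum in (8, 16):
        worst = max_filter_sum(bit_depth, weight_sum)
        fits = worst < (1 << 16)  # fits in an unsigned 16 bit lane
        print(f"{bit_depth} bit, weights sum to {weight_sum}: {worst} fits={fits}")
```

      Under these assumptions even the 12 bit case with weights summing to 16
      stays just below 65536.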
      
      For the full 16 pixel wide filter, we can only process 4 pixels at a time
      (using an implementation very much similar to the one for 8 bpp),
      but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with
      a different implementation of the core filter.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                         Cortex    A7     A8     A9    A53
      vp9_loop_filter_h_4_8_10bpp_neon:          1.83   2.16   1.40   2.09
      vp9_loop_filter_h_8_8_10bpp_neon:          1.39   1.67   1.24   1.70
      vp9_loop_filter_h_16_8_10bpp_neon:         1.56   1.47   1.10   1.81
      vp9_loop_filter_h_16_16_10bpp_neon:        1.94   1.69   1.33   2.24
      vp9_loop_filter_mix2_h_44_16_10bpp_neon:   2.01   2.27   1.67   2.39
      vp9_loop_filter_mix2_h_48_16_10bpp_neon:   1.84   2.06   1.45   2.19
      vp9_loop_filter_mix2_h_84_16_10bpp_neon:   1.89   2.20   1.47   2.29
      vp9_loop_filter_mix2_h_88_16_10bpp_neon:   1.69   2.12   1.47   2.08
      vp9_loop_filter_mix2_v_44_16_10bpp_neon:   3.16   3.98   2.50   4.05
      vp9_loop_filter_mix2_v_48_16_10bpp_neon:   2.84   3.64   2.25   3.77
      vp9_loop_filter_mix2_v_84_16_10bpp_neon:   2.65   3.45   2.16   3.54
      vp9_loop_filter_mix2_v_88_16_10bpp_neon:   2.55   3.30   2.16   3.55
      vp9_loop_filter_v_4_8_10bpp_neon:          2.85   3.97   2.24   3.68
      vp9_loop_filter_v_8_8_10bpp_neon:          2.27   3.19   1.96   3.08
      vp9_loop_filter_v_16_8_10bpp_neon:         3.42   2.74   2.26   4.40
      vp9_loop_filter_v_16_16_10bpp_neon:        2.86   2.44   1.93   3.88
      
      The speedup vs C code measured in checkasm is around 1.1-4x.
      These numbers are quite inconclusive though, since the checkasm test
      runs multiple filterings on top of each other, so later rounds might
      end up with different codepaths (different decisions on which filter
      to apply, based on input pixel differences).
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 2-4x.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm · 2ed67eba
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is structured similarly to the 8 bit version. In the 8 bit
      version, the coefficients are 16 bits, and intermediates are 32 bits.
      
      Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
      content, the intermediates also fit in 32 bits, but for all other
      transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
      and 12 bit) the intermediates are 64 bit.
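
      The step from 32 bit to 64 bit intermediates can be illustrated with a
      small bound check. This sketch assumes 14 bit fixed-point transform
      constants and a butterfly that adds two products before rounding; the
      input magnitudes are illustrative and are not the actual coefficient
      ranges for 10 or 12 bit content.

```python
import math

COEFF_BITS = 14  # assumed fixed-point precision of the transform constants
COS_PI_4 = round(math.cos(math.pi / 4) * (1 << COEFF_BITS))  # 11585

INT32_MAX = (1 << 31) - 1


def butterfly_fits_int32(max_input):
    """True if a*c + b*c plus rounding fits in int32 for |a|, |b| <= max_input."""
    rounding = 1 << (COEFF_BITS - 1)
    return 2 * max_input * COS_PI_4 + rounding <= INT32_MAX


for bits in (15, 16, 17, 18):
    print(f"inputs up to 2^{bits}: fits in 32 bits = {butterfly_fits_int32(1 << bits)}")
```

      Once the inputs grow past a certain magnitude, the sum of products no
      longer fits in 32 bits and a wider accumulator is needed.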
      
      For the existing 8 bit case, the 8x8 transform fit all coefficients in
      registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
      transform also has to be done in slices of 4 pixels (just as 16x16 and
      32x32 for 8 bit).
      
      The slice width also shrinks from 4 elements to 2 elements in parallel
      for the 16x16 and 32x32 cases.
      
      The 16 bit coefficients from idct_coeffs and similar tables also need
      to be lengthened to 32 bit in order to be used in multiplication with
      vectors with 32 bit elements. This leads to the fixed coefficient
      vectors needing more space, leading to more cases where they have to
      be reloaded within the transform (in iadst16).
      
      This technically would need testing in checkasm for subpartitions
      in increments of 2, but that slows down normal checkasm runs
      excessively.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                           Cortex    A7     A8     A9    A53
      vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
      vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
      vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
      vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
      vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
      vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
      vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
      vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
      vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
      vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
      vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
      vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
      vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
      vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
      vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
      vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
      vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97
      
      The speedup compared to the C functions is around 1.3 to 7x for the
      full transforms, even higher for the smaller subpartitions.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: Add NEON optimizations for 10 and 12 bit vp9 MC · a4d4bad7
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      The plain pixel put/copy functions are used from the 8 bit version,
      for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
      copy128 is added.
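
      The "double size" mapping follows from each 10 or 12 bit sample being
      stored in 16 bits. A minimal sketch of the arithmetic; the function name
      is hypothetical, and only copy32 and copy128 are named in the message
      above.

```python
BYTES_PER_SAMPLE = 2  # 10 and 12 bit samples are stored in 16 bits


def copy_width_bytes(block_width_px):
    """Bytes per row that a plain copy has to move for a high bit depth block."""
    return block_width_px * BYTES_PER_SAMPLE


for width in (4, 8, 16, 32, 64):
    print(f"put{width} at 10/12 bpp copies {copy_width_bytes(width)} bytes per row")
```

      A 64 pixel wide block therefore needs a 128 byte copy, which is why a
      new copy128 is required.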
      
      Compared with the 8 bit version, the filters can no longer use the
      trick to accumulate in 16 bit with only saturation at the end, but now
      the accumulators need to be 32 bit. This avoids the need to keep track
      of which filter index is the largest though, reducing the size of the
      executable code for these filters.
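
      The need for 32 bit accumulators can be seen from a worst-case bound.
      The sketch below uses a made-up 8-tap kernel whose taps sum to 128; it
      is not one of the actual vp9 filter kernels, and the function names are
      hypothetical.

```python
# Hypothetical 8-tap kernel with taps summing to 128; not copied from FFmpeg.
KERNEL = [-1, 3, -10, 72, 72, -10, 3, -1]


def worst_case_accumulator(bit_depth, kernel=KERNEL):
    """Largest positive partial sum: every positive tap sees a maximum pixel."""
    max_pixel = (1 << bit_depth) - 1
    return sum(tap for tap in kernel if tap > 0) * max_pixel


def filter_pixel(window, bit_depth, kernel=KERNEL):
    """Filter one output pixel with a wide accumulator, round, then clip."""
    acc = sum(pixel * tap for pixel, tap in zip(window, kernel))
    value = (acc + 64) >> 7
    return max(0, min(value, (1 << bit_depth) - 1))


for bit_depth in (10, 12):
    worst = worst_case_accumulator(bit_depth)
    print(f"{bit_depth} bit: worst case {worst}, exceeds int16 = {worst > 32767}")

print(filter_pixel([1023] * 8, 10))
```

      With 10 or 12 bit pixels the partial sums are far beyond the 16 bit
      range, so saturating once at the end is no longer an option.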
      
      For the horizontal filters, we only do 4 or 8 pixels wide in parallel
      (while doing two rows at a time), since we don't have enough register
      space to filter 16 pixels wide.
      
      For the vertical filters, we still do 4 and 8 pixels in parallel just
      as in the 8 bit case, but we need to store the output after every 2
      rows instead of after every 4 rows.
      
      Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
      vp9_avg4_10bpp_neon:                   2.25   2.44   3.05   2.16
      vp9_avg8_10bpp_neon:                   3.66   8.48   3.86   3.50
      vp9_avg16_10bpp_neon:                  3.39   8.26   3.37   2.72
      vp9_avg32_10bpp_neon:                  4.03  10.20   4.07   3.42
      vp9_avg64_10bpp_neon:                  4.15  10.01   4.13   3.70
      vp9_avg_8tap_smooth_4h_10bpp_neon:     3.38   6.22   3.41   4.75
      vp9_avg_8tap_smooth_4hv_10bpp_neon:    3.89   6.39   4.30   5.32
      vp9_avg_8tap_smooth_4v_10bpp_neon:     5.32   9.73   6.34   7.31
      vp9_avg_8tap_smooth_8h_10bpp_neon:     4.45   9.40   4.68   6.87
      vp9_avg_8tap_smooth_8hv_10bpp_neon:    4.64   8.91   5.44   6.47
      vp9_avg_8tap_smooth_8v_10bpp_neon:     6.44  13.42   8.68   8.79
      vp9_avg_8tap_smooth_64h_10bpp_neon:    4.66   9.02   4.84   7.71
      vp9_avg_8tap_smooth_64hv_10bpp_neon:   4.61   9.14   4.92   7.10
      vp9_avg_8tap_smooth_64v_10bpp_neon:    6.90  14.13   9.57  10.41
      vp9_put4_10bpp_neon:                   1.33   1.46   2.09   1.33
      vp9_put8_10bpp_neon:                   1.57   3.42   1.83   1.84
      vp9_put16_10bpp_neon:                  1.55   4.78   2.17   1.89
      vp9_put32_10bpp_neon:                  2.06   5.35   2.14   2.30
      vp9_put64_10bpp_neon:                  3.00   2.41   1.95   1.66
      vp9_put_8tap_smooth_4h_10bpp_neon:     3.19   5.81   3.31   4.63
      vp9_put_8tap_smooth_4hv_10bpp_neon:    3.86   6.22   4.32   5.21
      vp9_put_8tap_smooth_4v_10bpp_neon:     5.40   9.77   6.08   7.21
      vp9_put_8tap_smooth_8h_10bpp_neon:     4.22   8.41   4.46   6.63
      vp9_put_8tap_smooth_8hv_10bpp_neon:    4.56   8.51   5.39   6.25
      vp9_put_8tap_smooth_8v_10bpp_neon:     6.60  12.43   8.17   8.89
      vp9_put_8tap_smooth_64h_10bpp_neon:    4.41   8.59   4.54   7.49
      vp9_put_8tap_smooth_64hv_10bpp_neon:   4.43   8.58   5.34   6.63
      vp9_put_8tap_smooth_64v_10bpp_neon:    7.26  13.92   9.27  10.92
      
      For the larger 8tap filters, the speedup vs C code is around 4-14x.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • arm: vp9dsp: Restructure the bpp checks · cda9a3e8
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      This is more in line with how it will be extended for more bitdepths.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • Merge commit 'fd5e6a09' · 1400598c
      Clément Bœsch authored
      * commit 'fd5e6a09':
        x86util: Extend SPLATW for avx2
      
      This commit is a noop, see 1ace9573
      (only libavutil/x86/x86util.asm chunk).
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit '37961044' · f84ece0a
      Clément Bœsch authored
      * commit '37961044':
        checkasm: arm: Ignore changes to bits 0-4 and 7 of FPSCR
        cheackasm/arm: remove NEON instructions from checkasm_checked_call_vfp
        checkasm: arm: Don't start new const blocks for each string
      
      This merge is a noop: the changes were included in 9f1c81e5.
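
      For reference, "bits 0-4 and 7" corresponds to a mask of 0x9f. The
      sketch below shows how such a mask could be used to compare two
      register values; it illustrates the bit arithmetic only and is not how
      checkasm implements the check. All names are hypothetical.

```python
def bit_mask(bits):
    """Build an integer mask with the given bit positions set."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


IGNORED_FPSCR_BITS = bit_mask([0, 1, 2, 3, 4, 7])


def fpscr_changed(before, after):
    """True if any bit outside the ignored set differs between two 32 bit values."""
    return ((before ^ after) & ~IGNORED_FPSCR_BITS & 0xFFFFFFFF) != 0


print(hex(IGNORED_FPSCR_BITS))
print(fpscr_changed(0x00000000, 0x00000080))
print(fpscr_changed(0x00000000, 0x00400000))
```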
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit '5ece6911' · 727c463f
      Clément Bœsch authored
      * commit '5ece6911':
        apichanges: Fill in missing hashes and dates
      
      This commit is a noop as we need to fill with our own hashes.
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit 'facdfe40' · 4181d774
      Clément Bœsch authored
      * commit 'facdfe40':
        swscale: Add proper ff_ prefix to init functions
      
      This commit is a noop, see e8c37160
      
      I'm keeping our ff_sws_ vs ff_ since we use ff_sws_ in other places in
      swscale.
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit 'c0fd2fb2' · 4ad5b936
      Clément Bœsch authored
      * commit 'c0fd2fb2':
        swscale: Rename sws_context_class to ff_sws_context_class
      
      This commit is a noop, see 8bfbc8c5
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit '71a04721' · 9f1c81e5
      Clément Bœsch authored
      * commit '71a04721':
        checkasm: arm: report the first clobbered register in checkasm_checked_call
      
      Also includes 446353ea, 59aeed93, and 37961044 to avoid breaking
      too much stuff.
      Merged-by: Clément Bœsch <u@pkh.me>
    • avcodec/mjpegdec: Check remaining bitstream in ljpeg_decode_yuv_scan() · 755933cb
      Michael Niedermayer authored
      Fixes timeout
      Fixes: 445/fuzz-3-ffmpeg_VIDEO_AV_CODEC_ID_MJPEG_fuzzer
      Fixes: 456/fuzz-2-ffmpeg_VIDEO_AV_CODEC_ID_JPEGLS_fuzzer
      
      Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
    • Merge commit 'a8fce24b' · 8504d64b
      Clément Bœsch authored
      * commit 'a8fce24b':
        avconv_dxva2: support HEVC Main10 decoding
      
      This commit is a noop, see 1ec14612
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit '33f6690e' · 5f74ce0e
      Clément Bœsch authored
      * commit '33f6690e':
        hevc: offer DXVA2 for 10bit 420
      
      This commit is a noop, see ccb94789
      Merged-by: Clément Bœsch <u@pkh.me>
    • Merge commit '38efff92' · 74480198
      Clément Bœsch authored
      * commit '38efff92':
        FATE: add a test for H.264 with two fields per packet
        h264: fix decoding multiple fields per packet with slice threads
      
      This merge includes two commits because the FATE test was needed to
      test the fix properly.
      
      The merge gets rid of the now unused:
      - SLICE_SINGLETHREAD and SLICE_SKIPED macros
      - max_contexts
      - "again" label in decode_nal_units()
      
      This commit also includes the fix from d3e4d406.
      
      Thanks to wm4 and Michael Niedermayer for their testing.
      Merged-by: Clément Bœsch <u@pkh.me>
      Merged-by: Matthieu Bouron <matthieu.bouron@gmail.com>
    • Steven Liu
      1033f56b
    • avcodec/h264dec: Fix regression with "make fate-h264-attachment-631 THREADS=8" · 25f4f08b
      Michael Niedermayer authored
      This treats the case of no slices like no frames which it basically is.
      
      The field is added to the context because other NAL-related fields are
      also there, and passing has_slices through pointer arguments would be
      ugly and inconsistent.
      
      Found-by: ubitux
      Approved-by: ubitux
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
    • avfilter: add EIA-608 line extractor · 08e57323
      Paul B Mahol authored
      Signed-off-by: Dave Rice <dave@dericed.com>
      Signed-off-by: Paul B Mahol <onemda@gmail.com>
    • avformat/flvenc: refine the flvenc shift_data code · 1bb192ef
      Steven Liu authored
      refine the flvenc shift_data move data option
      Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
    • Steven Liu
    • libavformat/tee: tee was passing a wrong option name for fifo's format_options · b7665642
      Felipe Astroza authored
      If fifo is enabled on tee muxer, ffmpeg exits because of an unknown option passed to fifo muxer.
      Option name "format_options" was replaced by "format_opts" on tee muxer.
      Signed-off-by: Felipe Astroza <felipe@astroza.cl>
      Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
  2. 23 Jan, 2017 4 commits
  3. 22 Jan, 2017 9 commits
  4. 21 Jan, 2017 5 commits