Commits · db0b3dccb3842de134721e8d5c275f56d384340d · Linshizhi / ffmpeg.wasm-core

04 Nov, 2016 1 commit

arm: vp9mc: Insert a literal pool at the middle of the file · 392caa65

Martin Storsjö authored 8 years ago

This fixes errors like the following when building non-PIC binaries with
ARMv6 as the baseline:

Error: invalid literal constant: pool needs to be closer

Signed-off-by: Martin Storsjö <martin@martin.st>


03 Nov, 2016 1 commit

arm: vp9: Add NEON optimizations of VP9 MC functions · ffbd1d2b

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

The filter coefficients are signed values, where the product of the
multiplication with one individual filter coefficient doesn't
overflow a 16 bit signed value (the largest filter coefficient is
127). But when the products are accumulated, the resulting sum can
overflow the 16 bit signed range. Instead of accumulating in 32 bit,
we accumulate the largest product (either index 3 or 4) last with a
saturated addition.

(The VP8 MC asm does something similar, but slightly simpler, by
accumulating each half of the filter separately. In the VP9 MC
filters, each half of the filter can also overflow though, so the
largest component has to be handled individually.)
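
The accumulation order described above can be sketched in scalar C. This is a minimal model, not the NEON code from the commit: the function names, the `(sum + 64) >> 7` rounding step and the coefficient set used in the note below are assumptions made for illustration.

```c
#include <stdint.h>

/* Saturating 16-bit add, modelling a per-lane saturating addition. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* 8-tap filter with a 16-bit accumulator. The seven smaller products are
 * summed first; the product at index `largest` (3 or 4) is added last with
 * saturation. The rounding and shift by 7 are assumptions of this sketch,
 * and an arithmetic right shift is assumed for negative sums. */
static uint8_t filter8_sat16(const uint8_t *src, const int16_t *coef, int largest)
{
    int16_t acc = 0;
    for (int i = 0; i < 8; i++)
        if (i != largest)
            acc = (int16_t)(acc + src[i] * coef[i]);
    acc = sat_add16(acc, (int16_t)(src[largest] * coef[largest]));
    return clip_u8((acc + 64) >> 7);
}

/* Reference with a 32-bit accumulator, for comparison. */
static uint8_t filter8_ref32(const uint8_t *src, const int16_t *coef)
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += src[i] * coef[i];
    return clip_u8((acc + 64) >> 7);
}
```

With the illustrative coefficients `{ -2, 6, -13, 37, 115, -20, 9, -4 }` (sum 128) and the input `{ 0, 255, 0, 255, 255, 0, 255, 0 }`, the exact sum is 42585, which does not fit in 16 bits. The saturated sum is 32767, and both functions still return 255 after the shift and clip.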

Examples of relative speedup compared to the C version, from checkasm:
                       Cortex      A7     A8     A9    A53
vp9_avg4_neon:                   1.71   1.15   1.42   1.49
vp9_avg8_neon:                   2.51   3.63   3.14   2.58
vp9_avg16_neon:                  2.95   6.76   3.01   2.84
vp9_avg32_neon:                  3.29   6.64   2.85   3.00
vp9_avg64_neon:                  3.47   6.67   3.14   2.80
vp9_avg_8tap_smooth_4h_neon:     3.22   4.73   2.76   4.67
vp9_avg_8tap_smooth_4hv_neon:    3.67   4.76   3.28   4.71
vp9_avg_8tap_smooth_4v_neon:     5.52   7.60   4.60   6.31
vp9_avg_8tap_smooth_8h_neon:     6.22   9.04   5.12   9.32
vp9_avg_8tap_smooth_8hv_neon:    6.38   8.21   5.72   8.17
vp9_avg_8tap_smooth_8v_neon:     9.22  12.66   8.15  11.10
vp9_avg_8tap_smooth_64h_neon:    7.02  10.23   5.54  11.58
vp9_avg_8tap_smooth_64hv_neon:   6.76   9.46   5.93   9.40
vp9_avg_8tap_smooth_64v_neon:   10.76  14.13   9.46  13.37
vp9_put4_neon:                   1.11   1.47   1.00   1.21
vp9_put8_neon:                   1.23   2.17   1.94   1.48
vp9_put16_neon:                  1.63   4.02   1.73   1.97
vp9_put32_neon:                  1.56   4.92   2.00   1.96
vp9_put64_neon:                  2.10   5.28   2.03   2.35
vp9_put_8tap_smooth_4h_neon:     3.11   4.35   2.63   4.35
vp9_put_8tap_smooth_4hv_neon:    3.67   4.69   3.25   4.71
vp9_put_8tap_smooth_4v_neon:     5.45   7.27   4.49   6.52
vp9_put_8tap_smooth_8h_neon:     5.97   8.18   4.81   8.56
vp9_put_8tap_smooth_8hv_neon:    6.39   7.90   5.64   8.15
vp9_put_8tap_smooth_8v_neon:     9.03  11.84   8.07  11.51
vp9_put_8tap_smooth_64h_neon:    6.78   9.48   4.88  10.89
vp9_put_8tap_smooth_64hv_neon:   6.99   8.87   5.94   9.56
vp9_put_8tap_smooth_64v_neon:   10.69  13.30   9.43  14.34

For the larger 8tap filters, the speedup vs C code is around 5-14x.

This is significantly faster than libvpx's implementation of the same
functions, at least when comparing the put_8tap_smooth_64 functions
(compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
libvpx).

Absolute runtimes from checkasm:
                          Cortex      A7        A8        A9       A53
vp9_put_8tap_smooth_64h_neon:    20150.3   14489.4   19733.6   10863.7
libvpx vpx_convolve8_horiz_neon: 52623.3   19736.4   21907.7   25027.7

vp9_put_8tap_smooth_64v_neon:    14455.0   12303.9   13746.4    9628.9
libvpx vpx_convolve8_vert_neon:  42090.0   17706.2   17659.9   16941.2

Thus, on the A9, the horizontal filter is only marginally faster than
libvpx, while our version is significantly faster on the other cores,
and the vertical filter is significantly faster on all cores. The
difference is especially large on the A7.
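
The comparison above can be checked by dividing the libvpx runtime by ours for each core. The sketch below only restates the numbers from the table; the array and function names are hypothetical.

```c
/* Absolute checkasm runtimes quoted above.
 * Columns are Cortex-A7, A8, A9 and A53, in that order. */
static const double ours_h[4]   = { 20150.3, 14489.4, 19733.6, 10863.7 };
static const double libvpx_h[4] = { 52623.3, 19736.4, 21907.7, 25027.7 };
static const double ours_v[4]   = { 14455.0, 12303.9, 13746.4,  9628.9 };
static const double libvpx_v[4] = { 42090.0, 17706.2, 17659.9, 16941.2 };

/* How many times faster `ours` is than `other` (lower runtime is better). */
static double speedup(double ours, double other)
{
    return other / ours;
}
```

For the horizontal filter this gives roughly 1.11x on the A9 and 2.61x on the A7; for the vertical filter it gives roughly 2.91x on the A7.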

The libvpx implementation does the accumulation in 32 bit, which
probably explains most of the differences.

Signed-off-by: Martin Storsjö <martin@martin.st>


29 Sep, 2016 3 commits
- h264chroma: Change type of stride parameters to ptrdiff_t · e4a94d8b
  Diego Biurrun authored 8 years ago
```
This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.
```
- idct: Change type of array stride parameters to ptrdiff_t · 2ec9fa5e
  Diego Biurrun authored 8 years ago
```
ptrdiff_t is the correct type for array strides and similar.
```
- hpeldsp: arm: Update comments left behind in 25841dfe · 92c5755a
  Diego Biurrun authored 8 years ago
  
28 Sep, 2016 1 commit
- lavc: add clobber tests for the new encoding/decoding API · de2ae3c1
  Anton Khirnov authored 8 years ago
  
22 Sep, 2016 2 commits

audiodsp: reorder arguments for vector_clipf · 683da86a

Anton Khirnov authored 8 years ago

This will make the x86 asm simpler.

ARM conversion by Martin Storsjö <martin@martin.st> and Janne Grunau
<janne-libav@jannau.net>


blockdsp: drop the high_bit_depth parameter · eea9857b
Anton Khirnov authored 8 years ago
```
It has no effect, since the code is supposed to operate the same way for
any bit depth.
```

14 Sep, 2016 1 commit

pixblockdsp: Change type of stride parameters to ptrdiff_t · de452e50

Diego Biurrun authored 8 years ago

This avoids SIMD-optimized functions having to sign-extend their
line size argument manually to be able to do pointer arithmetic.

Also adjust parameter names to be "stride" everywhere.
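
A minimal C sketch of why `ptrdiff_t` suits stride parameters: the value takes part in pointer arithmetic directly, including when it is negative. The helper name `sum_column` is hypothetical and not part of the codebase.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums one pixel per row, stepping `stride` bytes between rows.
 * Because stride is a ptrdiff_t, it already has pointer width and sign,
 * so no manual widening is needed before indexing. A negative stride
 * walks the rows bottom-up. */
static int sum_column(const uint8_t *pixels, ptrdiff_t stride, int rows)
{
    int sum = 0;
    for (int y = 0; y < rows; y++)
        sum += pixels[(ptrdiff_t)y * stride];
    return sum;
}
```

For a 4-byte-wide, 3-row image, `sum_column(img, 4, 3)` walks the first column top-down, and `sum_column(img + 8, -4, 3)` walks the same column bottom-up.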


26 Aug, 2016 4 commits

vp56: Separate VP5 and VP6 dsp initialization · 721d57e6

Diego Biurrun authored 8 years ago

VP5 has no arch-specific optimizations (nor will it get any in the
future), so it makes no sense to try to share dsp init code with VP6.


vp8: Update some assembly comments left unchanged in bd66f073 · 802727b5
Diego Biurrun authored 8 years ago


vp56: Change type of stride parameters to ptrdiff_t · d9d26a36

Diego Biurrun authored 8 years ago

This avoids SIMD-optimized functions having to sign-extend their
line size argument manually to be able to do pointer arithmetic.


vp3: Change type of stride parameters to ptrdiff_t · 6892df92

Diego Biurrun authored 8 years ago

This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.

Also adjust parameter names to be "stride" everywhere.


17 Aug, 2016 1 commit
- simple_idct: arm: Drop disabled code variant · 014852e9
  Diego Biurrun authored 8 years ago
  
10 Jul, 2016 1 commit

vp8/armv6: mc: avoid boolean expression in calculation · 5f74bd31

Janne Grunau authored 8 years ago

GNU as evaluates true as '-1' while Apple's variant and llvm's internal
assembler evaluate it as '1'. The best way to avoid this madness is to
eliminate boolean expressions instead of trying to fix it with
preprocessor directives. Use a direct formula to calculate the
required temporary space on the stack in
ff_put_vp8_{epel,bilin}{4,8,16}_h[246]v[246]_armv6().

Fixes a checkasm segfault in vp8dsp.mc when using llvm's internal
assembler for a non-Apple target.
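
The effect of the differing truth values can be modelled in C. This is only an illustration of the failure mode: the row-count formula and the function names are hypothetical and are not the expression used in the VP8 assembly.

```c
/* Models an assembler expression such as
 *     rows = h + 1 + (taps > 2) * 2 + (taps > 4) * 2
 * where `truth` is the value the assembler assigns to a true comparison:
 * -1 for GNU as, 1 for Apple's and llvm's assemblers. */
static int rows_with_boolean(int h, int taps, int truth)
{
    int gt2 = taps > 2 ? truth : 0;
    int gt4 = taps > 4 ? truth : 0;
    return h + 1 + gt2 * 2 + gt4 * 2;
}

/* Boolean-free formula: the same result under every assembler. */
static int rows_direct(int h, int taps)
{
    return h + taps - 1;
}
```

With `h = 8` and `taps = 6`, the boolean form yields 13 when true is 1 but only 5 when true is -1, so a buffer sized from it would be too small under one of the conventions. The direct form yields 13 in both cases.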


06 Jul, 2016 1 commit
- arm: Fix a typo in a comment · e8b96a77
  Martin Storsjö authored 8 years ago
```
Signed-off-by: Martin Storsjö <martin@martin.st>
```
26 Jun, 2016 1 commit
- libavcodec: fix constness in clobber test avcodec_open2() wrappers · 4a081f22
  Clément Bœsch authored 8 years ago
```
Signed-off-by: Martin Storsjö <martin@martin.st>
```
13 May, 2016 1 commit
- tests: Move all test programs to a subdirectory · a6a750c7
  Diego Biurrun authored 8 years ago
  
04 May, 2016 1 commit
- cosmetics: Fix spelling mistakes · 41ed7ab4
  Vittorio Giovara authored 8 years ago
```
Signed-off-by: Diego Biurrun <diego@biurrun.de>
```
07 Apr, 2016 1 commit

build: miscellaneous cosmetics · 01621202

Diego Biurrun authored 8 years ago

Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.


01 Mar, 2016 1 commit
- fft: Split MDCT bits off from FFT · 1a094af6
  Diego Biurrun authored 8 years ago
  
26 Feb, 2016 2 commits
- rdft: arm: Split RDFT initialization into a separate file · 4c297249
  Diego Biurrun authored 8 years ago
  
- fft: arm: Drop unnecessary #include, add missing ones · 97aec6e7
  Diego Biurrun authored 8 years ago
  
19 Feb, 2016 1 commit
- build: Add vc1dsp component for more fine-grained dependencies · 15a24614
  Diego Biurrun authored 8 years ago
  
24 Dec, 2015 1 commit
- dca: remove unused decode_hf function and quant_d tables · 2008f760
  Alexandra Hájková authored 9 years ago
```
They were superseded by their integer equivalents. Rename the integer
decode_hf variant to decode_hf.
```
14 Dec, 2015 2 commits

arm: add ff_int32_to_float_fmul_array8_neon · 90b1b935

Janne Grunau authored 9 years ago

Quite a bit faster than int32_to_float_fmul_array8_c calling
ff_int32_to_float_fmul_scalar_neon through FmtConvertContext.
Number of cycles per int32_to_float_fmul_array8 call while decoding
padded.dts on exynos5422:

               before  after   change
cortex-a7:     1270     951    -25%
cortex-a15:     434     285    -34%

checkasm --bench cycle counts:     cortex-a15   cortex-a7
int32_to_float_fmul_array8_c:      1730.4       4384.5
int32_to_float_fmul_array8_neon_c:  571.5       1694.3
int32_to_float_fmul_array8_neon:    374.0       1448.8

The differences between int32_to_float_fmul_array8_neon_c and
int32_to_float_fmul_array8_neon are interesting. The former is the
current behaviour of calling ff_int32_to_float_fmul_scalar_neon
repeatedly from the C function. The raw numbers differ since checkasm
uses different lengths than the dca decoder.
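
For orientation, the C behaviour being replaced can be sketched as follows. The signatures are simplified assumptions (the FmtConvertContext indirection is dropped), and the sketch assumes `len` is a multiple of 8 with one multiplier per group of 8 samples.

```c
#include <stdint.h>

/* Scales `len` integers by a single multiplier. */
static void int32_to_float_fmul_scalar(float *dst, const int32_t *src,
                                       float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * mul;
}

/* Processes the input in groups of 8, calling the scalar routine once per
 * group with the next multiplier from `mul`. A dedicated implementation
 * avoids this per-group call overhead. */
static void int32_to_float_fmul_array8(float *dst, const int32_t *src,
                                       const float *mul, int len)
{
    for (int i = 0; i < len; i += 8)
        int32_to_float_fmul_scalar(&dst[i], &src[i], *mul++, 8);
}
```

Calling it with 16 samples and two multipliers scales the first 8 samples by `mul[0]` and the next 8 by `mul[1]`.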


arm: add a cpu flag for the VFPv2 vector mode · e2710e79

Janne Grunau authored 9 years ago

The vector mode was deprecated in ARMv7-A/VFPv3 and various cpu
implementations do not support it in hardware. Depending on the OS,
vector mode code will either be emulated in software or result in an
illegal instruction on cpus which do not support it. This was not really
a problem in practice since NEON implementations of the same functions
are preferred. It will however become a problem for checkasm, which
tests every cpu flag separately.

Since this is a cpu feature which newer cpus no longer support, the
behaviour of this flag differs from the other flags. It can only be
activated by runtime cpu feature selection.
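
One way to read this rule is sketched below. The flag names, bit values and the function `effective_cpu_flags` are hypothetical and only model the described behaviour: a build-time baseline never enables the vector mode flag, while runtime detection may.

```c
/* Hypothetical flag bits for illustration; real names and values may differ. */
enum {
    FLAG_VFP    = 1 << 0,
    FLAG_VFP_VM = 1 << 1, /* VFPv2 vector mode */
    FLAG_NEON   = 1 << 2,
};

/* Combines flags implied by the build configuration with flags found by
 * runtime detection. The vector mode bit is stripped from the build-time
 * baseline, so it can only come from runtime detection. */
static unsigned effective_cpu_flags(unsigned build_baseline,
                                    unsigned runtime_detected)
{
    return (build_baseline & ~(unsigned)FLAG_VFP_VM) | runtime_detected;
}
```

Under this model, a build that assumes VFP as baseline never runs vector mode code unless the running cpu was detected to support it.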


20 Jul, 2015 1 commit

arm: use a local label instead of the function symbol in ff_prefetch_arm · 9ed6f9a1

Janne Grunau authored 9 years ago

Avoids a relocation which might end up out of range for thumb2.
Reported-By: Ludovic Fauvet <etix@videolan.org>
Bug-Id: https://bugs.webkit.org/show_bug.cgi?id=137022
CC: libav-stable@libav.org


17 Jul, 2015 5 commits
- h264: arm: use intra pred8x8 functions only for chroma_format_idc <= 1 · 256ef198
  Janne Grunau authored 9 years ago
  
- configure: Factor out g722dsp module · f5ee2300
  Vittorio Giovara authored 9 years ago
  
- configure: Factor out vp8dsp module · d42191c7
  Vittorio Giovara authored 9 years ago
  
- configure: Factor out rv34dsp module · 5cb4bdb2
  Vittorio Giovara authored 9 years ago
  
- configure: Factor out flacdsp module · b075869b
  Vittorio Giovara authored 9 years ago
  
28 Feb, 2015 2 commits
- lavc: do not compile fmtconvert unconditionally · 71f1ad37
  Anton Khirnov authored 9 years ago
```
Only ac3dec and dcadec use it.
```
- fmtconvert: drop unused functions · d74a8cb7
  Anton Khirnov authored 9 years ago
  
15 Feb, 2015 1 commit
- g722: Add ARM NEON implementation for g722_apply_qmf() · 70245853
  Peter Meerwald authored 9 years ago
```
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
```
09 Dec, 2014 3 commits
- arm: mlpdsp: handle pic offset calculation in a macro · 4c81613d
  Janne Grunau authored 10 years ago
```
Makes the code easier to read since it hides different offset
calculations for arm and thumb mode.
```
- arm: make ff_mlp_filter_channel_arm and ff_mlp_rematrix_channel_arm position independent · 581c7f0e
  Janne Grunau authored 10 years ago
```
No significant difference in used cpu cycles on a cortex-a9.
```
- arm: Use .data.rel.ro for const data with relocations · f963f803
  Martin Storsjö authored 10 years ago
```
Signed-off-by: Martin Storsjö <martin@martin.st>
```
08 Dec, 2014 1 commit

arm: fft_vfp: Unify the behaviour in ff_fft_calc_vfp between arm/thumb · b280c620

Martin Storsjö authored 10 years ago

Don't include the function pointer table in the code segment
in arm mode.

This shouldn't have any significant performance effect. It does
end up as a few more instructions than before, for ARM, but
only at the entry to this function, not within the fft functions
themselves.

Signed-off-by: Martin Storsjö <martin@martin.st>
