Commits · 83f2555e5ff571cbf5c226a920602e91228039ab · Linshizhi / ffmpeg.wasm-core

21 Mar, 2019 1 commit

arm: Implement a NEON version of 422 h264_h_loop_filter_chroma · 0676de93

Martin Storsjö authored 6 years ago

Previously, the 420 version was used even for 422.

This fixes occasional checkasm failures.
Signed-off-by: Martin Storsjö <martin@martin.st>

0676de93

20 Feb, 2019 1 commit

arm/h264dsp: change loop filter stride argument to ptrdiff_t · 7b9ca44c

James Almer authored 6 years ago

This was missed in d5d699ab
Signed-off-by: James Almer <jamrial@gmail.com>

7b9ca44c

19 Feb, 2019 1 commit

arm: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 · cef914e0

Martin Storsjö authored 6 years ago

This makes it similar to put_epel16_v6, and gives a 10-25%
speedup of this function.

Before:                   Cortex A7       A8       A9      A53     A72
vp8_put_epel16_h6v6_neon:    3058.0   2218.5   2459.8   2183.0  1572.2
After:
vp8_put_epel16_h6v6_neon:    2670.8   1934.2   2244.4   1729.4  1503.9
Signed-off-by: Martin Storsjö <martin@martin.st>

cef914e0
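
As a rough cross-check of the figures quoted in cef914e0, the relative saving per core can be worked out from the two benchmark rows. The snippet below is only an arithmetic sketch: the numbers are copied from the table in that commit message, while the helper name and the dictionary are made up for illustration.

```python
# Cycle counts copied from the benchmark table in commit cef914e0.
CORES = ["A7", "A8", "A9", "A53", "A72"]
BEFORE = [3058.0, 2218.5, 2459.8, 2183.0, 1572.2]
AFTER = [2670.8, 1934.2, 2244.4, 1729.4, 1503.9]


def cycle_reduction_percent(before: float, after: float) -> float:
    """Percentage of cycles saved relative to the old implementation."""
    return (before - after) / before * 100.0


# Hypothetical structure for illustration: map each core to its saving.
reductions = {
    core: cycle_reduction_percent(b, a)
    for core, b, a in zip(CORES, BEFORE, AFTER)
}

for core, pct in reductions.items():
    print(f"Cortex {core}: {pct:.1f}% fewer cycles")
```

The saving differs from core to core, which is presumably why the message quotes a range rather than a single figure.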

09 Apr, 2018 1 commit

avcodec/arm/hevcdsp_sao : add NEON optimization for sao · 3b2fd960

Meng Wang authored 6 years ago

Signed-off-by: Meng Wang <wangmeng.kids@bytedance.com>
Reviewed-by: Shengbin Meng <shengbinmeng@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

3b2fd960

31 Mar, 2018 2 commits

arm: hevcdsp: Add commas between macro arguments · 5f83935d

Martin Storsjö authored 6 years ago

When targeting darwin, clang requires commas between arguments,
while the no-comma form is allowed for other targets.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.
Signed-off-by: Martin Storsjö <martin@martin.st>

5f83935d

arm: hevcdsp: Avoid using macro expansion counters · 6660bc03

Martin Storsjö authored 6 years ago

Clang supports the macro expansion counter (used for making unique
labels within macro expansions), but not when targeting darwin.

Convert uses of the counter into normal local labels, as used
elsewhere.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.
Signed-off-by: Martin Storsjö <martin@martin.st>

6660bc03

30 Mar, 2018 1 commit

arm: vc1dsp: Add commas between macro arguments · ab05d393

Martin Storsjö authored 6 years ago

When targeting darwin, clang requires commas between arguments,
while the no-comma form is allowed for other targets.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.
Signed-off-by: Martin Storsjö <martin@martin.st>

ab05d393

07 Mar, 2018 1 commit

sbcenc: add armv6 and neon asm optimizations · f677718b

Aurelien Jacobs authored 7 years ago

This was originally based on libsbc, and was fully integrated into ffmpeg.

f677718b

12 Jan, 2018 1 commit

avcodec/arm/sbrdsp_neon: Use a free register instead of putting 2 things in one · 7dbbb75e

Michael Niedermayer authored 7 years ago

Fixes high pitched shriek
Fixes: 25420848_1478428308873746_4255813235963330560_n.mp4
Reported-by: Dale Curtis <dalecurtis@google.com>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

7dbbb75e

09 Dec, 2017 1 commit

arm/hevc_idct: fix compilation on Android · 36de24d5

James Almer authored 7 years ago

Fix the "out of range" compilation error for armeabi-v7a. Compilation failed
when building libvlc.aar for ARMv7 Android on an Ubuntu 16.04 host, with the
error message "Offset out of range". The cause is that the assembler LDR
directives in the function "ff_hevc_transform_luma_4x4_neon_8" need local
storage within a range of <1k, but no such storage was provided.

Based on a patch by Ihor Bobalo <bob@eleks.com>

Suggested-by: wbs
Signed-off-by: James Almer <jamrial@gmail.com>

36de24d5

08 Dec, 2017 1 commit

hevc: Add hevc_get_pixel_4/8/12/16/24/32/48/64 · 7993ec19

Alexandra Hájková authored 7 years ago

Checkasm timings:
block size bitdepth       C   NEON
4           8 bit:    146.7   48.7
           10 bit:    146.7   52.7
8           8 bit:    430.3   84.4
           10 bit:    430.4  119.5
12          8 bit:    812.8  141.0
           10 bit:    812.8  195.0
16          8 bit:   1499.1  268.0
           10 bit:   1498.9  368.4
24          8 bit:   4394.2  574.8
           10 bit:   3696.3  804.8
32          8 bit:   5108.6  568.9
           10 bit:   4249.6  918.8
48          8 bit:  16819.6 2304.9
           10 bit:  13882.0 3178.5
64          8 bit:  13490.8 1799.5
           10 bit:  11018.5 2519.4
Signed-off-by: Martin Storsjö <martin@martin.st>

7993ec19

24 Oct, 2017 1 commit

arm: Remove a redundant check in fmtconvert_init_arm.c · b487add7

Martin Storsjö authored 7 years ago

This was missed in e2710e79, where have_vfp && !have_vfpv3 were
converted into have_vfp_vm.
Signed-off-by: Martin Storsjö <martin@martin.st>

b487add7

02 Sep, 2017 1 commit

arm: Fix SIGBUS on ARM when compiled with binutils 2.29 · 9dde6ab0

Martin Storsjö authored 7 years ago

In binutils 2.29, the behavior of the ADR instruction changed so that 1 is
added to the address of a Thumb function (previously nothing was added). This
allows the loaded address to be passed to a BLX instruction and the correct
mode change will occur.

See: https://sourceware.org/bugzilla/show_bug.cgi?id=21458

By using adr with a label that isn't annotated as a thumb function,
we avoid the new behaviour in binutils 2.29 and get the same behaviour
as in prior releases, and as in other assemblers (ms armasm.exe,
clang's built in assembler) - an idea that Janne Grunau came up with.
Signed-off-by: Martin Storsjö <martin@martin.st>

9dde6ab0
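
The behaviour change described in 9dde6ab0 can be modelled in a few lines. This is a toy sketch, not assembler code: `adr_result`, its parameters and the address `0x8000` are hypothetical, and the only rule encoded is the one stated in the commit message (binutils 2.29 adds 1 to the address of a label annotated as a Thumb function; earlier releases and other assemblers add nothing).

```python
def adr_result(label_address: int, is_thumb_function: bool,
               new_binutils: bool) -> int:
    """Model the value ADR loads for a label (hypothetical helper).

    Assumption taken from the commit message: binutils 2.29 adds 1 to the
    address when the label is annotated as a Thumb function, while older
    releases and other assemblers leave the address untouched.
    """
    if new_binutils and is_thumb_function:
        return label_address + 1  # Thumb bit set, suitable for BLX
    return label_address


ADDRESS = 0x8000  # made-up, halfword-aligned label address

# Label annotated as a Thumb function: the result depends on the assembler.
old = adr_result(ADDRESS, is_thumb_function=True, new_binutils=False)
new = adr_result(ADDRESS, is_thumb_function=True, new_binutils=True)

# Plain label (the workaround): every assembler version agrees.
plain_old = adr_result(ADDRESS, is_thumb_function=False, new_binutils=False)
plain_new = adr_result(ADDRESS, is_thumb_function=False, new_binutils=True)

print(hex(old), hex(new), hex(plain_old), hex(plain_new))
```

Under this model, code that treats the loaded value as a plain address would see an odd value with the newer assembler, whereas a label without the function annotation yields the same value everywhere.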

11 Jul, 2017 1 commit

avcodec/rdft: remove sintable · 0780ad9c

Muhammad Faiz authored 7 years ago

It is redundant with costable. The first half of sintable is
identical to the second half of costable. The second half
of sintable is the negation of the first half of sintable.

The computation is changed to handle the sign of the sin values, in
both the C code and the ARM assembly code.
Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>

0780ad9c
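
The redundancy claimed in 0780ad9c can be checked numerically. The sketch below is an assumption-laden model rather than the FFmpeg code: it supposes that costable holds cos(2πi/n) for i ≤ n/4 and is mirrored into its second half, and that sintable holds sin(2πi/n) for i < n/4 followed by the same values with the opposite sign. The function names and the size 64 are hypothetical.

```python
import math


def build_costable(n: int) -> list[float]:
    """cos(2*pi*i/n) for i <= n/4, mirrored into the second half (assumed layout)."""
    table = [0.0] * (n // 2)
    for i in range(n // 4 + 1):
        table[i] = math.cos(2.0 * math.pi * i / n)
    for i in range(1, n // 4):
        table[n // 2 - i] = table[i]
    return table


def build_sintable(n: int) -> list[float]:
    """First half: sin(2*pi*i/n); second half: the same angles negated (assumed layout)."""
    quarter = n // 4
    first = [math.sin(2.0 * math.pi * i / n) for i in range(quarter)]
    second = [math.sin(-2.0 * math.pi * i / n) for i in range(quarter)]
    return first + second


N = 64  # arbitrary power-of-two transform size for the check
cos_tab = build_costable(N)
sin_tab = build_sintable(N)
quarter = N // 4

# First half of sintable vs. second half of costable.
first_half_matches = all(
    math.isclose(sin_tab[i], cos_tab[quarter + i], abs_tol=1e-12)
    for i in range(quarter)
)
# Second half of sintable vs. negated first half of sintable.
second_half_is_negated = all(
    math.isclose(sin_tab[quarter + i], -sin_tab[i], abs_tol=1e-12)
    for i in range(quarter)
)

print(first_half_matches, second_half_is_negated)
```

If both checks hold, every sintable entry can be read from costable with an index offset and, where needed, a sign flip, which is the sign handling the message refers to.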

28 Jun, 2017 2 commits

lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis · b12a3617

Clément Bœsch authored 7 years ago

b12a3617

lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon · e4a27e2f

Clément Bœsch authored 7 years ago

The code originally pre-multiplied the steps by 2, causing the running sum
of the h factors to drift due to the lack of precision. This quickly
causes an inaccuracy > 0.01.

I tried various approaches, such as multiplying by 2.0 (instead of adding
the value to itself), without success.

I'm unable to bench the impact of this change, feel free to compare.

This commit fixes the incoming aacpsdsp tests.

Following is an alternative simplified function (matching the incoming
AArch64 code) that may be used:

function ff_ps_stereo_interpolate_neon, export=1
        vld1.32         {q0}, [r2]
        vld1.32         {q1}, [r3]
        ldr             r12, [sp]
        vmov.f32        q8, q0
        vmov.f32        q9, q1
        vzip.32         q8, q0
        vzip.32         q9, q1
1:
        vld1.32         {d4}, [r0,:64]
        vld1.32         {d6}, [r1,:64]
        vadd.f32        q8, q8, q9
        vadd.f32        q0, q0, q1
        vmov.f32        d5, d4
        vmov.f32        d7, d6
        vmul.f32        q2, q2, q8
        vmla.f32        q2, q3, q0
        vst1.32         {d4}, [r0,:64]!
        vst1.32         {d5}, [r1,:64]!
        subs            r12, r12, #1
        bgt             1b
        bx              lr
endfunc

e4a27e2f

15 May, 2017 1 commit

arm: Avoid using .dn register aliases · d7320ca3

Martin Storsjö authored 7 years ago

clang now (in the upcoming 5.0 version) is capable of building our
arm assembly without relying on gas-preprocessor, although clang/LLVM
doesn't support .dn register aliases.

The VC1 MC assembly was only built and used if the chosen assembler
supported the .dn directives though. This was supported as long as
gas-preprocessor was used.

This means that VC1 decoding got a speed regression on clang 5.0,
unless the user manually chose using gas-preprocessor again.

By avoiding using the .dn register aliases, we can build the VC1 MC
assembly with the latest clang version.

Support for the .dn/.qn directives in clang/LLVM isn't actively planned,
see https://bugs.llvm.org/show_bug.cgi?id=18199.

This partially reverts 896a5bff.
Signed-off-by: Martin Storsjö <martin@martin.st>

d7320ca3

04 May, 2017 2 commits

hevc: Add NEON 32x32 IDCT · ce080f47

Alexandra Hájková authored 7 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

ce080f47

hevc: 16x16 NEON idct: Use the right element size for loads/stores · 118dd4a3

Alexandra Hájková authored 7 years ago

This doesn't change the actual behaviour of the code but improves
readability.
Signed-off-by: Martin Storsjö <martin@martin.st>

118dd4a3

01 May, 2017 1 commit

hevc: Add NEON add_residual for bitdepth 10 · edbf0fff

Alexandra Hájková authored 7 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

edbf0fff

28 Apr, 2017 1 commit

arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions · e1c2453a

Martin Storsjö authored 7 years ago

Before:              Cortex     A7      A8      A9     A53
hevc_add_res_8x8_8_neon:     116.0    58.7    80.2    90.7
hevc_add_res_32x32_8_neon:  1230.0   737.5  1187.5   974.4
After:
hevc_add_res_8x8_8_neon:      97.7    57.0    73.7    80.0
hevc_add_res_32x32_8_neon:  1216.0   698.7  1127.5   827.1
Signed-off-by: Martin Storsjö <martin@martin.st>

e1c2453a

27 Apr, 2017 1 commit

hevc: Add NEON add_residual for bitdepth 8 · 0d4d4351

Seppo Tomperi authored 7 years ago

Optimized by Alexandra Hájková.
Signed-off-by: Martin Storsjö <martin@martin.st>

0d4d4351

25 Apr, 2017 2 commits

hevc: Add support for bitdepth 10 for IDCT DC · 3d69dd65

Alexandra Hájková authored 7 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

3d69dd65

hevc: Add NEON IDCT DC functions for bitdepth 8 · 358adef0

Seppo Tomperi authored 7 years ago

Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
Signed-off-by: Martin Storsjö <martin@martin.st>

358adef0

12 Apr, 2017 1 commit

hevc: Add NEON 16x16 IDCT · 89d9869d

Alexandra Hájková authored 7 years ago

The speedup vs C code is around 6-13x.
Signed-off-by: Martin Storsjö <martin@martin.st>

89d9869d

06 Apr, 2017 1 commit

idct_arm: remove use of ff_put/add_pixels_clamped function pointer. · 40cbd686

Ronald S. Bultje authored 7 years ago

Instead, hardcode the use of the _arm implementation of add_pixels,
and use the C version for put_pixels (as no arm-optimized version
exists). Since there's separate implementations of idct{,_put,_add}
for neon, this has no practical impact on performance.

40cbd686

28 Mar, 2017 3 commits

vp9: split out generic decoding skeleton interface API from VP9 types. · 0c466417

Ronald S. Bultje authored 7 years ago

This allows vp9dsp.h to only include the VP9 types header, and not the
decoder skeleton interface which is for hardware decoders (dxva2/vaapi).

0c466417

vp9: re-split the decoder/format/dsp interface header files. · f8c01994

Ronald S. Bultje authored 7 years ago

The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.

f8c01994

arm: Always build the hevcdsp_init_arm.c file · fbc6f190

Martin Storsjö authored 7 years ago

The main hevcdsp.c file calls this init function if HAVE_ARM is set,
regardless of whether neon support is available or not.

This fixes builds where neon isn't supported by the build tools at all.
Signed-off-by: Martin Storsjö <martin@martin.st>

fbc6f190

27 Mar, 2017 2 commits

hevc: Add NEON 4x4 and 8x8 IDCT · 0b9a237b

Alexandra Hájková authored 7 years ago

Optimized by Martin Storsjö <martin@martin.st>.

The speedup vs C code is around 3.2-4.4x.
Signed-off-by: Martin Storsjö <martin@martin.st>

0b9a237b

lavc/vp9: split into vp9{block,data,mvs} · 1c9f4b50

Clément Bœsch authored 7 years ago

This is following Libav layout to ease merges.

1c9f4b50

20 Mar, 2017 1 commit

lavc/arm: fix indent in blockdsp_init_neon · b78243c5

Clément Bœsch authored 7 years ago

b78243c5

19 Mar, 2017 8 commits

arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible · eabc5abf

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
Signed-off-by: Martin Storsjö <martin@martin.st>

eabc5abf

arm: vp9itxfm16: Make the larger core transforms standalone functions · 0ea60320

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
17500 to 14516 bytes.

This gives a small slowdown of a few tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
Signed-off-by: Martin Storsjö <martin@martin.st>

0ea60320

arm: vp9itxfm16: Avoid reloading the idct32 coefficients · 32e273c1

Martin Storsjö authored 8 years ago

Keep the idct32 coefficients in narrow form in q6-q7, and idct16
coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
q0-q3 in the pass1 function, and squeeze the idct16 coefficients
into q0-q1 in the pass2 function to avoid reloading them.

The idct16 coefficients are clobbered and reloaded within idct32_odd
though, since that turns out to be faster than narrowing them and
swapping them into q6-q7.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.8  18268.4  19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37699.0  38665.2  32542.3  24472.2
After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22270.8  18159.3  19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37523.3  37731.6  32181.7  24071.2
Signed-off-by: Martin Storsjö <martin@martin.st>

32e273c1

arm: vp9itxfm16: Fix vertical alignment · c1619318

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

c1619318

arm: vp9itxfm16: Use the right lane size · b46d37e9

Martin Storsjö authored 8 years ago

This makes the code slightly clearer, but doesn't make any functional
difference.
Signed-off-by: Martin Storsjö <martin@martin.st>

b46d37e9

arm/aarch64: vp9: Fix vertical alignment · 21c89f3a

Martin Storsjö authored 8 years ago

Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit 7995ebfa.
Signed-off-by: Martin Storsjö <martin@martin.st>

21c89f3a

arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25

Martin Storsjö authored 8 years ago

In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit 3a0d5e20.
Signed-off-by: Martin Storsjö <martin@martin.st>

70317b25

arm: vp9itxfm: Template the quarter/half idct32 function · b7a565fe

Martin Storsjö authored 8 years ago

This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

If we are in the half case, we know we will need to do at least the
first three slices, so we only need to check eob for the fourth one
and can hardcode the value to check against instead of loading it
from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.

This is cherrypicked from libav commit 98ee855a.
Signed-off-by: Martin Storsjö <martin@martin.st>

b7a565fe