- 11 Jul, 2017 1 commit
-
Muhammad Faiz authored
It is redundant with costable: the first half of sintable is identical to the second half of costable, and the second half of sintable is the negated first half of sintable. The computation is changed to handle the sign of the sin values, in both the C code and the ARM assembly code.
Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
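Taking the two symmetries above at face value, a lookup helper might look like the following C sketch; the table layout assumed here (one full period in n entries, shared phase, exactly the relations stated in the message) is for illustration only and is not lifted from the actual rdft code:

    /* Hypothetical helper: serve sin lookups from costable alone,
     * assuming sintable[k] == costable[k + n/2] for k < n/2 and
     * sintable[k] == -sintable[k - n/2] for k >= n/2, as stated above. */
    static double sin_from_costable(const double *costable, int n, int k)
    {
        if (k < n / 2)
            return costable[k + n / 2]; /* first half: second half of costable */
        return -costable[k];            /* second half: negated first half     */
    }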
-
- 28 Jun, 2017 2 commits
-
Clément Bœsch authored
-
Clément Bœsch authored
The code originally pre-multiplied the steps by 2, causing the running sum of the h factors to drift away due to the lack of precision. It quickly causes an inaccuracy greater than 0.01. I tried diverse approaches, such as multiplying by 2.0 (instead of adding the value to itself), without success. I'm unable to bench the impact of this change; feel free to compare. This commit fixes the incoming aacpsdsp tests. The following is an alternative simplified function (matching the incoming AArch64 code) that may be used:

function ff_ps_stereo_interpolate_neon, export=1
        vld1.32         {q0}, [r2]
        vld1.32         {q1}, [r3]
        ldr             r12, [sp]
        vmov.f32        q8,  q0
        vmov.f32        q9,  q1
        vzip.32         q8,  q0
        vzip.32         q9,  q1
1:
        vld1.32         {d4}, [r0,:64]
        vld1.32         {d6}, [r1,:64]
        vadd.f32        q8,  q8,  q9
        vadd.f32        q0,  q0,  q1
        vmov.f32        d5,  d4
        vmov.f32        d7,  d6
        vmul.f32        q2,  q2,  q8
        vmla.f32        q2,  q3,  q0
        vst1.32         {d4}, [r0,:64]!
        vst1.32         {d5}, [r1,:64]!
        subs            r12, r12, #1
        bgt             1b
        bx              lr
endfunc
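For reference, the accurate scheme the fix restores looks like this in C; it is modelled on the scalar reference for ff_ps_stereo_interpolate but simplified from memory, so treat the exact parameter layout as illustrative. The key point is that the unscaled step is added once per sample, rather than a pre-doubled step every other sample:

    /* Sketch of the running-sum interpolation; h holds the four initial
     * factors and h_step their per-sample increments. Adding the step
     * once per sample keeps the accumulated factors accurate, where a
     * pre-doubled step let float rounding drift past 0.01. */
    static void stereo_interpolate_ref(float (*l)[2], float (*r)[2],
                                       float h[2][4], float h_step[2][4],
                                       int len)
    {
        float h0 = h[0][0], h1 = h[0][1], h2 = h[0][2], h3 = h[0][3];
        float s0 = h_step[0][0], s1 = h_step[0][1],
              s2 = h_step[0][2], s3 = h_step[0][3];

        for (int n = 0; n < len; n++) {
            float l_re = l[n][0], l_im = l[n][1];
            float r_re = r[n][0], r_im = r[n][1];
            h0 += s0; h1 += s1; h2 += s2; h3 += s3; /* one step per sample */
            l[n][0] = h0 * l_re + h2 * r_re;
            l[n][1] = h0 * l_im + h2 * r_im;
            r[n][0] = h1 * l_re + h3 * r_re;
            r[n][1] = h1 * l_im + h3 * r_im;
        }
    }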
-
- 15 May, 2017 1 commit
-
Martin Storsjö authored
Clang (in the upcoming 5.0 version) is now capable of building our arm assembly without relying on gas-preprocessor, although clang/LLVM doesn't support .dn register aliases. The VC1 MC assembly was only built and used if the chosen assembler supported the .dn directives, though, which was the case as long as gas-preprocessor was used. This means that VC1 decoding got a speed regression on clang 5.0, unless the user manually chose to use gas-preprocessor again. By avoiding the .dn register aliases, we can build the VC1 MC assembly with the latest clang version. Support for the .dn/.qn directives in clang/LLVM isn't actively planned; see https://bugs.llvm.org/show_bug.cgi?id=18199. This partially reverts 896a5bff.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 04 May, 2017 2 commits
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Alexandra Hájková authored
This doesn't change the actual behaviour of the code but improves readability.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 01 May, 2017 1 commit
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 28 Apr, 2017 1 commit
-
Martin Storsjö authored
Before:                       Cortex A7      A8      A9     A53
hevc_add_res_8x8_8_neon:          116.0    58.7    80.2    90.7
hevc_add_res_32x32_8_neon:       1230.0   737.5  1187.5   974.4

After:
hevc_add_res_8x8_8_neon:           97.7    57.0    73.7    80.0
hevc_add_res_32x32_8_neon:       1216.0   698.7  1127.5   827.1

Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 27 Apr, 2017 1 commit
-
Seppo Tomperi authored
Optimized by Alexandra Hájková.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 25 Apr, 2017 2 commits
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Seppo Tomperi authored
Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 12 Apr, 2017 1 commit
-
Alexandra Hájková authored
The speedup vs C code is around 6-13x.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 06 Apr, 2017 1 commit
-
Ronald S. Bultje authored
Instead, hardcode the use of the _arm implementation of add_pixels, and use the C version for put_pixels (as no arm-optimized version exists). Since there are separate implementations of idct{,_put,_add} for neon, this has no practical impact on performance.
-
- 28 Mar, 2017 3 commits
-
Ronald S. Bultje authored
This allows vp9dsp.h to only include the VP9 types header, and not the decoder skeleton interface which is for hardware decoders (dxva2/vaapi).
-
Ronald S. Bultje authored
The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.
-
Martin Storsjö authored
The main hevcdsp.c file calls this init function if HAVE_ARM is set, regardless of whether neon support is available or not. This fixes builds where neon isn't supported by the build tools at all.
Signed-off-by: Martin Storsjö <martin@martin.st>
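A minimal sketch of the pattern being fixed, in plain C (the type and function names are stand-ins, not FFmpeg's exact definitions): the arm init must always exist, since the generic code calls it whenever HAVE_ARM is set, and it only forwards to the NEON init when the toolchain could assemble NEON and the CPU supports it.

    /* Stand-in context; the real HEVCDSPContext carries function pointers. */
    typedef struct HEVCDSPContext HEVCDSPContext;

    void hevc_dsp_init_neon(HEVCDSPContext *c, int bit_depth); /* from the asm build */
    int  get_cpu_flags(void);                                  /* runtime detection  */
    #define FLAG_NEON (1 << 5)                                 /* illustrative value */

    void hevc_dsp_init_arm(HEVCDSPContext *c, int bit_depth)
    {
    #if HAVE_NEON                        /* build-time: the assembler supports NEON */
        if (get_cpu_flags() & FLAG_NEON) /* run-time: the CPU actually has NEON */
            hevc_dsp_init_neon(c, bit_depth);
    #endif
    }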
-
- 27 Mar, 2017 2 commits
-
Alexandra Hájková authored
Optimized by Martin Storsjö <martin@martin.st>. The speedup vs C code is around 3.2-4.4x.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Clément Bœsch authored
This follows the Libav layout to ease merges.
-
- 20 Mar, 2017 1 commit
-
Clément Bœsch authored
-
- 19 Mar, 2017 8 commits
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14516 bytes to 22484 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37438.1  39103.4  32217.8  24067.6

Signed-off-by: Martin Storsjö <martin@martin.st>
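The first-pass shortcut can be illustrated in plain C (a generic sketch, not the structure of the NEON code): the eob value bounds how many input column slices can hold nonzero coefficients, so only those are transformed, and a second pass that tracks the same bound never reads the untouched part of the temp buffer.

    #include <stdint.h>

    /* Generic partial first pass: transform only the `nonzero` leading
     * column slices into tmp; columns [nonzero, size) are never written,
     * and the matching partial second pass never reads them, so no
     * zero-filling is needed. idct_col is a stand-in column transform. */
    static void idct_first_pass_partial(int16_t *tmp, const int16_t *in,
                                        int size, int nonzero,
                                        void (*idct_col)(int16_t *out,
                                                         const int16_t *in,
                                                         int stride))
    {
        for (int i = 0; i < nonzero; i++)
            idct_col(tmp + i, in + i, size);
    }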
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes. This gives a small slowdown of a couple of tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37593.2  38862.2  32235.8  24070.9

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22653.8  18268.4  19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37699.0  38665.2  32542.3  24472.2

After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22270.8  18159.3  19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37523.3  37731.6  32181.7  24071.2

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This makes the code slightly clearer, but doesn't make any functional difference.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfa.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e20.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we will at least need to do the first three slices; we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. This is cherrypicked from libav commit 98ee855a.
Signed-off-by: Martin Storsjö <martin@martin.st>
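In C terms, the half-case check described above reduces to a single compare against a known constant instead of a table load. A hedged sketch follows; the threshold is passed in here rather than hardcoded, since the actual value lives in the asm:

    /* Half case: the first three slices are always computed; only the
     * fourth slice's eob threshold matters, so one hardcoded compare
     * replaces indexing into the min_eob array. */
    static int slices_for_half_case(int eob, int fourth_slice_min_eob)
    {
        return eob < fourth_slice_min_eob ? 3 : 4; /* at most one slice skipped */
    }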
-
- 16 Mar, 2017 1 commit
-
Martin Storsjö authored
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 11 Mar, 2017 12 commits
-
Martin Storsjö authored
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we will at least need to do the first three slices; we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit 08074c09.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit de06bdfe.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16. While keeping these coefficients in registers, we still can skip pushing q7.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:       18553.8  17182.7  14303.3  12089.7

After:
vp9_inv_dct_dct_32x32_sub32_add_neon:       18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit 402546a1.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though.

Before:                                   Cortex A7       A8       A9      A53
vp9_loop_filter_mix2_v_44_16_neon:            289.7    256.2    237.5    181.2

After:
vp9_loop_filter_mix2_v_44_16_neon:            221.2    150.5    177.7    138.0

This is cherrypicked from libav commit 575e31e9.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
The theoretical maximum value of E is 193, so we can just saturate the addition to 255.

Before:                         Cortex A7     A8     A9    A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:         143.0  127.7  114.8   88.0         87.7
vp9_loop_filter_v_8_8_neon:         241.0  197.2  173.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        497.0  419.5  379.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       965.2  818.7  731.4  579.0        452.0

After:
vp9_loop_filter_v_4_8_neon:         136.0  125.7  112.6   84.0         83.0
vp9_loop_filter_v_8_8_neon:         234.0  195.5  171.5  136.0        133.7
vp9_loop_filter_v_16_8_neon:        490.0  417.5  377.7  289.0        271.0
vp9_loop_filter_v_16_16_neon:       951.2  814.7  732.3  571.0        446.7

This is cherrypicked from libav commit c582cb85.
Signed-off-by: Martin Storsjö <martin@martin.st>
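The trick maps directly onto NEON's unsigned saturating add (vqadd.u8). A plain-C equivalent of the idea, as a hedged sketch rather than the filter code itself:

    #include <stdint.h>

    /* Since E never exceeds 193, the addition in the threshold
     * computation either fits in a byte or clamps to 255, and the
     * clamped value compares the same way - so an 8-bit saturating
     * add can replace a widening add. */
    static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
    {
        unsigned sum = (unsigned)a + b;
        return sum > 255 ? 255 : (uint8_t)sum;
    }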
-
Martin Storsjö authored
This adds lots of extra .ifs, but speeds it up by a couple of cycles, by avoiding stalls. This is cherrypicked from libav commit e18c3900.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This is cherrypicked from libav commit 435cd7bc.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see a gain in all cases.

Before:                         Cortex A7     A8     A9    A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:         147.0  129.0  115.8   89.0         88.7
vp9_loop_filter_v_8_8_neon:         242.0  198.5  174.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        500.0  419.5  382.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       971.2  825.5  731.5  579.0        453.0

After:
vp9_loop_filter_v_4_8_neon:         143.0  127.7  114.8   88.0         87.7
vp9_loop_filter_v_8_8_neon:         241.0  197.2  173.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        497.0  419.5  379.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       965.2  818.7  731.4  579.0        452.0

This is cherrypicked from libav commit e1f9de86.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:          273.0    189.5    211.7    235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:          752.0    459.2    862.2    553.9

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:          226.5    145.0    225.1    171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:          721.2    415.7    727.6    475.0

This is cherrypicked from libav commit a76bf8cf.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Before:                                   Cortex A7       A8       A9      A53
vp9_put_8tap_smooth_4h_neon:                  378.1    273.2    340.7    229.5

After:
vp9_put_8tap_smooth_4h_neon:                  352.1    222.2    290.5    229.5

This is cherrypicked from libav commit fea92a4b.
Signed-off-by: Martin Storsjö <martin@martin.st>
-