1. 18 Oct, 2017 1 commit
  2. 20 Jun, 2017 1 commit
  3. 16 Mar, 2017 1 commit
  4. 11 Mar, 2017 1 commit
  5. 23 Feb, 2017 5 commits
    • aarch64: vp9itxfm: Reorder iadst16 coeffs · b8f66c08
      Martin Storsjö authored
      This matches the order they are in the 16 bpp version.
      
      They are in this order in the 16 bpp version to make sure they are
      accessed in the same order they are declared, which eases loading
      only half of the coefficients at a time.
      
      This makes the 8 bpp version match the 16 bpp version better.
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · 09eb88a1
      Martin Storsjö authored
      All elements are used pairwise, except for the first one.
      Previously, the 16th element was unused. Move the unused element
      to the second slot, so that the later element pairs are not split
      across registers.
      
      This simplifies loading only parts of the coefficients,
      reducing the difference to the 16 bpp version.
      Signed-off-by: Martin Storsjö <martin@martin.st>
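A minimal scalar sketch of the reordering idea described above. The layout and the zero filler are illustrative, not the actual idct coefficient table: the point is that parking the unused slot at index 1 keeps every used pair on an even/odd boundary, so no pair straddles two register halves.

```c
/* Hypothetical model of the coefficient reordering: with 16 slots and
 * one unused entry, placing the unused value in slot 1 (instead of
 * slot 15) keeps every used pair aligned to an even index, so a pair
 * never straddles two 8-element register halves. */
static void reorder_coeffs(const short in[16], short out[16])
{
    out[0] = in[0];          /* the only coefficient used on its own */
    out[1] = 0;              /* unused slot, moved here from the end */
    for (int i = 2; i < 16; i++)
        out[i] = in[i - 1];  /* remaining pairs shift up by one slot */
}
```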
    • aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 65aa002d
      Martin Storsjö authored
      The idct32x32 function actually pushed d8-d15 onto the stack even
      though it didn't clobber them; there are plenty of registers that
      can be used to allow keeping all the idct coefficients in registers
      without having to reload different subsets of them at different
      stages in the transform.
      
      After this, we can still skip pushing d12-d15.
      
      Before:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
      After:
      vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
      Signed-off-by: Martin Storsjö <martin@martin.st>
    • aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 · 3bf9c483
      Martin Storsjö authored
      This is one cycle faster in total, and three instructions fewer.
      
      Before:
      vp9_loop_filter_mix2_v_44_16_neon: 123.2
      After:
      vp9_loop_filter_mix2_v_44_16_neon: 122.2
      Signed-off-by: Martin Storsjö <martin@martin.st>
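Scalar models of the two AArch64 NEON operations the new sequence relies on, to make the instruction swap concrete. These model only the byte-level semantics of rev16 and uzp1; the exact register usage in the vp9lpf code is not reproduced here.

```c
#include <stdint.h>

/* rev16: reverse the bytes within each 16-bit halfword of a vector. */
static void rev16_bytes(const uint8_t *in, uint8_t *out, int n)
{
    for (int i = 0; i < n; i += 2) {
        out[i]     = in[i + 1];
        out[i + 1] = in[i];
    }
}

/* uzp1: concatenate two vectors and keep the even-indexed elements. */
static void uzp1_bytes(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]         = a[2 * i];
        out[n / 2 + i] = b[2 * i];
    }
}
```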
    • arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit · c582cb85
      Martin Storsjö authored
      The theoretical maximum value of E is 193, so we can just
      saturate the addition to 255.
      
      Before:                     Cortex A7      A8      A9     A53  A53/AArch64
      vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
      vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
      vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
      vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
      After:
      vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
      vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
      vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
      vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
      Signed-off-by: Martin Storsjö <martin@martin.st>
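A scalar sketch of why the saturation is safe, assuming the usual VP9 edge-threshold test (|p0-q0|*2 + |p1-q1|/2 compared against E). Since E is at most 193, a sum that saturates to 255 is guaranteed to exceed E, so an unsigned saturating 8-bit add (uqadd on NEON) gives the same comparison result as widening to 16 bit:

```c
#include <stdint.h>

/* Model of uqadd on an 8-bit lane: unsigned saturating addition. */
static uint8_t uqadd_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Edge threshold test: |p0 - q0| * 2 + |p1 - q1| / 2 <= E.
 * Because E <= 193, saturating to 255 cannot turn a "greater than E"
 * sum into a "less than or equal to E" one, so 8 bits suffice. */
static int cmp_E(uint8_t p0, uint8_t q0, uint8_t p1, uint8_t q1, uint8_t E)
{
    uint8_t d0  = p0 > q0 ? p0 - q0 : q0 - p0;
    uint8_t d1  = p1 > q1 ? p1 - q1 : q1 - p1;
    uint8_t sum = uqadd_u8(uqadd_u8(d0, d0), d1 / 2);
    return sum <= E;
}
```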
  6. 12 Feb, 2017 1 commit
  7. 11 Feb, 2017 1 commit
  8. 10 Feb, 2017 4 commits
  9. 09 Feb, 2017 8 commits
  10. 05 Feb, 2017 1 commit
  11. 03 Jan, 2017 2 commits
  12. 19 Dec, 2016 1 commit
  13. 14 Dec, 2016 1 commit
  14. 30 Nov, 2016 1 commit
    • aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · cad42fad
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      Previously all subpartitions except the eob=1 (DC) case ran with
      the same runtime:
      
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0
      
      By skipping individual 8x16 or 8x32 pixel slices in the first pass,
      we reduce the runtime of these functions like this:
      
      vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
      vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
      vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
      vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
      vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
      vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
      vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
      vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
      vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
      vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
      vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
      vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
      vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
      vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8
      
      That is, there is in general a very minor overhead in the full
      subpartition case due to the additional cmps, but a significant
      speedup in the cases where only a small part of the actual input
      data needs to be processed.
      Signed-off-by: Martin Storsjö <martin@martin.st>
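A self-contained sketch of the skipping idea: an 8-wide slice of the first pass only needs to run if it contains any nonzero coefficients. The real asm derives this from eob and the scan order with a few cmps instead of rescanning the block; rescanning here just keeps the example standalone.

```c
#include <stdint.h>

/* Returns a bitmask with bit i set if the i-th 8-wide column slice
 * contains nonzero coefficients and must be processed in the first
 * pass; clear bits mark slices that can be skipped entirely. */
static unsigned slices_to_process(const int16_t *coeffs, int size)
{
    unsigned mask = 0;
    for (int col = 0; col < size; col += 8)
        for (int x = col; x < col + 8; x++)
            for (int y = 0; y < size; y++)
                if (coeffs[y * size + x] != 0)
                    mask |= 1u << (col / 8);
    return mask;
}
```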
  15. 24 Nov, 2016 1 commit
  16. 23 Nov, 2016 1 commit
  17. 18 Nov, 2016 1 commit
  18. 16 Nov, 2016 2 commits
  19. 14 Nov, 2016 1 commit
  20. 13 Nov, 2016 2 commits
    • aarch64: vp9: Implement NEON loop filters · 9d2afd1e
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the loop filters on
      16 pixels at a time. The implementation is fully templated, with
      a single macro that can generate versions for both 8 and 16 pixel
      widths, for the 4, 8 and 16 pixel loop filters (and the 4/8 mixed
      versions as well).
      
      For the 8 pixel wide versions, the speed is pretty close to the
      ARM version (the v_4_8 and v_8_8 filters are the best examples of
      this; the h_4_8 and h_8_8 filters seem to gain some in the
      load/transpose/store part). For the 16 pixel wide ones, we get a
      speedup of around 1.2-1.4x compared to the 32 bit version.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
      vp9_loop_filter_h_4_8_neon:          144.0   127.2
      vp9_loop_filter_h_8_8_neon:          207.0   182.5
      vp9_loop_filter_h_16_8_neon:         415.0   328.7
      vp9_loop_filter_h_16_16_neon:        672.0   558.6
      vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
      vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
      vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
      vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
      vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
      vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
      vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
      vp9_loop_filter_v_4_8_neon:           89.0    88.7
      vp9_loop_filter_v_8_8_neon:          141.0   137.7
      vp9_loop_filter_v_16_8_neon:         295.0   272.7
      vp9_loop_filter_v_16_16_neon:        546.0   453.7
      
      The speedup vs C code in checkasm tests is around 2-7x, pretty
      much the same as for the 32 bit version. Even though these
      functions are faster than their 32 bit equivalents, the C version
      they are compared against also became around 1.3-1.7x faster than
      the 32 bit C version.
      
      Based on START_TIMER/STOP_TIMER wrapping around a few individual
      functions, the speedup vs C code is around 4-5x.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                               A57 gcc-5.3  neon
      loop_filter_h_4_8_neon:        256.6  93.4
      loop_filter_h_8_8_neon:        307.3 139.1
      loop_filter_h_16_8_neon:       340.1 254.1
      loop_filter_h_16_16_neon:      827.0 407.9
      loop_filter_mix2_h_44_16_neon: 524.5 155.4
      loop_filter_mix2_h_48_16_neon: 644.5 173.3
      loop_filter_mix2_h_84_16_neon: 630.5 222.0
      loop_filter_mix2_h_88_16_neon: 697.3 222.0
      loop_filter_mix2_v_44_16_neon: 598.5 100.6
      loop_filter_mix2_v_48_16_neon: 651.5 127.0
      loop_filter_mix2_v_84_16_neon: 591.5 167.1
      loop_filter_mix2_v_88_16_neon: 855.1 166.7
      loop_filter_v_4_8_neon:        271.7  65.3
      loop_filter_v_8_8_neon:        312.5 106.9
      loop_filter_v_16_8_neon:       473.3 206.5
      loop_filter_v_16_16_neon:      976.1 327.8
      
      The speedup compared to the C functions is 2.5-6x, and the
      Cortex-A57 is again 30-50% faster than the Cortex-A53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
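The single-macro templating convention can be rendered in C roughly like this: one body, instantiated once per filter width, the way the asm expands the 8- and 16-pixel-wide variants from one source. The body below is a placeholder (a plain sum over `wd` pixels), not an actual loop filter; only the instantiation pattern is the point.

```c
#include <stdint.h>

/* One macro body generates a function per width; the asm version
 * similarly generates the 4/8/16 and mixed 4/8 filter variants for
 * both 8 and 16 pixel widths from a single templated source. */
#define DEFINE_LOOP_FILTER(wd)                                    \
    static int loop_filter_##wd(const uint8_t *px)                \
    {                                                             \
        int sum = 0;                                              \
        for (int i = 0; i < (wd); i++) /* wd pixels at a time */  \
            sum += px[i];                                         \
        return sum;                                               \
    }

DEFINE_LOOP_FILTER(8)
DEFINE_LOOP_FILTER(16)
```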
    • aarch64: vp9: Add NEON itxfm routines · 3c9546df
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; thanks to the larger
      number of registers available, we can do the 16x16 and 32x32
      transforms in slices 8 pixels wide instead of 4. This gives
      a speedup of around 1.4x compared to the 32 bit version.
      
      The fact that aarch64 doesn't have the same d/q register
      aliasing makes some of the macros quite a bit simpler as well.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM  AArch64
      vp9_inv_adst_adst_4x4_add_neon:       90.0     87.7
      vp9_inv_adst_adst_8x8_add_neon:      400.0    354.7
      vp9_inv_adst_adst_16x16_add_neon:   2526.5   1827.2
      vp9_inv_dct_dct_4x4_add_neon:         74.0     72.7
      vp9_inv_dct_dct_8x8_add_neon:        271.0    256.7
      vp9_inv_dct_dct_16x16_add_neon:     1960.7   1372.7
      vp9_inv_dct_dct_32x32_add_neon:    11988.9   8088.3
      vp9_inv_wht_wht_4x4_add_neon:         63.0     57.7
      
      The speedup vs C code (2-4x) is smaller than in the 32 bit case,
      mostly because the C code ends up significantly faster (around
      1.6x faster, with GCC 5.4) when built for aarch64.
      
      Examples of runtimes vs C on a Cortex A57 (for a slightly older version
      of the patch):
                                      A57 gcc-5.3   neon
      vp9_inv_adst_adst_4x4_add_neon:       152.2   60.0
      vp9_inv_adst_adst_8x8_add_neon:       948.2  288.0
      vp9_inv_adst_adst_16x16_add_neon:    4830.4 1380.5
      vp9_inv_dct_dct_4x4_add_neon:         153.0   58.6
      vp9_inv_dct_dct_8x8_add_neon:         789.2  180.2
      vp9_inv_dct_dct_16x16_add_neon:      3639.6  917.1
      vp9_inv_dct_dct_32x32_add_neon:     20462.1 4985.0
      vp9_inv_wht_wht_4x4_add_neon:          91.0   49.8
      
      The asm is around 3-4x faster than C on the Cortex-A57, and the
      asm on the A57 is around 30-50% faster than on the A53.
      Signed-off-by: Martin Storsjö <martin@martin.st>
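A structural sketch of the two-pass, sliced transform layout: the first pass runs the 1-D transform down each 8-wide column slice (transposing into a temporary buffer), the second pass runs it across each 8-tall row slice. The 1-D stage here is an identity placeholder, not a real IDCT; only the slicing structure mirrors the asm.

```c
#include <stdint.h>

/* Two-pass NxN transform processed in 8-wide slices (N up to 32).
 * With an identity 1-D stage, transpose + transpose restores the
 * input; in the real code each pass applies the idct to its slice. */
static void transform_2d_sliced(int16_t *buf, int size)
{
    int16_t tmp[32 * 32];
    /* first pass: column slices, 8 at a time, transposed into tmp */
    for (int col = 0; col < size; col += 8)
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < size; y++)
                tmp[(col + x) * size + y] = buf[y * size + col + x];
    /* second pass: row slices of tmp, 8 at a time, transposed back */
    for (int row = 0; row < size; row += 8)
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < size; y++)
                buf[y * size + row + x] = tmp[(row + x) * size + y];
}
```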
  21. 10 Nov, 2016 2 commits
    • aarch64: vp9: Add NEON optimizations of VP9 MC functions · 383d96aa
      Martin Storsjö authored
      This work is sponsored by, and copyright, Google.
      
      These are ported from the ARM version; it is essentially a 1:1
      port with no extra added features, but with some hand tuning
      (especially for the plain copy/avg functions). The ARM version
      isn't very register starved to begin with, so there's not much
      to be gained from having more spare registers here; we only
      avoid having to clobber callee-saved registers.
      
      Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM   AArch64
      vp9_avg4_neon:                      27.2      23.7
      vp9_avg8_neon:                      56.5      54.7
      vp9_avg16_neon:                    169.9     167.4
      vp9_avg32_neon:                    585.8     585.2
      vp9_avg64_neon:                   2460.3    2294.7
      vp9_avg_8tap_smooth_4h_neon:       132.7     125.2
      vp9_avg_8tap_smooth_4hv_neon:      478.8     442.0
      vp9_avg_8tap_smooth_4v_neon:       126.0      93.7
      vp9_avg_8tap_smooth_8h_neon:       241.7     234.2
      vp9_avg_8tap_smooth_8hv_neon:      690.9     646.5
      vp9_avg_8tap_smooth_8v_neon:       245.0     205.5
      vp9_avg_8tap_smooth_64h_neon:    11273.2   11280.1
      vp9_avg_8tap_smooth_64hv_neon:   22980.6   22184.1
      vp9_avg_8tap_smooth_64v_neon:    11549.7   10781.1
      vp9_put4_neon:                      18.0      17.2
      vp9_put8_neon:                      40.2      37.7
      vp9_put16_neon:                     97.4      99.5
      vp9_put32_neon/armv8:              346.0     307.4
      vp9_put64_neon/armv8:             1319.0    1107.5
      vp9_put_8tap_smooth_4h_neon:       126.7     118.2
      vp9_put_8tap_smooth_4hv_neon:      465.7     434.0
      vp9_put_8tap_smooth_4v_neon:       113.0      86.5
      vp9_put_8tap_smooth_8h_neon:       229.7     221.6
      vp9_put_8tap_smooth_8hv_neon:      658.9     621.3
      vp9_put_8tap_smooth_8v_neon:       215.0     187.5
      vp9_put_8tap_smooth_64h_neon:    10636.7   10627.8
      vp9_put_8tap_smooth_64hv_neon:   21076.8   21026.9
      vp9_put_8tap_smooth_64v_neon:     9635.0    9632.4
      
      These are generally about as fast as the corresponding ARM
      routines on the same CPU (at least on the A53), in most cases
      marginally faster.
      
      The speedup vs C code is pretty much the same as for the 32 bit
      case; on the A53 it's around 6-13x for the larger 8tap filters.
      The exact speedup varies a little, since the C versions generally
      don't end up exactly as slow/fast as on 32 bit.
      Signed-off-by: Martin Storsjö <martin@martin.st>
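A scalar model of the put/avg distinction mentioned above, assuming the standard VP9 behavior: the put functions store the filtered prediction, while the avg functions round-average it with what is already in the destination (urhadd on NEON).

```c
#include <stdint.h>

/* One pixel of the avg variant: rounding average of the existing
 * destination pixel and the new prediction, as urhadd computes. */
static uint8_t avg_px(uint8_t dst, uint8_t src)
{
    return (uint8_t)((dst + src + 1) >> 1);
}
```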
  22. 09 Nov, 2016 1 commit