Commits · 755ad01dd1dffc6209a9f71641e1c4169bb7691a · Linshizhi / ffmpeg.wasm-core

21 Jun, 2017 1 commit

aarch64: vp9: Fix assembling with Xcode 6.2 and older · 998609dd

Memphiz authored 7 years ago

Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).

This is cherrypicked from libav commit
a970f9de.
Signed-off-by: Martin Storsjö <martin@martin.st>

998609dd
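
The register-list part of this fix can be illustrated with a small helper. This is a sketch in Python, not the actual patch: `expand_register_range` is a hypothetical name, and the `ld1` operand is an illustrative example rather than a line taken from the commit. It rewrites a range such as `{v16.8h-v19.8h}` into the explicit list form that, per the commit message, older clang versions require.

```python
import re

# Matches a NEON register range such as "{v16.8h-v19.8h}".
_RANGE = re.compile(r"\{v(\d+)\.(\w+)-v(\d+)\.(\w+)\}")


def expand_register_range(operand):
    """Rewrite '{vA.T-vB.T}' as an explicit comma-separated register list."""
    def _expand(match):
        first, arrangement, last, last_arrangement = match.groups()
        if arrangement != last_arrangement:
            raise ValueError("range endpoints use different arrangements")
        if int(last) < int(first):
            raise ValueError("this sketch only handles ascending ranges")
        regs = [f"v{n}.{arrangement}" for n in range(int(first), int(last) + 1)]
        return "{" + ", ".join(regs) + "}"

    return _RANGE.sub(_expand, operand)


# Illustrative operand, not copied from the patch.
print(expand_register_range("ld1 {v16.8h-v19.8h}, [x0]"))
```

Operands that already use the explicit list form pass through unchanged, since the pattern only matches the range syntax.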

20 Jun, 2017 1 commit

aarch64: vp9: Fix assembling with Xcode 6.2 and older · a970f9de

Memphiz authored 7 years ago

Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).
Signed-off-by: Martin Storsjö <martin@martin.st>

a970f9de

19 Mar, 2017 2 commits

arm/aarch64: vp9: Fix vertical alignment · 21c89f3a

Martin Storsjö authored 8 years ago

Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfa.
Signed-off-by: Martin Storsjö <martin@martin.st>

21c89f3a

arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25

Martin Storsjö authored 7 years ago

In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e20.
Signed-off-by: Martin Storsjö <martin@martin.st>

70317b25

16 Mar, 2017 1 commit

arm/aarch64: vp9: Fix vertical alignment · 7995ebfa

Martin Storsjö authored 8 years ago

Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.
Signed-off-by: Martin Storsjö <martin@martin.st>

7995ebfa

11 Mar, 2017 14 commits

arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 3a0d5e20

Martin Storsjö authored 7 years ago

In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.
Signed-off-by: Martin Storsjö <martin@martin.st>

3a0d5e20

aarch64: vp9itxfm: Reorder iadst16 coeffs · 26ee83ac

Martin Storsjö authored 8 years ago

This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
b8f66c08.
Signed-off-by: Martin Storsjö <martin@martin.st>

26ee83ac

aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · f9522730

Martin Storsjö authored 8 years ago

All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
09eb88a1.
Signed-off-by: Martin Storsjö <martin@martin.st>

f9522730

aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 2905657b

Martin Storsjö authored 8 years ago

The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

This is cherrypicked from libav commit
65aa002d.
Signed-off-by: Martin Storsjö <martin@martin.st>

2905657b

aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · 148cc0bb

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Before:                           Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3

This is cherrypicked from libav commit
3fcf788f.
Signed-off-by: Martin Storsjö <martin@martin.st>

148cc0bb

aarch64: vp9itxfm: Fix incorrect vertical alignment · 16ef0007

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
0c0b87f1.
Signed-off-by: Martin Storsjö <martin@martin.st>

16ef0007

aarch64: vp9itxfm: Update a comment to refer to a register with a different name · d0fbf7f3

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
8476eb0d.
Signed-off-by: Martin Storsjö <martin@martin.st>

d0fbf7f3

aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability · 6752318c

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
3dd78272.
Signed-off-by: Martin Storsjö <martin@martin.st>

6752318c

aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible · 19a0f952

Martin Storsjö authored 8 years ago

The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.

This is cherrypicked from libav commit
ed8d2933.
Signed-off-by: Martin Storsjö <martin@martin.st>

19a0f952
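
The difference between the two load forms can be modelled in a few lines. This is a sketch in Python with vector registers represented as lists; `ld1r` and `ld1_lane` are hypothetical helper names, and the memory contents are illustrative. `ld1r` replicates the loaded element into every lane, while a single-lane `ld1` writes one lane and leaves the rest of the register untouched, which is enough when only that lane is read afterwards.

```python
def ld1r(memory, address, lanes):
    """Model of ld1r: load one element and replicate it to every lane."""
    return [memory[address]] * lanes


def ld1_lane(register, lane, memory, address):
    """Model of single-lane ld1: overwrite one lane, keep the others."""
    result = list(register)
    result[lane] = memory[address]
    return result


memory = {0: 11585}   # illustrative value stored at address 0
v0 = [1, 2, 3, 4]     # previous register contents, four lanes

replicated = ld1r(memory, 0, lanes=4)
single = ld1_lane(v0, 0, memory, 0)
print(replicated)     # every lane holds the loaded value
print(single)         # only lane 0 changed
```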

aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function · 3006e525

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
4da4b2b8.
Signed-off-by: Martin Storsjö <martin@martin.st>

3006e525

aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 · 9532a7d4

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2

This is cherrypicked from libav commit
a63da451.
Signed-off-by: Martin Storsjö <martin@martin.st>

9532a7d4
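
To put the tables above in relative terms, the following Python sketch computes the speedup for a subset of the rows and the growth in code size. The cycle counts and byte sizes are copied from the commit message; the dictionary keys are shortened names chosen for this example.

```python
# Cycle counts copied from the Before/After tables of this commit message.
before = {
    "16x16_sub2": 1051.0,
    "16x16_sub4": 1051.0,
    "16x16_sub8": 1051.0,
    "32x32_sub2": 5198.5,
    "32x32_sub8": 5196.3,
    "32x32_sub16": 6174.3,
}
after = {
    "16x16_sub2": 640.8,
    "16x16_sub4": 639.0,
    "16x16_sub8": 842.0,
    "32x32_sub2": 3685.5,
    "32x32_sub8": 3684.4,
    "32x32_sub16": 5315.4,
}


def speedup(name):
    """Ratio of old to new cycle count; values above 1.0 mean faster."""
    return before[name] / after[name]


for name in before:
    print(f"{name}: {speedup(name):.2f}x")

# Code size quoted in the commit message, in bytes.
size_growth = 24292 / 14740
print(f"code size: {size_growth:.2f}x")
```

The rows left out of the sketch (sub1, the full-size cases, and the larger 32x32 subpartitions) change by well under one percent in the tables.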

aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · a681c793

Martin Storsjö authored 8 years ago

This allows reusing the macro for a separate implementation of the
pass2 function.

This is cherrypicked from libav commit
79d332eb.
Signed-off-by: Martin Storsjö <martin@martin.st>

a681c793

aarch64: vp9itxfm: Make the larger core transforms standalone functions · dc47bf38

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

This is cherrypicked from libav commit
11547601.
Signed-off-by: Martin Storsjö <martin@martin.st>

dc47bf38

aarch64: vp9itxfm: Restructure the idct32 store macros · 52c7366c

Martin Storsjö authored 8 years ago

This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

This is also arguably more readable.

This is cherrypicked from libav commit
58d87e0f.
Signed-off-by: Martin Storsjö <martin@martin.st>

52c7366c

23 Feb, 2017 3 commits

aarch64: vp9itxfm: Reorder iadst16 coeffs · b8f66c08

Martin Storsjö authored 8 years ago

This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.
Signed-off-by: Martin Storsjö <martin@martin.st>

b8f66c08

aarch64: vp9itxfm: Reorder the idct coefficients for better pairing · 09eb88a1

Martin Storsjö authored 8 years ago

All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
Signed-off-by: Martin Storsjö <martin@martin.st>

09eb88a1

aarch64: vp9itxfm: Avoid reloading the idct32 coefficients · 65aa002d

Martin Storsjö authored 8 years ago

The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
Signed-off-by: Martin Storsjö <martin@martin.st>

65aa002d

10 Feb, 2017 1 commit

aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · 3fcf788f

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Before:                           Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3
Signed-off-by: Martin Storsjö <martin@martin.st>

3fcf788f

09 Feb, 2017 8 commits

aarch64: vp9itxfm: Fix incorrect vertical alignment · 0c0b87f1

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

0c0b87f1

aarch64: vp9itxfm: Update a comment to refer to a register with a different name · 8476eb0d

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

8476eb0d

aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability · 3dd78272

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

3dd78272

aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible · ed8d2933

Martin Storsjö authored 8 years ago

The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.
Signed-off-by: Martin Storsjö <martin@martin.st>

ed8d2933

aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function · 4da4b2b8

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

4da4b2b8

aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 · a63da451

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2
Signed-off-by: Martin Storsjö <martin@martin.st>

a63da451

aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 79d332eb

Martin Storsjö authored 8 years ago

This allows reusing the macro for a separate implementation of the
pass2 function.
Signed-off-by: Martin Storsjö <martin@martin.st>

79d332eb

aarch64: vp9itxfm: Make the larger core transforms standalone functions · 11547601

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8
Signed-off-by: Martin Storsjö <martin@martin.st>

11547601

05 Feb, 2017 1 commit

aarch64: vp9itxfm: Restructure the idct32 store macros · 58d87e0f

Martin Storsjö authored 8 years ago

This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

This is also arguably more readable.
Signed-off-by: Martin Storsjö <martin@martin.st>

58d87e0f

14 Jan, 2017 4 commits

aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 8b11a89c

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.

This is cherrypicked from libav commits
cad42fad and
a0c443a3.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

8b11a89c
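
The slice-skipping idea can be sketched as follows. This is a Python model under stated assumptions, not the assembly from the commit: the threshold table `MIN_EOB_16` holds placeholder values (the real table is not quoted on this page), and `first_pass_slices` and `first_pass` are hypothetical names. The first pass compares eob against a per-slice threshold and stops transforming at the first slice that cannot contain nonzero coefficients; the remaining slices of the temp buffer are zero-filled for the second pass.

```python
# Placeholder thresholds: slice i can hold nonzero coefficients only when
# eob > MIN_EOB_16[i]. These numbers are assumptions, not the real table.
MIN_EOB_16 = [0, 10]   # a 16x16 block has two 8-column slices


def first_pass_slices(eob, min_eob):
    """Return how many slices the first pass has to transform."""
    processed = 0
    for threshold in min_eob:
        if eob <= threshold:
            break          # this slice and all later ones are empty
        processed += 1
    return processed


def first_pass(eob, min_eob):
    """Return a per-slice plan: 'transform' or 'zero-fill'."""
    n = first_pass_slices(eob, min_eob)
    return ["transform"] * n + ["zero-fill"] * (len(min_eob) - n)


print(first_pass(5, MIN_EOB_16))    # small eob: only the first slice
print(first_pass(200, MIN_EOB_16))  # large eob: every slice
```

The extra comparisons in the loop correspond to the additional cmps that the commit message cites as the source of the minor overhead in the full-subpartition case.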

aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it · 37cb224e

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
2f99117f.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

37cb224e

arm/aarch64: vp9itxfm: Fix indentation of macro arguments · 4a5874ea

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
721bc375.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

4a5874ea

aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter · a95e7de4

Martin Storsjö authored 8 years ago

The clobbering tests in checkasm are only invoked when testing
correctness, so this bug didn't show up when benchmarking the
dc-only version.

This is cherrypicked from libav commit
4d960a11.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

a95e7de4
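
Why the register width matters for an `int` parameter can be shown with plain integer arithmetic. This is a Python sketch under assumptions: `read_w` is a hypothetical helper, and the garbage pattern in the upper half is invented for the example. A 32-bit argument defines only the low half of its 64-bit register, so a comparison on the full `x3` can fail even though `w3` holds the intended value.

```python
MASK32 = 0xFFFFFFFF


def read_w(x_register):
    """Model of reading w3: only the low 32 bits of x3 are visible."""
    return x_register & MASK32


eob = 1
# The upper half is deliberately filled with garbage here, in the way a
# register-clobbering test harness might do; the pattern is an assumption.
x3 = (0xDEADBEEF << 32) | eob

full_compare = (x3 == 1)           # comparing x3 sees the garbage
low_compare = (read_w(x3) == 1)    # comparing w3 sees the intended value
print(full_compare, low_compare)
```

When the upper half happens to be zero, as it may be in a benchmarking run without the clobbering tests, both comparisons agree, which is consistent with the commit message's remark that the bug did not show up there.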

19 Dec, 2016 1 commit

aarch64: vp9itxfm: Use the offset parameter to movrel · a0c443a3

Martin Storsjö authored 8 years ago

This fixes build failures for iOS, broken since cad42fad.
Signed-off-by: Martin Storsjö <martin@martin.st>

a0c443a3

30 Nov, 2016 1 commit

aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · cad42fad

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.
Signed-off-by: Martin Storsjö <martin@martin.st>

cad42fad

24 Nov, 2016 1 commit

aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it · 2f99117f

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

2f99117f

23 Nov, 2016 1 commit

arm/aarch64: vp9itxfm: Fix indentation of macro arguments · 721bc375

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

721bc375