Commits · 644130bcaa22ed42718e1e0aabcb0e398b8414ff · Linshizhi / ffmpeg.wasm-core

19 Mar, 2017 3 commits

arm/aarch64: vp9: Fix vertical alignment · 21c89f3a

Martin Storsjö authored 8 years ago

Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfa.
Signed-off-by: Martin Storsjö <martin@martin.st>

21c89f3a

arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 70317b25

Martin Storsjö authored 7 years ago

In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e20.
Signed-off-by: Martin Storsjö <martin@martin.st>

70317b25

arm: vp9itxfm: Template the quarter/half idct32 function · b7a565fe

Martin Storsjö authored 7 years ago

This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

In the half case, we know we will need to do at least the first
three slices, so we only need to check eob for the fourth one.
We can therefore hardcode the value to check against instead of
loading it from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.

This is cherrypicked from libav commit
98ee855a.
Signed-off-by: Martin Storsjö <martin@martin.st>

b7a565fe
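
The simplified half-case eob check described in the commit above can be sketched roughly as follows. This is only an illustrative Python model, not the NEON assembly: the function name is hypothetical, the threshold value is a placeholder (the log does not state the real constant), and the "strictly greater than" comparison is an assumption.

```python
# Placeholder threshold for the fourth slice. The real constant is hardcoded
# in the assembly source and is not given in this commit log.
HALF_CASE_FOURTH_SLICE_MIN_EOB = 70


def half_case_slice_plan(eob, fourth_slice_min_eob=HALF_CASE_FOURTH_SLICE_MIN_EOB):
    """Return (slices_to_transform, slices_to_zero_fill) for the half case.

    The first three slices are always transformed, so only the fourth one
    is gated on eob. The comparison uses a constant, so no pointer into a
    min_eob table has to be loaded.
    """
    transform = [0, 1, 2]
    if eob > fourth_slice_min_eob:  # assumed comparison direction
        transform.append(3)
        return transform, []
    # At most one slice is skipped, so the zero fill needs no loop.
    return transform, [3]


print(half_case_slice_plan(50))
print(half_case_slice_plan(100))
```

Because the skipped set is either empty or exactly one slice, the zero fill can be written out straight-line, which is the unrolling the commit message refers to.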

16 Mar, 2017 1 commit

arm/aarch64: vp9: Fix vertical alignment · 7995ebfa

Martin Storsjö authored 8 years ago

Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.
Signed-off-by: Martin Storsjö <martin@martin.st>

7995ebfa

11 Mar, 2017 11 commits

arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used · 3a0d5e20

Martin Storsjö authored 7 years ago

In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.
Signed-off-by: Martin Storsjö <martin@martin.st>

3a0d5e20

arm: vp9itxfm: Template the quarter/half idct32 function · 98ee855a

Martin Storsjö authored 7 years ago

This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

In the half case, we know we will need to do at least the first
three slices, so we only need to check eob for the fourth one.
We can therefore hardcode the value to check against instead of
loading it from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.
Signed-off-by: Martin Storsjö <martin@martin.st>

98ee855a

arm: vp9itxfm: Reorder iadst16 coeffs · b2e20d89

Martin Storsjö authored 8 years ago

This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
08074c09.
Signed-off-by: Martin Storsjö <martin@martin.st>

b2e20d89

arm: vp9itxfm: Reorder the idct coefficients for better pairing · 4f693b56

Martin Storsjö authored 8 years ago

All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
de06bdfe.
Signed-off-by: Martin Storsjö <martin@martin.st>

4f693b56
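
The effect of the reordering in the commit above can be checked with a small model. This is a sketch under stated assumptions: the 16 coefficient slots, the single first element and the seven pairs come from the commit message, while the grouping of four 16-bit coefficients per register and the helper name `split_pairs` are assumptions made for illustration.

```python
LANES_PER_REGISTER = 4  # assumption: four 16-bit coefficients per 64-bit register


def split_pairs(pair_start, pair_count, lanes=LANES_PER_REGISTER):
    """Return the (i, i + 1) slot pairs whose halves land in different registers."""
    pairs = [(pair_start + 2 * k, pair_start + 2 * k + 1) for k in range(pair_count)]
    return [pair for pair in pairs if pair[0] // lanes != pair[1] // lanes]


# Old layout: slot 0 is the single element, pairs start at slot 1, slot 15 unused.
old_split = split_pairs(pair_start=1, pair_count=7)

# New layout: slot 0 is the single element, slot 1 unused, pairs start at slot 2.
new_split = split_pairs(pair_start=2, pair_count=7)

print("old layout, split pairs:", old_split)
print("new layout, split pairs:", new_split)
```

Under this model three of the seven pairs straddle a register boundary in the old layout and none do in the new one, which is why loading only part of the coefficients becomes simpler.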

arm: vp9itxfm: Avoid reloading the idct32 coefficients · 600f4c9b

Martin Storsjö authored 8 years ago

The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit
402546a1.
Signed-off-by: Martin Storsjö <martin@martin.st>

600f4c9b

arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · 758302e4

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Before:                            Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0

This is cherrypicked from libav commit
a76bf8cf.
Signed-off-by: Martin Storsjö <martin@martin.st>

758302e4

arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function · 1d8ab576

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
3933b86b.
Signed-off-by: Martin Storsjö <martin@martin.st>

1d8ab576

arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 82458955

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0

This is cherrypicked from libav commit
5eb5aec4.
Signed-off-by: Martin Storsjö <martin@martin.st>

82458955
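
To put the "pretty substantial speedup" of the commit above into numbers, the ratios can be computed directly from its Before/After tables. The sketch below uses only a few Cortex A7 figures copied from those tables. The dictionary keys and the `speedup` helper are names chosen here for illustration.

```python
# Cortex A7 cycle counts copied from the Before/After tables of the commit above.
before_a7 = {
    "16x16_sub4": 2104.5,
    "16x16_sub8": 2484.8,
    "32x32_sub4": 10865.6,
    "32x32_sub16": 14176.4,
}
after_a7 = {
    "16x16_sub4": 1203.5,
    "16x16_sub8": 1926.1,
    "32x32_sub4": 7549.9,
    "32x32_sub16": 11859.2,
}


def speedup(before_cycles, after_cycles):
    """Relative speedup: values above 1.0 mean the new code is faster."""
    return before_cycles / after_cycles


for name in before_a7:
    print(f"{name}: {speedup(before_a7[name], after_a7[name]):.2f}x")
```

The gain is largest for the cases that now take the quarter path and shrinks as more of the block has to be processed, matching the tables.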

arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 3bd9b391

Martin Storsjö authored 8 years ago

This allows reusing the macro for a separate implementation of the
pass2 function.

This is cherrypicked from libav commit
47b3c2c1.
Signed-off-by: Martin Storsjö <martin@martin.st>

3bd9b391

arm: vp9itxfm: Make the larger core transforms standalone functions · f8fcee0d

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.

This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9

This is cherrypicked from libav commit
0331c3f5.
Signed-off-by: Martin Storsjö <martin@martin.st>

f8fcee0d

arm: vp9itxfm: Avoid .irp when it doesn't save any lines · 31e41350

Martin Storsjö authored 8 years ago

This makes it more readable.

This is cherrypicked from libav commit
3bc5b28d.
Signed-off-by: Martin Storsjö <martin@martin.st>

31e41350

23 Feb, 2017 3 commits

arm: vp9itxfm: Reorder iadst16 coeffs · 08074c09

Martin Storsjö authored 8 years ago

This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.
Signed-off-by: Martin Storsjö <martin@martin.st>

08074c09

arm: vp9itxfm: Reorder the idct coefficients for better pairing · de06bdfe

Martin Storsjö authored 8 years ago

All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
Signed-off-by: Martin Storsjö <martin@martin.st>

de06bdfe

arm: vp9itxfm: Avoid reloading the idct32 coefficients · 402546a1

Martin Storsjö authored 8 years ago

The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
Signed-off-by: Martin Storsjö <martin@martin.st>

402546a1

10 Feb, 2017 1 commit

arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling · a76bf8cf

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Before:                            Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0
Signed-off-by: Martin Storsjö <martin@martin.st>

a76bf8cf

09 Feb, 2017 4 commits

arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function · 3933b86b

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

3933b86b

arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible · 5eb5aec4

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

5eb5aec4

arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function · 47b3c2c1

Martin Storsjö authored 8 years ago

This allows reusing the macro for a separate implementation of the
pass2 function.
Signed-off-by: Martin Storsjö <martin@martin.st>

47b3c2c1

arm: vp9itxfm: Make the larger core transforms standalone functions · 0331c3f5

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.

This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9
Signed-off-by: Martin Storsjö <martin@martin.st>

0331c3f5

05 Feb, 2017 1 commit

arm: vp9itxfm: Avoid .irp when it doesn't save any lines · 3bc5b28d

Martin Storsjö authored 8 years ago

This makes it more readable.
Signed-off-by: Martin Storsjö <martin@martin.st>

3bc5b28d

14 Jan, 2017 5 commits

arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 388f6e67

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

                                     Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3

By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1

I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.

In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.

This is cherrypicked from libav commit
9c8bc74c.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

388f6e67
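
The slice skipping described in the commit above can be modelled as a scan over per-slice eob thresholds. This is a hedged sketch rather than the actual assembly: the function name `plan_first_pass`, the threshold values in `EXAMPLE_MIN_EOB_16`, and the exact comparison are all assumptions, since the log does not give the contents of the real table.

```python
# Illustrative thresholds for the four 4x16 slices of a 16x16 block.
# These values are placeholders, not taken from the commit log.
EXAMPLE_MIN_EOB_16 = [0, 10, 38, 89]


def plan_first_pass(eob, min_eob):
    """Split the first-pass slices into those to transform and those to zero.

    Slice i is transformed only while eob exceeds min_eob[i]. The thresholds
    are assumed to be non-decreasing, so the scan stops at the first slice
    that can be skipped and every later slice is skipped as well.
    """
    transformed = []
    for index, threshold in enumerate(min_eob):
        if eob <= threshold:
            break
        transformed.append(index)
    # Skipped slices still need zeros in the temp buffer for the second pass.
    zeroed = list(range(len(transformed), len(min_eob)))
    return transformed, zeroed


for eob in (5, 20, 256):
    print(eob, plan_first_pass(eob, EXAMPLE_MIN_EOB_16))
```

The extra loads and compares in this scan correspond to the small overhead the commit reports for the full case, while small eob values skip most of the first pass.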

arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination · ecd343aa

Martin Storsjö authored 8 years ago

This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.

This is similar to what was done in the aarch64 version.

This is cherrypicked from libav commit
3c87039a.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

ecd343aa

arm: vp9itxfm: Rename a macro parameter to fit better · f69dd26d

Martin Storsjö authored 8 years ago

Since the same parameter is used for both input and output,
the name inout is more fitting.

This matches the naming used below in the dmbutterfly macro.

This is cherrypicked from libav commit
79566ec8.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

f69dd26d

arm/aarch64: vp9itxfm: Fix indentation of macro arguments · 4a5874ea

Martin Storsjö authored 8 years ago

This is cherrypicked from libav commit
721bc375.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

4a5874ea

arm: vp9itxfm: Simplify the stack alignment code · a71cd843

Janne Grunau authored 8 years ago

This is one instruction fewer for thumb, and leaves only
1/2 arm/thumb-specific instructions.

This is cherrypicked from libav commit
e5b0fc17.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

a71cd843

30 Nov, 2016 2 commits

arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 · 9c8bc74c

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

                                     Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3

By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
Signed-off-by: Martin Storsjö <martin@martin.st>

9c8bc74c

arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination · 3c87039a

Martin Storsjö authored 8 years ago

This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.

This is similar to what was done in the aarch64 version.
Signed-off-by: Martin Storsjö <martin@martin.st>

3c87039a

23 Nov, 2016 2 commits

arm: vp9itxfm: Rename a macro parameter to fit better · 79566ec8

Martin Storsjö authored 8 years ago

Since the same parameter is used for both input and output,
the name inout is more fitting.

This matches the naming used below in the dmbutterfly macro.
Signed-off-by: Martin Storsjö <martin@martin.st>

79566ec8

arm/aarch64: vp9itxfm: Fix indentation of macro arguments · 721bc375

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

721bc375

18 Nov, 2016 1 commit

arm: vp9itxfm: Simplify the stack alignment code · e5b0fc17

Janne Grunau authored 8 years ago

This is one instruction fewer for thumb, and leaves only
1/2 arm/thumb-specific instructions.
Signed-off-by: Martin Storsjö <martin@martin.st>

e5b0fc17

15 Nov, 2016 1 commit

arm: vp9: Add NEON itxfm routines · b4dc7c34

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

For the transforms up to 8x8, we can fit all the data (including
temporaries) in registers and just do a straightforward transform
of all the data. For 16x16, we do a transform of 4x16 pixels in
4 slices, using a temporary buffer. For 32x32, we transform 4x32
pixels at a time, in two steps of 4x16 pixels each.

Examples of relative speedup compared to the C version, from checkasm:
                         Cortex       A7     A8     A9    A53
vp9_inv_adst_adst_4x4_add_neon:     3.39   5.83   4.17   4.01
vp9_inv_adst_adst_8x8_add_neon:     3.79   4.86   4.23   3.98
vp9_inv_adst_adst_16x16_add_neon:   3.33   4.36   4.11   4.16
vp9_inv_dct_dct_4x4_add_neon:       4.06   6.16   4.59   4.46
vp9_inv_dct_dct_8x8_add_neon:       4.61   6.01   4.98   4.86
vp9_inv_dct_dct_16x16_add_neon:     3.35   3.44   3.36   3.79
vp9_inv_dct_dct_32x32_add_neon:     3.89   3.50   3.79   4.42
vp9_inv_wht_wht_4x4_add_neon:       3.22   5.13   3.53   3.77

Thus, the speedup vs C code is around 3-6x.

This is marginally faster than the corresponding routines in libvpx
on most cores, tested with their 32x32 idct (compared to
vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
favour since their version doesn't clear the input buffer like ours
does (although the effect of that on the total runtime is probably
negligible).

                           Cortex       A7       A8       A9      A53
vp9_inv_dct_dct_32x32_add_neon:    18436.8  16874.1  14235.1  11988.9
libvpx vpx_idct32x32_1024_add_neon 20789.0  13344.3  15049.9  13030.5

Only on the Cortex A8, the libvpx function is faster. On the other cores,
ours is slightly faster even though ours has got source block clearing
integrated.

This is an adapted cherry-pick from libav commits
a67ae670 and
52d196fb.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>

b4dc7c34
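
The sliced two-pass structure described in the commit above (4x16 slices written to a temporary buffer, followed by a second pass) can be illustrated with a toy model. This is a sketch under assumptions: `toy_1d` is a stand-in running sum and not the VP9 inverse transform, the pass order (columns first, then rows) is assumed, and the function names are made up for this example.

```python
def toy_1d(vec):
    """Stand-in 1-D transform (running sum). NOT the VP9 IDCT."""
    out, total = [], 0
    for value in vec:
        total += value
        out.append(total)
    return out


def transform_2d_sliced(block, slice_width=4):
    """Two-pass separable transform, doing the first pass in narrow slices.

    The block size is assumed to be a multiple of slice_width.
    """
    size = len(block)
    temp = [[0] * size for _ in range(size)]  # temporary buffer between passes
    # Pass 1: transform the columns, slice_width columns at a time.
    for start in range(0, size, slice_width):
        for col in range(start, start + slice_width):
            column = toy_1d([block[row][col] for row in range(size)])
            for row in range(size):
                temp[row][col] = column[row]
    # Pass 2: transform each row of the temporary buffer.
    return [toy_1d(row) for row in temp]


block = [[row * 16 + col for col in range(16)] for row in range(16)]
result = transform_2d_sliced(block)
print(result[0][:4], result[15][15])
```

Working in slices bounds how much data has to be live at once, which is the reason the commit gives for splitting the 16x16 and 32x32 cases while the smaller transforms fit entirely in registers.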

13 Nov, 2016 1 commit

arm: vp9itxfm: Simplify txfm string comparisons · 52d196fb

Martin Storsjö authored 8 years ago

Signed-off-by: Martin Storsjö <martin@martin.st>

52d196fb

11 Nov, 2016 1 commit

arm: vp9: Add NEON itxfm routines · a67ae670

Martin Storsjö authored 8 years ago

This work is sponsored by, and copyright, Google.

For the transforms up to 8x8, we can fit all the data (including
temporaries) in registers and just do a straightforward transform
of all the data. For 16x16, we do a transform of 4x16 pixels in
4 slices, using a temporary buffer. For 32x32, we transform 4x32
pixels at a time, in two steps of 4x16 pixels each.

Examples of relative speedup compared to the C version, from checkasm:
                         Cortex       A7     A8     A9    A53
vp9_inv_adst_adst_4x4_add_neon:     3.39   5.83   4.17   4.01
vp9_inv_adst_adst_8x8_add_neon:     3.79   4.86   4.23   3.98
vp9_inv_adst_adst_16x16_add_neon:   3.33   4.36   4.11   4.16
vp9_inv_dct_dct_4x4_add_neon:       4.06   6.16   4.59   4.46
vp9_inv_dct_dct_8x8_add_neon:       4.61   6.01   4.98   4.86
vp9_inv_dct_dct_16x16_add_neon:     3.35   3.44   3.36   3.79
vp9_inv_dct_dct_32x32_add_neon:     3.89   3.50   3.79   4.42
vp9_inv_wht_wht_4x4_add_neon:       3.22   5.13   3.53   3.77

Thus, the speedup vs C code is around 3-6x.

This is marginally faster than the corresponding routines in libvpx
on most cores, tested with their 32x32 idct (compared to
vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
favour since their version doesn't clear the input buffer like ours
does (although the effect of that on the total runtime is probably
negligible).

                           Cortex       A7       A8       A9      A53
vp9_inv_dct_dct_32x32_add_neon:    18436.8  16874.1  14235.1  11988.9
libvpx vpx_idct32x32_1024_add_neon 20789.0  13344.3  15049.9  13030.5

Only on the Cortex A8, the libvpx function is faster. On the other cores,
ours is slightly faster even though ours has got source block clearing
integrated.
Signed-off-by: Martin Storsjö <martin@martin.st>

a67ae670