- 11 Jul, 2017 1 commit
-
Muhammad Faiz authored
It is redundant with costable: the first half of sintable is identical to the second half of costable, and the second half of sintable is the negated first half of sintable. The computation is changed to handle the sign of the sin values, in both the C code and the ARM assembly code.
Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
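Taking the two symmetries above at face value, a lookup helper might look like the following C sketch; the table layout assumed here (one full period in n entries, shared phase, exactly the relations stated in the message) is for illustration only and is not lifted from the actual rdft code:

    /* Hypothetical helper: serve sin lookups from costable alone,
     * assuming sintable[k] == costable[k + n/2] for k < n/2 and
     * sintable[k] == -sintable[k - n/2] for k >= n/2, as stated above. */
    static double sin_from_costable(const double *costable, int n, int k)
    {
        if (k < n / 2)
            return costable[k + n / 2]; /* first half: second half of costable */
        return -costable[k];            /* second half: negated first half     */
    }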
-
- 28 Jun, 2017 2 commits
-
Clément Bœsch authored
-
Clément Bœsch authored
The code originally pre-multiplied the steps by 2, causing the running sum of the h factors to drift away due to the lack of precision. It quickly causes an inaccuracy greater than 0.01. I tried diverse approaches, such as multiplying by 2.0 (instead of adding the value to itself), without success. I'm unable to bench the impact of this change; feel free to compare. This commit fixes the incoming aacpsdsp tests. The following is an alternative simplified function (matching the incoming AArch64 code) that may be used:

function ff_ps_stereo_interpolate_neon, export=1
        vld1.32         {q0}, [r2]
        vld1.32         {q1}, [r3]
        ldr             r12, [sp]
        vmov.f32        q8,  q0
        vmov.f32        q9,  q1
        vzip.32         q8,  q0
        vzip.32         q9,  q1
1:
        vld1.32         {d4}, [r0,:64]
        vld1.32         {d6}, [r1,:64]
        vadd.f32        q8,  q8,  q9
        vadd.f32        q0,  q0,  q1
        vmov.f32        d5,  d4
        vmov.f32        d7,  d6
        vmul.f32        q2,  q2,  q8
        vmla.f32        q2,  q3,  q0
        vst1.32         {d4}, [r0,:64]!
        vst1.32         {d5}, [r1,:64]!
        subs            r12, r12, #1
        bgt             1b
        bx              lr
endfunc
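For reference, the accurate scheme the fix restores looks like this in C; it is modelled on the scalar reference for ff_ps_stereo_interpolate but simplified from memory, so treat the exact parameter layout as illustrative. The key point is that the unscaled step is added once per sample, rather than a pre-doubled step every other sample:

    /* Sketch of the running-sum interpolation; h holds the four initial
     * factors and h_step their per-sample increments. Adding the step
     * once per sample keeps the accumulated factors accurate, where a
     * pre-doubled step let float rounding drift past 0.01. */
    static void stereo_interpolate_ref(float (*l)[2], float (*r)[2],
                                       float h[2][4], float h_step[2][4],
                                       int len)
    {
        float h0 = h[0][0], h1 = h[0][1], h2 = h[0][2], h3 = h[0][3];
        float s0 = h_step[0][0], s1 = h_step[0][1],
              s2 = h_step[0][2], s3 = h_step[0][3];

        for (int n = 0; n < len; n++) {
            float l_re = l[n][0], l_im = l[n][1];
            float r_re = r[n][0], r_im = r[n][1];
            h0 += s0; h1 += s1; h2 += s2; h3 += s3; /* one step per sample */
            l[n][0] = h0 * l_re + h2 * r_re;
            l[n][1] = h0 * l_im + h2 * r_im;
            r[n][0] = h1 * l_re + h3 * r_re;
            r[n][1] = h1 * l_im + h3 * r_im;
        }
    }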
-
- 15 May, 2017 1 commit
-
Martin Storsjö authored
Clang (in the upcoming 5.0 version) is now capable of building our arm assembly without relying on gas-preprocessor, although clang/LLVM doesn't support .dn register aliases. The VC1 MC assembly was only built and used if the chosen assembler supported the .dn directives, though, which was the case as long as gas-preprocessor was used. This means that VC1 decoding got a speed regression on clang 5.0, unless the user manually chose to use gas-preprocessor again. By avoiding the .dn register aliases, we can build the VC1 MC assembly with the latest clang version. Support for the .dn/.qn directives in clang/LLVM isn't actively planned; see https://bugs.llvm.org/show_bug.cgi?id=18199. This partially reverts 896a5bff.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 04 May, 2017 2 commits
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Alexandra Hájková authored
This doesn't change the actual behaviour of the code but improves readability.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 01 May, 2017 1 commit
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 28 Apr, 2017 1 commit
-
Martin Storsjö authored
Before:                       Cortex A7      A8      A9     A53
hevc_add_res_8x8_8_neon:          116.0    58.7    80.2    90.7
hevc_add_res_32x32_8_neon:       1230.0   737.5  1187.5   974.4

After:
hevc_add_res_8x8_8_neon:           97.7    57.0    73.7    80.0
hevc_add_res_32x32_8_neon:       1216.0   698.7  1127.5   827.1

Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 27 Apr, 2017 1 commit
-
Seppo Tomperi authored
Optimized by Alexandra Hájková.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 25 Apr, 2017 2 commits
-
Alexandra Hájková authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Seppo Tomperi authored
Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 12 Apr, 2017 1 commit
-
Alexandra Hájková authored
The speedup vs C code is around 6-13x.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 06 Apr, 2017 1 commit
-
Ronald S. Bultje authored
Instead, hardcode the use of the _arm implementation of add_pixels, and use the C version for put_pixels (as no arm-optimized version exists). Since there are separate implementations of idct{,_put,_add} for neon, this has no practical impact on performance.
-
- 28 Mar, 2017 3 commits
-
Ronald S. Bultje authored
This allows vp9dsp.h to only include the VP9 types header, and not the decoder skeleton interface which is for hardware decoders (dxva2/vaapi).
-
Ronald S. Bultje authored
The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.
-
Martin Storsjö authored
The main hevcdsp.c file calls this init function if HAVE_ARM is set, regardless of whether neon support is available or not. This fixes builds where neon isn't supported by the build tools at all.
Signed-off-by: Martin Storsjö <martin@martin.st>
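A minimal sketch of the pattern being fixed, in plain C (the type and function names are stand-ins, not FFmpeg's exact definitions): the arm init must always exist, since the generic code calls it whenever HAVE_ARM is set, and it only forwards to the NEON init when the toolchain could assemble NEON and the CPU supports it.

    /* Stand-in context; the real HEVCDSPContext carries function pointers. */
    typedef struct HEVCDSPContext HEVCDSPContext;

    void hevc_dsp_init_neon(HEVCDSPContext *c, int bit_depth); /* from the asm build */
    int  get_cpu_flags(void);                                  /* runtime detection  */
    #define FLAG_NEON (1 << 5)                                 /* illustrative value */

    void hevc_dsp_init_arm(HEVCDSPContext *c, int bit_depth)
    {
    #if HAVE_NEON                        /* build-time: the assembler supports NEON */
        if (get_cpu_flags() & FLAG_NEON) /* run-time: the CPU actually has NEON */
            hevc_dsp_init_neon(c, bit_depth);
    #endif
    }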
-
- 27 Mar, 2017 2 commits
-
Alexandra Hájková authored
Optimized by Martin Storsjö <martin@martin.st>. The speedup vs C code is around 3.2-4.4x.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Clément Bœsch authored
This follows the Libav layout to ease merges.
-
- 20 Mar, 2017 1 commit
-
Clément Bœsch authored
-
- 19 Mar, 2017 8 commits
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14516 bytes to 22484 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37438.1  39103.4  32217.8  24067.6

Signed-off-by: Martin Storsjö <martin@martin.st>
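The first-pass shortcut can be illustrated in plain C (a generic sketch, not the structure of the NEON code): the eob value bounds how many input column slices can hold nonzero coefficients, so only those are transformed, and a second pass that tracks the same bound never reads the untouched part of the temp buffer.

    #include <stdint.h>

    /* Generic partial first pass: transform only the `nonzero` leading
     * column slices into tmp; columns [nonzero, size) are never written,
     * and the matching partial second pass never reads them, so no
     * zero-filling is needed. idct_col is a stand-in column transform. */
    static void idct_first_pass_partial(int16_t *tmp, const int16_t *in,
                                        int size, int nonzero,
                                        void (*idct_col)(int16_t *out,
                                                         const int16_t *in,
                                                         int stride))
    {
        for (int i = 0; i < nonzero; i++)
            idct_col(tmp + i, in + i, size);
    }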
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes. This gives a small slowdown of a couple of tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37593.2  38862.2  32235.8  24070.9

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22653.8  18268.4  19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37699.0  38665.2  32542.3  24472.2

After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22270.8  18159.3  19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37523.3  37731.6  32181.7  24071.2

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This makes the code slightly clearer, but doesn't make any functional difference.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfa.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e20.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we will at least need to do the first three slices; we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. This is cherrypicked from libav commit 98ee855a.
Signed-off-by: Martin Storsjö <martin@martin.st>
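In C terms, the half-case check described above reduces to a single compare against a known constant instead of a table load. A hedged sketch follows; the threshold is passed in here rather than hardcoded, since the actual value lives in the asm:

    /* Half case: the first three slices are always computed; only the
     * fourth slice's eob threshold matters, so one hardcoded compare
     * replaces indexing into the min_eob array. */
    static int slices_for_half_case(int eob, int fourth_slice_min_eob)
    {
        return eob < fourth_slice_min_eob ? 3 : 4; /* at most one slice skipped */
    }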
-
- 16 Mar, 2017 1 commit
-
Martin Storsjö authored
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 11 Mar, 2017 12 commits
-
Martin Storsjö authored
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we will at least need to do the first three slices; we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit 08074c09.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit de06bdfe.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16. While keeping these coefficients in registers, we still can skip pushing q7.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:       18553.8  17182.7  14303.3  12089.7

After:
vp9_inv_dct_dct_32x32_sub32_add_neon:       18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit 402546a1.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though.

Before:                                   Cortex A7       A8       A9      A53
vp9_loop_filter_mix2_v_44_16_neon:            289.7    256.2    237.5    181.2

After:
vp9_loop_filter_mix2_v_44_16_neon:            221.2    150.5    177.7    138.0

This is cherrypicked from libav commit 575e31e9.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
The theoretical maximum value of E is 193, so we can just saturate the addition to 255.

Before:                         Cortex A7     A8     A9    A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:         143.0  127.7  114.8   88.0         87.7
vp9_loop_filter_v_8_8_neon:         241.0  197.2  173.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        497.0  419.5  379.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       965.2  818.7  731.4  579.0        452.0

After:
vp9_loop_filter_v_4_8_neon:         136.0  125.7  112.6   84.0         83.0
vp9_loop_filter_v_8_8_neon:         234.0  195.5  171.5  136.0        133.7
vp9_loop_filter_v_16_8_neon:        490.0  417.5  377.7  289.0        271.0
vp9_loop_filter_v_16_16_neon:       951.2  814.7  732.3  571.0        446.7

This is cherrypicked from libav commit c582cb85.
Signed-off-by: Martin Storsjö <martin@martin.st>
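The trick maps directly onto NEON's unsigned saturating add (vqadd.u8). A plain-C equivalent of the idea, as a hedged sketch rather than the filter code itself:

    #include <stdint.h>

    /* Since E never exceeds 193, the addition in the threshold
     * computation either fits in a byte or clamps to 255, and the
     * clamped value compares the same way - so an 8-bit saturating
     * add can replace a widening add. */
    static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
    {
        unsigned sum = (unsigned)a + b;
        return sum > 255 ? 255 : (uint8_t)sum;
    }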
-
Martin Storsjö authored
This adds lots of extra .ifs, but speeds it up by a couple of cycles, by avoiding stalls. This is cherrypicked from libav commit e18c3900.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This is cherrypicked from libav commit 435cd7bc.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see a gain in all cases.

Before:                         Cortex A7     A8     A9    A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:         147.0  129.0  115.8   89.0         88.7
vp9_loop_filter_v_8_8_neon:         242.0  198.5  174.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        500.0  419.5  382.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       971.2  825.5  731.5  579.0        453.0

After:
vp9_loop_filter_v_4_8_neon:         143.0  127.7  114.8   88.0         87.7
vp9_loop_filter_v_8_8_neon:         241.0  197.2  173.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:        497.0  419.5  379.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:       965.2  818.7  731.4  579.0        452.0

This is cherrypicked from libav commit e1f9de86.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:          273.0    189.5    211.7    235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:          752.0    459.2    862.2    553.9

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:          226.5    145.0    225.1    171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:          721.2    415.7    727.6    475.0

This is cherrypicked from libav commit a76bf8cf.
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Before:                                   Cortex A7       A8       A9      A53
vp9_put_8tap_smooth_4h_neon:                  378.1    273.2    340.7    229.5

After:
vp9_put_8tap_smooth_4h_neon:                  352.1    222.2    290.5    229.5

This is cherrypicked from libav commit fea92a4b.
Signed-off-by: Martin Storsjö <martin@martin.st>
-