- 19 Mar, 2017 40 commits
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6

Signed-off-by: Martin Storsjö <martin@martin.st>
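The idea of skipping coefficients that are known to be zero can be illustrated with a small C sketch. This is not the NEON code from the commit: `basis_at`, `transform_1d` and `quarter_matches_full` are hypothetical names, and the "transform" is a plain matrix-vector product standing in for a 1-D inverse DCT. It only shows that, when the tail of the input is zero, a loop over the leading quarter of the inputs yields the same output as the full loop.

```c
#include <stddef.h>

#define N 16

/* Hypothetical basis coefficients; any fixed integer matrix works here. */
static int basis_at(size_t i, size_t k)
{
    return (int)((i * 7 + k * 3) % 11) - 5;
}

/* Stand-in for a 1-D inverse transform: out = B * in, reading only the
 * first `nonzero` inputs. The caller promises the remaining inputs are 0. */
static void transform_1d(const int *in, int *out, size_t nonzero)
{
    for (size_t i = 0; i < N; i++) {
        int sum = 0;
        for (size_t k = 0; k < nonzero; k++)
            sum += basis_at(i, k) * in[k];
        out[i] = sum;
    }
}

/* Returns 1 if the quarter-length computation matches the full one. */
static int quarter_matches_full(void)
{
    int in[N] = { 3, -1, 4, 2 };   /* only the first N/4 entries are nonzero */
    int full[N], quarter[N];

    transform_1d(in, full, N);
    transform_1d(in, quarter, N / 4);

    for (size_t i = 0; i < N; i++)
        if (full[i] != quarter[i])
            return 0;
    return 1;
}
```

The shorter loop does a quarter of the multiplications, which is in the spirit of why the smaller subpartitions benefit most in the figures above; the actual cycle counts depend on the assembly, not on this sketch.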
-
Martin Storsjö authored
This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes.

This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This makes the code a bit more readable. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them.

The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.8  18268.4  19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37699.0  38665.2  32542.3  24472.2

After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22270.8  18159.3  19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37523.3  37731.6  32181.7  24071.2

Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This makes the code slightly clearer, but doesn't make any functional difference. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfa. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e20. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array.

Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before.

This allows skipping loading the min_eob pointer when using the quarter/half cases.

This is cherrypicked from libav commit 98ee855a.

Signed-off-by: Martin Storsjö <martin@martin.st>
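The simplified eob check for the half case can be sketched in C as follows. This is only an illustration under stated assumptions: the threshold values in `min_eob`, the constant `HALF_LAST_THRESHOLD`, and the function names are all hypothetical and are not taken from the actual source. The point is that once the caller knows the first three slices are needed, a single comparison against a hardcoded constant gives the same answer as walking the table.

```c
/* Hypothetical thresholds (illustrative values only): slice i of the
 * first pass is needed when eob > min_eob[i]. */
static const int min_eob[4] = { 0, 3, 10, 36 };

/* The one value the half case still has to compare against. */
#define HALF_LAST_THRESHOLD 36

/* Generic version: walk the table and count the slices that are needed. */
static int slices_generic(int eob)
{
    int n = 0;
    for (int i = 0; i < 4; i++) {
        if (eob <= min_eob[i])
            break;
        n++;
    }
    return n;
}

/* Half case: the caller has already established eob > min_eob[2], so the
 * first three slices are always done and only the fourth is in question.
 * No table load is needed. */
static int slices_half(int eob)
{
    return eob > HALF_LAST_THRESHOLD ? 4 : 3;
}

/* Returns 1 if both versions agree for every eob in the half-case range. */
static int half_matches_generic(void)
{
    for (int eob = min_eob[2] + 1; eob <= 64; eob++)
        if (slices_generic(eob) != slices_half(eob))
            return 0;
    return 1;
}
```

Because at most one slice is skipped in this case, the zero-filling for the skipped slice can likewise be written out straight-line instead of as a loop, which is the unrolling the message describes.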
-
James Almer authored
* commit '4ab49626': libvpx: Cast a pointer to const to squelch a warning This commit is a noop, see 09b3bbe6 Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit '721d57e6': vp56: Separate VP5 and VP6 dsp initialization Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit '3fd22538': prores: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit 'f81be06c': cavs: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit '802727b5': vp8: Update some assembly comments left unchanged in bd66f073 Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit '87c6c786': vp8: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
-
James Almer authored
* commit 'd9d26a36': vp56: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
-
Clément Bœsch authored
* commit '6892df92': vp3: Change type of stride parameters to ptrdiff_t Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '963b3ab1': doc: Document FATE option HWACCEL Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit 'd42809f9': av1: Add codec_id and basic demuxing support Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '24130234': rtpdec_mpeg4: validate fmtp fields Merged with fixed log message. Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '46e3936f': configure: Set __MSVCRT_VERSION__ to 0x0700 for MinGW Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '6755eb5b': mss12: validate display dimensions This commit is a noop, see ee9151b6 Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '33f10546': vc1: check that slices have a positive height This commit is a noop, see e985cfd1 Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '09b23786': pcx: use the bytestream2 API for reading from input This commit is a noop, see 8cd1c0fe Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '221402c1': pcx: check that the packet is large enough before reading the header See 8cd1c0fe Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '15ee419b': pcx: properly pad the scanline This commit is a noop, see d24de459 Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '409d1cd2': cook: use the bytestream2 API for reading extradata Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit 'bba9d8bd': qpeg: fix an off by 1 error in the MV check See dd3bfe3c Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '796dca02': alac: do not return success if nothing was decoded See e11983bd Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit 'f5d46d33': vmnc: check that subrectangles fit into their containing rectangles See 6ba02602 This merge keeps our condition against w-i and h-j instead of bw and bh. One may be more correct than the other, but I'm keeping our behaviour here for safety reasons. The style and formatting are merged. Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '83b92a85': golomb: Drop disabled cruft Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '014852e9': simple_idct: arm: Drop disabled code variant Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit 'e2b99935': simple_idct: x86: Drop disabled IDCT implementation Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '7effebde': dvbsubdec: Remove disabled, near-duplicate debug code Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '93fed46a': timefilter: test: Drop some disabled debug cruft Merged-by: Clément Bœsch <u@pkh.me>
-
Clément Bœsch authored
* commit '0e285c2f': mpegvideo: Kill some disabled code Merged-by: Clément Bœsch <u@pkh.me>
-