- 27 Mar, 2017 1 commit
-
-
Clément Bœsch authored
This is following Libav layout to ease merges.
-
- 14 Jan, 2017 1 commit
-
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3 By skipping individual 4x16 or 4x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8 vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5 vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6 vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6 vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7 vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1 I.e. in general a very minor overhead for the full subpartition case due to the additional loads and cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. In common VP9 content in a few inspected clips, 70-90% of the non-dc-only 16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left 8x8 or 16x16 subpartitions respectively. This is cherrypicked from libav commit 9c8bc74c. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
- 27 Dec, 2016 1 commit
-
-
Ronald S. Bultje authored
-
- 30 Nov, 2016 1 commit
-
-
Martin Storsjö authored
This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3 By skipping individual 4x16 or 4x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8 vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5 vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6 vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6 vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7 vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1 I.e. in general a very minor overhead for the full subpartition case due to the additional loads and cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. In common VP9 content in a few inspected clips, 70-90% of the non-dc-only 16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left 8x8 or 16x16 subpartitions respectively. Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 23 Nov, 2016 2 commits
-
-
Ronald S. Bultje authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This reverts commit 81d7f0bb. Instead of just benchmarking dc separately, test all relevant subparts (in the next commit). Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 16 Nov, 2016 1 commit
-
-
Martin Storsjö authored
The dc-only mode is already checked to work correctly above, but this allows benchmarking this mode for performance tuning, and allows making sure that it actually is correctly hooked up. Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 11 Nov, 2016 1 commit
-
-
Ronald S. Bultje authored
This includes fixes by Henrik Gramner. The forward transforms are derived from the reference encoder. Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 03 Nov, 2016 1 commit
-
-
Martin Storsjö authored
This makes it match the pattern already used for VP8 MC functions. This also makes the signature match ffmpeg's version of these functions, easing porting of code in both directions. Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 04 Oct, 2016 1 commit
-
-
Ronald S. Bultje authored
The randomize_buffer() implementation assures that "most of the time", we'll do a good mix of wide16/wide8/hev/regular/no filters for complete code coverage. However, this is not mathematically assured because that would make the code either much more complex, or much less random. Some fixes and improvements by Rodger Combs <rodger.combs@gmail.com> Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 03 Aug, 2016 1 commit
-
-
Ronald S. Bultje authored
Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
- 27 Jul, 2016 1 commit
-
-
James Almer authored
Fixes checkasm failures on mmxext functions Signed-off-by: James Almer <jamrial@gmail.com>
-
- 13 Oct, 2015 1 commit
-
-
Ronald S. Bultje authored
These aren't quite as helpful as the ones in 8bpp, since over there, we can use pmulhrsw, but here the coefficients have too many bits to be able to take advantage of pmulhrsw. However, we can still skip cols for which all coefs are 0, and instead just zero the input data for the row itx. This helps a few % on overall decoding speed.
-
- 28 Sep, 2015 2 commits
-
-
Henrik Gramner authored
-
Ronald S. Bultje authored
-
- 26 Sep, 2015 3 commits
-
-
James Almer authored
Reviewed-by: Henrik Gramner <henrik@gramner.com> Signed-off-by: James Almer <jamrial@gmail.com>
-
Ronald S. Bultje authored
-
Rodger Combs authored
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
- 24 Sep, 2015 1 commit
-
-
Michael Niedermayer authored
The change was wrong, also add a comment explaining it Found-by: BBB Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
- 22 Sep, 2015 1 commit
-
-
Ronald S. Bultje authored
(I forgot to actually merge them into the patch I just pushed.)
-
- 20 Sep, 2015 3 commits
-
-
Rodger Combs authored
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
Michael Niedermayer authored
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
Ronald S. Bultje authored
The randomize_buffer() implementation assures that "most of the time", we'll do a good mix of wide16/wide8/hev/regular/no filters for complete code coverage. However, this is not mathematically assured because that would make the code either much more complex, or much less random.
-
- 16 Sep, 2015 1 commit
-
-
Michael Niedermayer authored
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
-
- 15 Sep, 2015 2 commits
-
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-