-
Martin Storsjö authored
The previous version was a pretty exact translation of the arm version. This version does do some unnecessary arithemetic (it does more operations on vectors that are only half filled; it does 4 uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead of packing data together (which could be done for free in the arm version). This gives a decent speedup on Cortex A53, a minor speedup on A72 and a very minor slowdown on Cortex A73. Before: Cortex A53 A72 A73 vp8_idct_add_neon: 79.7 67.5 65.0 After: vp8_idct_add_neon: 67.7 64.8 66.7 Signed-off-by: Martin Storsjö <martin@martin.st>
7e42d5f0