    swscale/aarch64: use multiply accumulate and increase vector factor to 4 · bd831912
    Sebastian Pop authored
    This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
    and bumps the vectorization factor from 2 to 4.
    The speedup is 25% on Graviton1 A1 instances, based on Cortex-A72 CPUs:
    
    $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
    after:  t:0.032168 avg:0.032215 max:0.033081 min:0.032146
    
    The speedup is 39% on Graviton2 M6g instances, based on Neoverse-N1 CPUs:
    $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181
    after:  t:0.014015 avg:0.014096 max:0.015018 min:0.013971
    
    Tested with `make check` on aarch64-linux.
    Signed-off-by: Sebastian Pop <spop@amazon.com>
    Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org>
    Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
hscale.S 5.84 KB
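
For readers unfamiliar with the routine named in the commit message, the following is a minimal C sketch of what an 8-bit to 15-bit horizontal scaler computes and of what "vectorization factor 4" means at the loop level. It is not the NEON assembly from hscale.S. The function names `hscale_8_to_15_scalar` and `hscale_8_to_15_x4` are hypothetical, the signature is simplified (no scaler context argument), and the shift by 7 with a clamp to 15 bits is an assumption modeled on the usual swscale C reference.

```c
#include <stdint.h>

/* Scalar sketch: each output sample is a dot product of filterSize source
 * bytes with filterSize 16-bit coefficients, scaled down and clamped.
 * Assumption: a shift of 7 and a clamp to (1 << 15) - 1. */
void hscale_8_to_15_scalar(int16_t *dst, int dstW, const uint8_t *src,
                           const int16_t *filter, const int32_t *filterPos,
                           int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val);
    }
}

/* Same computation with four output samples per outer iteration.
 * Each output keeps its own accumulator, so the four multiply-accumulate
 * chains are independent; this is the loop shape that a SIMD version
 * with four lanes could map onto. */
void hscale_8_to_15_x4(int16_t *dst, int dstW, const uint8_t *src,
                       const int16_t *filter, const int32_t *filterPos,
                       int filterSize)
{
    int i = 0;
    for (; i + 4 <= dstW; i += 4) {
        int32_t acc[4] = {0, 0, 0, 0};
        for (int j = 0; j < filterSize; j++)
            for (int k = 0; k < 4; k++)
                acc[k] += (int32_t)src[filterPos[i + k] + j] *
                          filter[filterSize * (i + k) + j];
        for (int k = 0; k < 4; k++) {
            int32_t val = acc[k] >> 7;
            dst[i + k] = (int16_t)(val > 32767 ? 32767 : val);
        }
    }
    /* Tail: handle the remaining 0..3 outputs one at a time. */
    for (; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val);
    }
}
```

Both functions should produce identical output for the same inputs; the second only changes how much independent work is exposed per iteration, which is the property a wider vectorization factor relies on.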