- 05 Dec, 2014 1 commit
-
-
Kieran Kunhya authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
-
- 26 Nov, 2014 1 commit
-
-
Kieran Kunhya authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 28 Sep, 2014 1 commit
-
-
Michael Niedermayer authored
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 27 Sep, 2014 1 commit
-
-
lvqcl authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 09 Sep, 2014 3 commits
-
-
Henrik Gramner authored
Previously there was a limit of two cpuflags.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
Loren Merritt authored
Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
Henrik Gramner authored
This makes more sense for future implementations of templates with zmm registers.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
-
- 05 Sep, 2014 1 commit
-
-
Henrik Gramner authored
Previously there was a limit of two cpuflags.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 04 Sep, 2014 2 commits
-
-
Henrik Gramner authored
This makes more sense for future implementations of templates with zmm registers.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
Loren Merritt authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 23 Aug, 2014 2 commits
-
-
Clément Bœsch authored
501 to 439 decicycles. See 45c7f399.
-
Clément Bœsch authored
~560 → ~500 decicycles.
This follows the comments from Michael in https://ffmpeg.org/pipermail/ffmpeg-devel/2014-August/160599.html
Using two registers for the accumulator didn't help. On the other hand, some reordering between the movs and psadbw allowed going from ~538 to ~500.
-
- 09 Aug, 2014 1 commit
-
-
Michael Niedermayer authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 05 Aug, 2014 1 commit
-
-
Clément Bœsch authored
-
- 03 Aug, 2014 1 commit
-
-
James Almer authored
Up to four fewer instructions, depending on function and instruction set.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 26 Jul, 2014 1 commit
-
-
James Almer authored
Only 8-bit and 10-bit idct_dc() functions are included (adding others should be trivial).

Benchmarks on an Intel Core i5-4200U:

idct8x8_dc
         SSE2  MMXEXT  C
cycles   22    26      57

idct16x16_dc
         AVX2  SSE2    C
cycles   27    32      249

idct32x32_dc
         AVX2  SSE2    C
cycles   62    126     1375

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Mickaël Raulet <mraulet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 01 Jul, 2014 1 commit
-
-
Diego Biurrun authored
-
- 15 Jun, 2014 1 commit
-
-
Christophe Gisquet authored
Those macros take a byte count as the shift argument, since this argument differs between MMX and SSE2 instructions.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 08 Jun, 2014 3 commits
-
-
James Almer authored
It was lost during the port. Should fix fate on 3dnowext machines.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
James Almer authored
tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz
1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips
3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips
1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips
2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips
3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips
1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 29 May, 2014 1 commit
-
-
Christophe Gisquet authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 28 May, 2014 1 commit
-
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 07 May, 2014 1 commit
-
-
Matt Oliver authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 19 Apr, 2014 1 commit
-
-
James Almer authored
Use the xm# and ym# aliases as they remain in sync with m# after a SWAP. No actual changes to the assembly.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 17 Apr, 2014 1 commit
-
-
James Almer authored
Also port relevant AVX2/XOP optimizations from x264, with permission from the corresponding authors to relicense to LGPL.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 16 Apr, 2014 2 commits
-
-
James Almer authored
~6% faster SSE2 performance. AVX/FMA3 are unaffected.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
James Almer authored
The mova is unnecessary.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 25 Mar, 2014 1 commit
-
-
James Almer authored
AV_CPU_FLAG_AVX is enabled at this point only if there's OS support.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 18 Mar, 2014 1 commit
-
-
Matt Oliver authored
Automatically change MANGLE() into named inline asm operands when direct symbol references in inline asm are not supported. This is part of the patch set for Intel C inline asm support on Windows.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 13 Mar, 2014 1 commit
-
-
James Almer authored
~7% faster than AVX.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 09 Mar, 2014 1 commit
-
-
Michael Niedermayer authored
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 24 Feb, 2014 1 commit
-
-
James Almer authored
We need the emulation to support the cases where the first argument is the same as the fourth. To achieve this, a fifth argument working as a temporary may be needed. Emulation that doesn't obey the original instruction semantics can't be in x86inc.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 23 Feb, 2014 3 commits
-
-
James Almer authored
Based on x264 code.
Signed-off-by: James Almer <jamrial@gmail.com>
-
James Almer authored
Based on x264 code.
Signed-off-by: James Almer <jamrial@gmail.com>
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
-
- 22 Feb, 2014 2 commits
-
-
James Almer authored
Based on x264 code.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
James Almer authored
Based on x264 code.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-
- 20 Feb, 2014 1 commit
-
-
Christophe Gisquet authored
vector_fmul and vector_fmac_scalar are guaranteed to be able to process data in batches of 16 elements, but their SSE versions only do 8 at a time. Therefore, unroll them a bit. 299 to 261c for 256 elements in vector_fmac_scalar on Arrandale/Win64.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
-
- 15 Feb, 2014 1 commit
-
-
Christophe Gisquet authored
vector_fmul and vector_fmac_scalar are guaranteed to be able to process data in batches of 16 elements, but their SSE versions only do 8 at a time. Therefore, unroll them a bit. 299 to 261c for 256 elements in vector_fmac_scalar on Arrandale/Win64.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
-