- 17 Dec, 2020 3 commits
-
-
Zhi An Ng authored
AVX has 3-operand shuffle/unpack operations. We currently always require that dst == src0, which is not necessary when AVX is available. For the arch shuffles that map to a single native instruction, add support to check for AVX in the instruction selector (so dst is no longer required to be the same as the first input) and in the code generator (to emit the AVX form). The other arch shuffles are slightly more complicated and can be optimized in a future change. Bug: v8:11270 Change-Id: I25b271aeff71fbe860d5bcc8abb17c36bcdab32c Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2591858 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71820}
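A rough standalone illustration of the 3-operand difference (compiler intrinsics rather than V8's code generator; the function and values are made up for the example): SSE unpcklps overwrites its first operand, so a distinct destination needs a copy first, while AVX vunpcklps can write a separate destination directly.
```
// Compile with -msse2 or -mavx to compare the generated code.
#include <immintrin.h>
#include <cstdio>

__m128 interleave_low(__m128 a, __m128 b) {
  // SSE:  movaps dst, a ; unpcklps dst, b   (dst must start as a copy of a)
  // AVX:  vunpcklps dst, a, b               (dst need not equal a)
  return _mm_unpacklo_ps(a, b);
}

int main() {
  __m128 a = _mm_setr_ps(0.f, 1.f, 2.f, 3.f);
  __m128 b = _mm_setr_ps(4.f, 5.f, 6.f, 7.f);
  float out[4];
  _mm_storeu_ps(out, interleave_low(a, b));
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 0 4 1 5
  return 0;
}
```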
-
Zhi An Ng authored
A previous improvement to generic shuffles (https://crrev.com/c/2152853) required a temporary SIMD register to hold the mask, rather than pushing it onto the stack. The temporary register requires that we UseUniqueRegister on the inputs to prevent aliasing, since we write to the temp. However, we only need this for the generic shuffle; we accidentally over-constrained all other pattern-matched shuffles, which don't use any temps. On a ~2000-line function containing ~150 shuffles (not all of which are generic shuffles), we get 16 fewer instructions in the native code, and actually see a very small improvement in the overall benchmarks. Bug: v8:11270 Change-Id: I09974f7615e4b8f5e2416ed17ca47cc7613fd6b1 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2591857 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71818}
-
Zhi An Ng authored
We can optimize this instruction further. The alternatives leave some junk in the top lanes of dst, but that doesn't matter:
- when the lane is 1: use movshdup (4 bytes)
- when the lane is 2: use movhlps (3 bytes)
- otherwise: use shufps (4 bytes) or pshufd (5 bytes)
All of these are shorter than insertps (6 bytes). Change-Id: I0e524431d1832e297e8c8bb418d42382d93fa691 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2591850 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71813}
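A standalone sketch of those per-lane choices, assuming the instruction being optimized is a 32-bit float lane extract (the entry above does not name it); intrinsics stand in for the generated code, and the extract_lane helper is hypothetical. The junk left in the upper lanes is irrelevant because only lane 0 is read back.
```
#include <immintrin.h>
#include <cstdio>

float extract_lane(__m128 v, int lane) {
  __m128 t;
  switch (lane) {
    case 0: t = v; break;
    case 1: t = _mm_movehdup_ps(v); break;            // movshdup: lane 1 -> lane 0
    case 2: t = _mm_movehl_ps(v, v); break;           // movhlps: lane 2 -> lane 0
    default: t = _mm_shuffle_ps(v, v, 0b11); break;   // shufps: lane 3 -> lane 0
  }
  return _mm_cvtss_f32(t);
}

int main() {
  __m128 v = _mm_setr_ps(10.f, 11.f, 12.f, 13.f);
  for (int i = 0; i < 4; i++) std::printf("%g ", extract_lane(v, i));  // 10 11 12 13
  std::printf("\n");
  return 0;
}
```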
-
- 16 Dec, 2020 4 commits
-
-
Ross McIlroy authored
This is a reland of b2a611d8 Original change's description: > [Turboprop] Move dynamic check maps immediate args to deopt exit. > > Rather than loading the immediate arguments required by the > dynamic check maps builtin into registers in the fast-path, > instead insert them into the instruction stream in the deopt > exit and have the builtin load them into registers itself. > > BUG=v8:10582 > > Change-Id: I66716570b408501374eed8f5e6432df64c6deb7c > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589736 > Commit-Queue: Ross McIlroy <rmcilroy@chromium.org> > Reviewed-by: Sathya Gunasekaran <gsathya@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Cr-Commit-Position: refs/heads/master@{#71790} TBR=tebbi@chromium.org,gsathya@chromium.org Bug: v8:10582 Change-Id: Ieda0295ee135bff983c67c3f04bb47115f0a2739 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2595311Reviewed-by: Ross McIlroy <rmcilroy@chromium.org> Commit-Queue: Ross McIlroy <rmcilroy@chromium.org> Cr-Commit-Position: refs/heads/master@{#71803}
-
Clemens Backes authored
This reverts commit b2a611d8. Reason for revert: Several failures on https://ci.chromium.org/ui/p/v8/builders/ci/V8%20Linux%20-%20arm64%20-%20sim%20-%20CFI/3743/overview Original change's description: > [Turboprop] Move dynamic check maps immediate args to deopt exit. > > Rather than loading the immediate arguments required by the > dynamic check maps builtin into registers in the fast-path, > instead insert them into the instruction stream in the deopt > exit and have the builtin load them into registers itself. > > BUG=v8:10582 > > Change-Id: I66716570b408501374eed8f5e6432df64c6deb7c > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589736 > Commit-Queue: Ross McIlroy <rmcilroy@chromium.org> > Reviewed-by: Sathya Gunasekaran <gsathya@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Cr-Commit-Position: refs/heads/master@{#71790} TBR=rmcilroy@chromium.org,gsathya@chromium.org,tebbi@chromium.org Change-Id: I4c56bee156ffcea8de0aeaff9ac1bf03e03134c9 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: v8:10582 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2595308Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Clemens Backes <clemensb@chromium.org> Cr-Commit-Position: refs/heads/master@{#71793}
-
Ross McIlroy authored
Rather than loading the immediate arguments required by the dynamic check maps builtin into registers in the fast-path, instead insert them into the instruction stream in the deopt exit and have the builtin load them into registers itself. BUG=v8:10582 Change-Id: I66716570b408501374eed8f5e6432df64c6deb7c Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589736 Commit-Queue: Ross McIlroy <rmcilroy@chromium.org> Reviewed-by: Sathya Gunasekaran <gsathya@chromium.org> Reviewed-by: Tobias Tebbi <tebbi@chromium.org> Cr-Commit-Position: refs/heads/master@{#71790}
-
Zhi An Ng authored
The definition of Shufps is wrong: we are incorrectly passing 0 as the immediate in all cases. No tests broke because we only used Shufps for splats, which use imm8 == 0 anyway. Also, it was using movss, which only moves a single 32-bit value. Because we were using it only for f32x4 splat, this ended up being enough (imm8 == 0 meant that we only shuffled the low 32 bits). This is fixed to use movaps, which moves the entire 128-bit register. Also tweak the definition of Shufps to take 4 arguments. `vshufps dst, src1, src2, imm8` shuffles src1 and src2 into dst; `shufps dst, src, imm8` shuffles dst and src into dst. So `Shufps(dst, src, imm8)` is ambiguous in the AVX case; it could be:
1. vshufps(dst, src, src, imm8), or
2. vshufps(dst, dst, src, imm8)
Option 2 is more likely to be the intended behavior, but it introduces a false dependency on the value of dst. With `Shufps(dst, src1, src2, imm8)`, it is clearer what the behavior should be: shufps(dst, src2, imm8) matches the AVX behavior iff dst == src1. Change-Id: I60dc4ec868023d28d00f2b09d2c53b82a729bc4d Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2591849 Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71775}
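A small self-contained intrinsics example of the shufps lane-selection rule (the values are arbitrary): the two low result lanes come from the first operand and the two high lanes from the second, which is why a two-argument Shufps is ambiguous whenever dst != src.
```
#include <immintrin.h>
#include <cstdio>

int main() {
  __m128 src1 = _mm_setr_ps(0.f, 1.f, 2.f, 3.f);
  __m128 src2 = _mm_setr_ps(4.f, 5.f, 6.f, 7.f);
  // vshufps dst, src1, src2, 0 produces [src1[0], src1[0], src2[0], src2[0]],
  // so the result depends on *both* sources, not just one of them.
  __m128 three_operand = _mm_shuffle_ps(src1, src2, 0);
  float out[4];
  _mm_storeu_ps(out, three_operand);
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 0 0 4 4
  return 0;
}
```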
-
- 15 Dec, 2020 1 commit
-
-
Zhi An Ng authored
Code like x = wasm_v32x4_shuffle(x, x, 1, 2, 3, 0); is currently matched by S8x16Concat, which lowers to two instructions:
movapd xmm_dst, xmm_src
palignr xmm_dst, xmm_src, 0x4
There is a special case after a S8x16Concat is matched:
- is_swizzle, i.e. the inputs are the same
- it is a 32x4 shuffle (offset % 4 == 0)
which can have better codegen:
- (dst == src) shufps dst, src, 0b00111001
- (dst != src) pshufd dst, src, 0b00111001
Add a new simd shuffle matcher which will match a 32x4 rotate and construct the appropriate indices referring to the 32x4 elements (pshufd for the given example). However, this matching happens after S8x16Concat, so we get the palignr first. We could move the pattern-matching cases around, but that would lead to some cases where it would have matched a S8x16Concat but now matches a S32x4shuffle instead, leading to worse codegen. Note: we also pattern match on 32x4Swizzle, which correctly generates
Change-Id: Ie3aca53bbc06826be2cf49632de4c24ec73d0a9a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589062 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71754}
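A quick standalone check (intrinsics, made-up values) that mask 0b00111001 really is the 32x4 rotate-by-one matching the shuffle indices (1, 2, 3, 0) in the example above:
```
#include <immintrin.h>
#include <cstdio>

int main() {
  __m128i x = _mm_setr_epi32(100, 101, 102, 103);
  // pshufd dst, src, 0x39: dst = [src[1], src[2], src[3], src[0]]
  __m128i rotated = _mm_shuffle_epi32(x, 0b00111001);
  int out[4];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rotated);
  std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 101 102 103 100
  return 0;
}
```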
-
- 14 Dec, 2020 2 commits
-
-
Zhi An Ng authored
pextrq + movq crosses register files twice, which is not efficient. Optimize this by:
- for lane 0, doing nothing if dst == src (macro-assembler helper)
- using vmovhlps on AVX, with src as both operands to avoid a false dependency on dst
- using movhlps otherwise; this is shorter than shufpd and faster on older systems
Change-Id: I3486d87224c048b3229c2f92359b8b8e6d5fd025 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589056 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71751}
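A sketch of the lane-1 path with intrinsics, assuming the instruction is a 64x2 lane extract (the entry above does not name it); the extract_lane_1 helper is hypothetical, not V8's macro-assembler code. The high quadword is moved down with movhlps and then read from lane 0, so the register files are crossed only once.
```
#include <immintrin.h>
#include <cstdio>
#include <cstdint>

int64_t extract_lane_1(__m128i v) {
  __m128 f = _mm_castsi128_ps(v);
  __m128 hi = _mm_movehl_ps(f, f);                  // movhlps: high qword -> low qword
  return _mm_cvtsi128_si64(_mm_castps_si128(hi));   // movq gp, xmm (lane 0)
}

int main() {
  __m128i v = _mm_set_epi64x(INT64_C(0x1111222233334444),   // lane 1
                             INT64_C(0x5555666677778888));  // lane 0
  std::printf("%llx\n", static_cast<unsigned long long>(extract_lane_1(v)));
  return 0;  // prints 1111222233334444
}
```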
-
Zhi An Ng authored
Change the codegen for f32x4.extract_lane from shufps to insertps when AVX is supported. They have the same performance, but shufps has a false dependency on dst (it shuffles dst and src, but we don't care about dst at all). Also for SSE, extractps + movd crosses register files, so change it to use insertps as well. Change-Id: Idf45849d37ac3499bf3371ba2fa6ae05829aa8a7 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2589048 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71747}
-
- 10 Dec, 2020 3 commits
-
-
Bill Budge authored
This reverts commit cddaf66c. Reason for revert: Multiple fuzzer failures TBR=neis@chromium.org,ahaas@chromium.org Original change's description: > [compiler][wasm] Align Frame slots to value size > > - Adds an AlignedSlotAllocator class and tests, to unify slot > allocation. This attempts to use alignment holes for smaller > values. > - Reworks Frame to use the new allocator for stack slots. > - Reworks LinkageAllocator to use the new allocator for stack > slots and for ARMv7 FP register aliasing. > - Fixes the RegisterAllocator to align spill slots. > - Fixes InstructionSelector to align spill slots. > > Bug: v8:9198 > > Change-Id: Ida148db428be89ef95de748ec5fc0e7b0358f523 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2512840 > Commit-Queue: Bill Budge <bbudge@chromium.org> > Reviewed-by: Georg Neis <neis@chromium.org> > Reviewed-by: Andreas Haas <ahaas@chromium.org> > Cr-Commit-Position: refs/heads/master@{#71644} TBR=bbudge@chromium.org,neis@chromium.org,ahaas@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: v8:9198 Change-Id: Ib26d016df6f30f333d30b5ac14eed9630bba8252 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2584200 Commit-Queue: Bill Budge <bbudge@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71703}
-
Zhi An Ng authored
Add new macro-assembler instructions that can handle both AVX and SSE. In the SSE case the helper checks that dst == src1. (This is different from what the AvxHelper does, which passes dst as the first operand to AVX instructions.) Sorted SSSE3_INSTRUCTION_LIST by instruction code. Header additions were suggested by clangd; we were already using something from those headers via transitive includes, and adding them explicitly gets us closer to IWYU. Codegen sequences are from https://github.com/WebAssembly/simd/pull/380 and also https://github.com/WebAssembly/simd/pull/380#issuecomment-707440671. Bug: v8:11086 Change-Id: I4c04f836e471ed8b00f9ff1a1b2e6348a593d4de Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2578797 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71688}
-
Zhi An Ng authored
Bug: v8:11008 Change-Id: Ic72e71eb10a5b47c97467bf6d25e55d20425273a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2575784 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71686}
-
- 07 Dec, 2020 1 commit
-
-
Bill Budge authored
- Adds an AlignedSlotAllocator class and tests, to unify slot allocation. This attempts to use alignment holes for smaller values.
- Reworks Frame to use the new allocator for stack slots.
- Reworks LinkageAllocator to use the new allocator for stack slots and for ARMv7 FP register aliasing.
- Fixes the RegisterAllocator to align spill slots.
- Fixes InstructionSelector to align spill slots.
Bug: v8:9198 Change-Id: Ida148db428be89ef95de748ec5fc0e7b0358f523 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2512840 Commit-Queue: Bill Budge <bbudge@chromium.org> Reviewed-by: Georg Neis <neis@chromium.org> Reviewed-by: Andreas Haas <ahaas@chromium.org> Cr-Commit-Position: refs/heads/master@{#71644}
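A minimal sketch of the aligned-slot idea, assuming slot-granularity allocation with power-of-two widths; this is an illustration of the technique, not V8's actual AlignedSlotAllocator. Wide values are aligned to their own width, and the padding slots this creates are remembered so later narrow values can fill the holes.
```
#include <cstdio>
#include <vector>

class SlotAllocator {
 public:
  // width is in slots and must be a power of two (1, 2, 4, ...).
  int Allocate(int width) {
    // First try to reuse a recorded hole that is big enough and aligned.
    for (size_t i = 0; i < holes_.size(); ++i) {
      if (holes_[i].size >= width && holes_[i].start % width == 0) {
        int slot = holes_[i].start;
        holes_[i].start += width;
        holes_[i].size -= width;
        return slot;
      }
    }
    // Otherwise align the frontier, recording any skipped slots as a hole.
    int aligned = (next_ + width - 1) & ~(width - 1);
    if (aligned != next_) holes_.push_back({next_, aligned - next_});
    next_ = aligned + width;
    return aligned;
  }
  int Size() const { return next_; }

 private:
  struct Hole { int start; int size; };
  std::vector<Hole> holes_;
  int next_ = 0;
};

int main() {
  SlotAllocator a;
  std::printf("%d ", a.Allocate(1));   // 0
  std::printf("%d ", a.Allocate(2));   // 2 (slot 1 becomes a hole)
  std::printf("%d ", a.Allocate(1));   // 1 (hole reused)
  std::printf("%d\n", a.Allocate(2));  // 4
  return 0;
}
```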
-
- 03 Dec, 2020 1 commit
-
-
Zhi An Ng authored
Movddup can take a memory operand, so we can save a move from a gp reg to an xmm reg in that case. Unaligned memory is not a problem since we are loading 64 bits (not 128 bits). Also a drive-by comment on i32x4.splat: it uses pshufd, which can also take a memory operand (saving a mov), but we need aligned memory for that first. Bug: v8:9198 Change-Id: I55969888db1debb6ed4d193f767589d0da598386 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2567538 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#71580}
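A standalone intrinsics sketch of the movddup point (the value is made up): an f64x2 splat can load its 64-bit source straight from memory, with no gp-to-xmm move, and the 64-bit load has no alignment requirement.
```
#include <immintrin.h>
#include <cstdio>

int main() {
  double d = 3.5;
  __m128d splat = _mm_loaddup_pd(&d);  // movddup xmm, qword ptr [mem]
  double out[2];
  _mm_storeu_pd(out, splat);
  std::printf("%g %g\n", out[0], out[1]);  // 3.5 3.5
  return 0;
}
```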
-
- 01 Dec, 2020 1 commit
-
-
Bill Budge authored
- Uses linkage location information to keep in sync with how LinkageAllocator and Frame work to assign stack slots. Bug: v8:9198 Change-Id: I299038e4cff706355263f00603ba32515449fefe Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2556259 Reviewed-by: Maya Lekova <mslekova@chromium.org> Reviewed-by: Andreas Haas <ahaas@chromium.org> Reviewed-by: Thibaud Michaud <thibaudm@chromium.org> Commit-Queue: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71532}
-
- 17 Nov, 2020 1 commit
-
-
John Xu authored
Bug: v8:10927 Change-Id: Icbdc0d7329ddd466e7d67a954246a35795b4dece Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2507310 Commit-Queue: Ulan Degenbaev <ulan@chromium.org> Reviewed-by: Peter Marshall <petermarshall@chromium.org> Reviewed-by: Michael Lippautz <mlippautz@chromium.org> Reviewed-by: Clemens Backes <clemensb@chromium.org> Reviewed-by: Toon Verwaest <verwaest@chromium.org> Reviewed-by: Ulan Degenbaev <ulan@chromium.org> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#71220}
-
- 12 Nov, 2020 1 commit
-
-
Pierre Langlois authored
FLAG_disable_write_barriers is a constexpr so the V8_LIKELY macro isn't necessary. Interestingly, it can also cause clang to warn that the code is unreachable, whereas without `__builtin_expect()` the compiler doesn't mind. See for example:
```
constexpr bool kNo = false;

void warns() {
  if (__builtin_expect(kNo, 0)) {
    int a = 42;
  }
}

void does_not_warn() {
  if (kNo) {
    int a = 42;
  }
}
```
Compiling V8 for arm64 with both `v8_disable_write_barriers = true` and `v8_enable_pointer_compression = false` would trigger this warning. Bug: v8:9533 Change-Id: Id2ae156d60217007bb9ebf50628e8908e0193d05 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2534811 Reviewed-by: Ulan Degenbaev <ulan@chromium.org> Reviewed-by: Georg Neis <neis@chromium.org> Commit-Queue: Pierre Langlois <pierre.langlois@arm.com> Cr-Commit-Position: refs/heads/master@{#71157}
-
- 05 Nov, 2020 1 commit
-
-
Zhi An Ng authored
Integer splats (especially for sizes < 32 bits) do not directly translate to a single instruction on x64. We can do better for special values, like 0, which can be lowered to `xor dst, dst`. We do this check in the instruction selector and emit a special opcode, kX64S128Zero. Also change the xor operation for kX64S128Zero from xorps to pxor. This can help reduce any potential data bypass delay (see Agner Fog's microarchitecture manual for details). Since integer splats are likely to be followed by integer ops, we should remain in the integer domain, thus use pxor. For i64x2.splat the codegen goes from:
xorl rdi,rdi
vmovq xmm0,rdi
vmovddup xmm0,xmm0
to:
vpxor xmm0,xmm0,xmm0
Also add a unittest to verify this optimization, and the necessary raw-assembler methods for the test. Bug: v8:11093 Change-Id: I26b092032b6e672f1d5d26e35d79578ebe591cfe Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2516299 Reviewed-by: Tobias Tebbi <tebbi@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70977}
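A standalone sketch of the zero case (intrinsics, not V8 codegen): a constant-zero vector needs no gp register or splat sequence at all, just an xor of the destination with itself; _mm_setzero_si128 typically compiles to pxor (vpxor under AVX), which keeps the value in the integer domain.
```
#include <immintrin.h>
#include <cstdio>
#include <cstdint>

int main() {
  __m128i zero = _mm_setzero_si128();  // pxor xmm, xmm / vpxor xmm, xmm, xmm
  int64_t out[2];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), zero);
  std::printf("%lld %lld\n", static_cast<long long>(out[0]),
              static_cast<long long>(out[1]));  // 0 0
  return 0;
}
```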
-
- 04 Nov, 2020 2 commits
-
-
Clemens Backes authored
This reverts commit 3c4e434f. Reason for revert: Fails noavx tests: https://ci.chromium.org/p/v8/builders/ci/V8%20Linux64%20-%20debug/34613 Original change's description: > [wasm-simd][x64] Optimize pmin/pmax and add horiz for AVX > > The AVX versions of these instructions can take 3 operands, so we don't > need to force dst == src. > > Bug: v8:9561 > Change-Id: If346a05f7d599bf0d636263cafc3bc823c3b8452 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2515337 > Reviewed-by: Clemens Backes <clemensb@chromium.org> > Commit-Queue: Zhi An Ng <zhin@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70958} TBR=clemensb@chromium.org,zhin@chromium.org Change-Id: I5fcdd2e51d418cb32a1b1e2bec7c0dff19f29154 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: v8:9561 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2519558Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Clemens Backes <clemensb@chromium.org> Cr-Commit-Position: refs/heads/master@{#70961}
-
Zhi An Ng authored
The AVX versions of these instructions can take 3 operands, so we don't need to force dst == src. Bug: v8:9561 Change-Id: If346a05f7d599bf0d636263cafc3bc823c3b8452 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2515337 Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70958}
-
- 02 Nov, 2020 1 commit
-
-
Zhi An Ng authored
Extract Shufps to handle both AVX and SSE cases; in the SSE case it will copy src to dst if they are not the same. This allows us to use it in Liftoff as well, without the extra copy when AVX is supported. In other places the use of Shufps is unnecessary, since those call sites are within a clause checking for non-AVX support, so we can simply use shufps (the non-macro-assembler version). Bug: v8:9561 Change-Id: Icb043d7a43397c1b0810ece2666be567f0f5986c Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2513866 Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70911}
-
- 30 Oct, 2020 1 commit
-
-
Zhi An Ng authored
These operations can be moved into an existing macro list, since they are simple operations that generate only one instruction. The benefit is that they gain support for the 3-operand AVX forms and do not have to force dst to equal src. Bug: v8:9561 Change-Id: I9ec1d2496d14cb9f0fb3b4854ca39887eb5bf49b Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2505240 Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70893}
-
- 29 Oct, 2020 2 commits
-
-
Zhi An Ng authored
On AVX, many instructions can have 3 operands, unlike SSE, which only allows 2. So on SSE we use DefineSameAsFirst on the dst, but on AVX, using that causes some unnecessary moves. This change moves a bunch of instructions whose codegen is a single instruction into a macro list that supports this non-restricted AVX codegen. Bug: v8:9561 Change-Id: I348a8396e8a1129daf2e1ed08ae8526e1bc3a73b Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2505254 Reviewed-by: Clemens Backes <clemensb@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70888}
-
Zhi An Ng authored
This is a reland of 3fb07882 Original change's description: > [wasm-simd][ia32][x64] Only use registers for shuffles > > Shuffles have pattern matching clauses which, depending on the > instruction used, can require src0 or src1 to be register or not. > However we do not have 16-byte alignment for SIMD operands yet, so it > will segfault when we use an SSE SIMD instruction with unaligned > operands. > > This patch fixes all the shuffle cases to always use a register for the > input nodes, and it does so by ignoring the values of src0_needs_reg and > src1_needs_reg. When we eventually have memory alignment, we can > re-enable this check, without mucking around too much in the logic in > each shuffle match clause. > > Bug: v8:9198 > Change-Id: I264e136f017353019f19954c62c88206f7b90656 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2504849 > Reviewed-by: Andreas Haas <ahaas@chromium.org> > Reviewed-by: Adam Klein <adamk@chromium.org> > Commit-Queue: Adam Klein <adamk@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70848} Bug: v8:9198 Change-Id: I40c6c8f0cd8908a2d6ab7016d8ed4d4fb2ab4114 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2505250Reviewed-by: Adam Klein <adamk@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70862}
-
- 28 Oct, 2020 4 commits
-
-
Francis McCabe authored
This reverts commit 3fb07882. Reason for revert: failing noavx tests: https://ci.chromium.org/p/v8/builders/ci/V8%20Linux/39390? Original change's description: > [wasm-simd][ia32][x64] Only use registers for shuffles > > Shuffles have pattern matching clauses which, depending on the > instruction used, can require src0 or src1 to be register or not. > However we do not have 16-byte alignment for SIMD operands yet, so it > will segfault when we use an SSE SIMD instruction with unaligned > operands. > > This patch fixes all the shuffle cases to always use a register for the > input nodes, and it does so by ignoring the values of src0_needs_reg and > src1_needs_reg. When we eventually have memory alignment, we can > re-enable this check, without mucking around too much in the logic in > each shuffle match clause. > > Bug: v8:9198 > Change-Id: I264e136f017353019f19954c62c88206f7b90656 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2504849 > Reviewed-by: Andreas Haas <ahaas@chromium.org> > Reviewed-by: Adam Klein <adamk@chromium.org> > Commit-Queue: Adam Klein <adamk@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70848} TBR=adamk@chromium.org,ahaas@chromium.org,zhin@chromium.org Change-Id: Icc7cc1ceb7ca5aa5d859239330743dde2e5f213c No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: v8:9198 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2505719Reviewed-by: Francis McCabe <fgm@chromium.org> Commit-Queue: Francis McCabe <fgm@chromium.org> Cr-Commit-Position: refs/heads/master@{#70852}
-
Shu-yu Guo authored
Change-Id: I4ab54dac771bb551c2435a98f9e53194a6f27853 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2495494 Commit-Queue: Shu-yu Guo <syg@chromium.org> Reviewed-by: Georg Neis <neis@chromium.org> Reviewed-by: Tobias Tebbi <tebbi@chromium.org> Cr-Commit-Position: refs/heads/master@{#70851}
-
Zhi An Ng authored
Shuffles have pattern matching clauses which, depending on the instruction used, can require src0 or src1 to be register or not. However we do not have 16-byte alignment for SIMD operands yet, so it will segfault when we use an SSE SIMD instruction with unaligned operands. This patch fixes all the shuffle cases to always use a register for the input nodes, and it does so by ignoring the values of src0_needs_reg and src1_needs_reg. When we eventually have memory alignment, we can re-enable this check, without mucking around too much in the logic in each shuffle match clause. Bug: v8:9198 Change-Id: I264e136f017353019f19954c62c88206f7b90656 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2504849Reviewed-by: Andreas Haas <ahaas@chromium.org> Reviewed-by: Adam Klein <adamk@chromium.org> Commit-Queue: Adam Klein <adamk@chromium.org> Cr-Commit-Position: refs/heads/master@{#70848}
-
Zhi An Ng authored
Prototype i8x16, i16x8, i32x4, i64x2 sign select on x64 and interpreter. Bug: v8:10983 Change-Id: I7d6f39a2cb4c2aefe31daac782978fe8b363dd1a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2486235 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Tobias Tebbi <tebbi@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#70818}
-
- 27 Oct, 2020 1 commit
-
-
Ng Zhi An authored
i8x16.extract_lane_u is pextrb and i16x8.extract_lane_u is pextrw; we can merge them instead of having separate opcodes. R=bbudge@chromium.org Bug: v8:10975 Change-Id: I7793a795905157b6094b1470d3437988c982af91 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2481834 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70771}
-
- 22 Oct, 2020 1 commit
-
-
Georg Neis authored
This reverts half of commit 8f0ab471. Reason for revert: some performance regressions, possibly due to 'leave' needing MSROM on some microarchitectures. The half that is not reverted is the removal of 'enter'. Original change's description: > [ia32,x64] Make more use of the 'leave' instruction > > It is a little shorter and cheaper[1] than the equivalent > "mov sp,bp; pop bp". > > Also remove support for the 'enter' instruction, since > - it is unused, > - it is neither shorter nor cheaper than the corresponding > push and mov (in fact more expensive[1]), and > - our disassembler doesn't support it. > > [1] See https://www.agner.org/optimize/instruction_tables.pdf > > Change-Id: I6c99c2f3e53081aea55445a54e18eaf45baa79c2 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2482822 > Commit-Queue: Georg Neis <neis@chromium.org> > Reviewed-by: Victor Gomes <victorgomes@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70660} TBR=neis@chromium.org,victorgomes@chromium.org Bug: chromium:1141069 # Not skipping CQ checks because original CL landed > 1 day ago. Change-Id: I5c9ad64ee06b71c93eff256044ce49d1523737fb Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2492327 Commit-Queue: Georg Neis <neis@chromium.org> Reviewed-by: Victor Gomes <victorgomes@chromium.org> Reviewed-by: Georg Neis <neis@chromium.org> Cr-Commit-Position: refs/heads/master@{#70718}
-
- 21 Oct, 2020 1 commit
-
-
Jakob Gruber authored
This is a reland of fbfa9bf4 The arm64 was missing proper codegen for CFI, thus sizes were off. Original change's description: > Reland "[deoptimizer] Change deopt entries into builtins" > > This is a reland of 7f58ced7 > > It fixes the different exit size emitted on x64/Atom CPUs due to > performance tuning in TurboAssembler::Call. Additionally, add > cctests to verify the fixed size exits. > > Original change's description: > > [deoptimizer] Change deopt entries into builtins > > > > While the overall goal of this commit is to change deoptimization > > entries into builtins, there are multiple related things happening: > > > > - Deoptimization entries, formerly stubs (i.e. Code objects generated > > at runtime, guaranteed to be immovable), have been converted into > > builtins. The major restriction is that we now need to preserve the > > kRootRegister, which was formerly used on most architectures to pass > > the deoptimization id. The solution differs based on platform. > > - Renamed DEOPT_ENTRIES_OR_FOR_TESTING code kind to FOR_TESTING. > > - Removed heap/ support for immovable Code generation. > > - Removed the DeserializerData class (no longer needed). > > - arm64: to preserve 4-byte deopt exits, introduced a new optimization > > in which the final jump to the deoptimization entry is generated > > once per Code object, and deopt exits can continue to emit a > > near-call. > > - arm,ia32,x64: change to fixed-size deopt exits. This reduces exit > > sizes by 4/8, 5, and 5 bytes, respectively. > > > > On arm the deopt exit size is reduced from 12 (or 16) bytes to 8 bytes > > by using the same strategy as on arm64 (recalc deopt id from return > > address). Before: > > > > e300a002 movw r10, <id> > > e59fc024 ldr ip, [pc, <entry offset>] > > e12fff3c blx ip > > > > After: > > > > e59acb35 ldr ip, [r10, <entry offset>] > > e12fff3c blx ip > > > > On arm64 the deopt exit size remains 4 bytes (or 8 bytes in same cases > > with CFI). Additionally, up to 4 builtin jumps are emitted per Code > > object (max 32 bytes added overhead per Code object). Before: > > > > 9401cdae bl <entry offset> > > > > After: > > > > # eager deoptimization entry jump. > > f95b1f50 ldr x16, [x26, <eager entry offset>] > > d61f0200 br x16 > > # lazy deoptimization entry jump. > > f95b2b50 ldr x16, [x26, <lazy entry offset>] > > d61f0200 br x16 > > # the deopt exit. > > 97fffffc bl <eager deoptimization entry jump offset> > > > > On ia32 the deopt exit size is reduced from 10 to 5 bytes. Before: > > > > bb00000000 mov ebx,<id> > > e825f5372b call <entry> > > > > After: > > > > e8ea2256ba call <entry> > > > > On x64 the deopt exit size is reduced from 12 to 7 bytes. 
Before: > > > > 49c7c511000000 REX.W movq r13,<id> > > e8ea2f0700 call <entry> > > > > After: > > > > 41ff9560360000 call [r13+<entry offset>] > > > > Bug: v8:8661,v8:8768 > > Change-Id: I13e30aedc360474dc818fecc528ce87c3bfeed42 > > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2465834 > > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > > Reviewed-by: Ross McIlroy <rmcilroy@chromium.org> > > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > > Reviewed-by: Ulan Degenbaev <ulan@chromium.org> > > Cr-Commit-Position: refs/heads/master@{#70597} > > Tbr: ulan@chromium.org, tebbi@chromium.org, rmcilroy@chromium.org > Bug: v8:8661,v8:8768,chromium:1140165 > Change-Id: Ibcd5c39c58a70bf2b2ac221aa375fc68d495e144 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2485506 > Reviewed-by: Jakob Gruber <jgruber@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70655} Tbr: ulan@chromium.org, tebbi@chromium.org, rmcilroy@chromium.org Bug: v8:8661 Bug: v8:8768 Bug: chromium:1140165 Change-Id: I471cc94fc085e527dc9bfb5a84b96bd907c2333f Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2488682Reviewed-by: Jakob Gruber <jgruber@chromium.org> Commit-Queue: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70672}
-
- 20 Oct, 2020 4 commits
-
-
Georg Neis authored
It is a little shorter and cheaper[1] than the equivalent "mov sp,bp; pop bp". Also remove support for the 'enter' instruction, since - it is unused, - it is neither shorter nor cheaper than the corresponding push and mov (in fact more expensive[1]), and - our disassembler doesn't support it. [1] See https://www.agner.org/optimize/instruction_tables.pdf Change-Id: I6c99c2f3e53081aea55445a54e18eaf45baa79c2 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2482822 Commit-Queue: Georg Neis <neis@chromium.org> Reviewed-by: Victor Gomes <victorgomes@chromium.org> Cr-Commit-Position: refs/heads/master@{#70660}
-
Maya Lekova authored
This reverts commit fbfa9bf4. Reason for revert: Seems to break arm64 sim CFI build (please see DeoptExitSizeIfFixed) - https://ci.chromium.org/p/v8/builders/ci/V8%20Linux%20-%20arm64%20-%20sim%20-%20CFI/2808 Original change's description: > Reland "[deoptimizer] Change deopt entries into builtins" > > This is a reland of 7f58ced7 > > It fixes the different exit size emitted on x64/Atom CPUs due to > performance tuning in TurboAssembler::Call. Additionally, add > cctests to verify the fixed size exits. > > Original change's description: > > [deoptimizer] Change deopt entries into builtins > > > > While the overall goal of this commit is to change deoptimization > > entries into builtins, there are multiple related things happening: > > > > - Deoptimization entries, formerly stubs (i.e. Code objects generated > > at runtime, guaranteed to be immovable), have been converted into > > builtins. The major restriction is that we now need to preserve the > > kRootRegister, which was formerly used on most architectures to pass > > the deoptimization id. The solution differs based on platform. > > - Renamed DEOPT_ENTRIES_OR_FOR_TESTING code kind to FOR_TESTING. > > - Removed heap/ support for immovable Code generation. > > - Removed the DeserializerData class (no longer needed). > > - arm64: to preserve 4-byte deopt exits, introduced a new optimization > > in which the final jump to the deoptimization entry is generated > > once per Code object, and deopt exits can continue to emit a > > near-call. > > - arm,ia32,x64: change to fixed-size deopt exits. This reduces exit > > sizes by 4/8, 5, and 5 bytes, respectively. > > > > On arm the deopt exit size is reduced from 12 (or 16) bytes to 8 bytes > > by using the same strategy as on arm64 (recalc deopt id from return > > address). Before: > > > > e300a002 movw r10, <id> > > e59fc024 ldr ip, [pc, <entry offset>] > > e12fff3c blx ip > > > > After: > > > > e59acb35 ldr ip, [r10, <entry offset>] > > e12fff3c blx ip > > > > On arm64 the deopt exit size remains 4 bytes (or 8 bytes in same cases > > with CFI). Additionally, up to 4 builtin jumps are emitted per Code > > object (max 32 bytes added overhead per Code object). Before: > > > > 9401cdae bl <entry offset> > > > > After: > > > > # eager deoptimization entry jump. > > f95b1f50 ldr x16, [x26, <eager entry offset>] > > d61f0200 br x16 > > # lazy deoptimization entry jump. > > f95b2b50 ldr x16, [x26, <lazy entry offset>] > > d61f0200 br x16 > > # the deopt exit. > > 97fffffc bl <eager deoptimization entry jump offset> > > > > On ia32 the deopt exit size is reduced from 10 to 5 bytes. Before: > > > > bb00000000 mov ebx,<id> > > e825f5372b call <entry> > > > > After: > > > > e8ea2256ba call <entry> > > > > On x64 the deopt exit size is reduced from 12 to 7 bytes. 
Before: > > > > 49c7c511000000 REX.W movq r13,<id> > > e8ea2f0700 call <entry> > > > > After: > > > > 41ff9560360000 call [r13+<entry offset>] > > > > Bug: v8:8661,v8:8768 > > Change-Id: I13e30aedc360474dc818fecc528ce87c3bfeed42 > > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2465834 > > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > > Reviewed-by: Ross McIlroy <rmcilroy@chromium.org> > > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > > Reviewed-by: Ulan Degenbaev <ulan@chromium.org> > > Cr-Commit-Position: refs/heads/master@{#70597} > > Tbr: ulan@chromium.org, tebbi@chromium.org, rmcilroy@chromium.org > Bug: v8:8661,v8:8768,chromium:1140165 > Change-Id: Ibcd5c39c58a70bf2b2ac221aa375fc68d495e144 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2485506 > Reviewed-by: Jakob Gruber <jgruber@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70655} TBR=ulan@chromium.org,rmcilroy@chromium.org,jgruber@chromium.org,tebbi@chromium.org Change-Id: I4739a3475bfd8ee0cfbe4b9a20382f91a6ef1bf0 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: v8:8661 Bug: v8:8768 Bug: chromium:1140165 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2485223Reviewed-by: Maya Lekova <mslekova@chromium.org> Commit-Queue: Maya Lekova <mslekova@chromium.org> Cr-Commit-Position: refs/heads/master@{#70658}
-
Jakob Gruber authored
This is a reland of 7f58ced7 It fixes the different exit size emitted on x64/Atom CPUs due to performance tuning in TurboAssembler::Call. Additionally, add cctests to verify the fixed size exits. Original change's description: > [deoptimizer] Change deopt entries into builtins > > While the overall goal of this commit is to change deoptimization > entries into builtins, there are multiple related things happening: > > - Deoptimization entries, formerly stubs (i.e. Code objects generated > at runtime, guaranteed to be immovable), have been converted into > builtins. The major restriction is that we now need to preserve the > kRootRegister, which was formerly used on most architectures to pass > the deoptimization id. The solution differs based on platform. > - Renamed DEOPT_ENTRIES_OR_FOR_TESTING code kind to FOR_TESTING. > - Removed heap/ support for immovable Code generation. > - Removed the DeserializerData class (no longer needed). > - arm64: to preserve 4-byte deopt exits, introduced a new optimization > in which the final jump to the deoptimization entry is generated > once per Code object, and deopt exits can continue to emit a > near-call. > - arm,ia32,x64: change to fixed-size deopt exits. This reduces exit > sizes by 4/8, 5, and 5 bytes, respectively. > > On arm the deopt exit size is reduced from 12 (or 16) bytes to 8 bytes > by using the same strategy as on arm64 (recalc deopt id from return > address). Before: > > e300a002 movw r10, <id> > e59fc024 ldr ip, [pc, <entry offset>] > e12fff3c blx ip > > After: > > e59acb35 ldr ip, [r10, <entry offset>] > e12fff3c blx ip > > On arm64 the deopt exit size remains 4 bytes (or 8 bytes in same cases > with CFI). Additionally, up to 4 builtin jumps are emitted per Code > object (max 32 bytes added overhead per Code object). Before: > > 9401cdae bl <entry offset> > > After: > > # eager deoptimization entry jump. > f95b1f50 ldr x16, [x26, <eager entry offset>] > d61f0200 br x16 > # lazy deoptimization entry jump. > f95b2b50 ldr x16, [x26, <lazy entry offset>] > d61f0200 br x16 > # the deopt exit. > 97fffffc bl <eager deoptimization entry jump offset> > > On ia32 the deopt exit size is reduced from 10 to 5 bytes. Before: > > bb00000000 mov ebx,<id> > e825f5372b call <entry> > > After: > > e8ea2256ba call <entry> > > On x64 the deopt exit size is reduced from 12 to 7 bytes. Before: > > 49c7c511000000 REX.W movq r13,<id> > e8ea2f0700 call <entry> > > After: > > 41ff9560360000 call [r13+<entry offset>] > > Bug: v8:8661,v8:8768 > Change-Id: I13e30aedc360474dc818fecc528ce87c3bfeed42 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2465834 > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > Reviewed-by: Ross McIlroy <rmcilroy@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Reviewed-by: Ulan Degenbaev <ulan@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70597} Tbr: ulan@chromium.org, tebbi@chromium.org, rmcilroy@chromium.org Bug: v8:8661,v8:8768,chromium:1140165 Change-Id: Ibcd5c39c58a70bf2b2ac221aa375fc68d495e144 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2485506Reviewed-by: Jakob Gruber <jgruber@chromium.org> Reviewed-by: Tobias Tebbi <tebbi@chromium.org> Commit-Queue: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70655}
-
Jakob Gruber authored
This reverts commit 7f58ced7. Reason for revert: Segfaults on Atom_x64 https://ci.chromium.org/p/v8-internal/builders/ci/v8_linux64_atom_perf/5686? Original change's description: > [deoptimizer] Change deopt entries into builtins > > While the overall goal of this commit is to change deoptimization > entries into builtins, there are multiple related things happening: > > - Deoptimization entries, formerly stubs (i.e. Code objects generated > at runtime, guaranteed to be immovable), have been converted into > builtins. The major restriction is that we now need to preserve the > kRootRegister, which was formerly used on most architectures to pass > the deoptimization id. The solution differs based on platform. > - Renamed DEOPT_ENTRIES_OR_FOR_TESTING code kind to FOR_TESTING. > - Removed heap/ support for immovable Code generation. > - Removed the DeserializerData class (no longer needed). > - arm64: to preserve 4-byte deopt exits, introduced a new optimization > in which the final jump to the deoptimization entry is generated > once per Code object, and deopt exits can continue to emit a > near-call. > - arm,ia32,x64: change to fixed-size deopt exits. This reduces exit > sizes by 4/8, 5, and 5 bytes, respectively. > > On arm the deopt exit size is reduced from 12 (or 16) bytes to 8 bytes > by using the same strategy as on arm64 (recalc deopt id from return > address). Before: > > e300a002 movw r10, <id> > e59fc024 ldr ip, [pc, <entry offset>] > e12fff3c blx ip > > After: > > e59acb35 ldr ip, [r10, <entry offset>] > e12fff3c blx ip > > On arm64 the deopt exit size remains 4 bytes (or 8 bytes in same cases > with CFI). Additionally, up to 4 builtin jumps are emitted per Code > object (max 32 bytes added overhead per Code object). Before: > > 9401cdae bl <entry offset> > > After: > > # eager deoptimization entry jump. > f95b1f50 ldr x16, [x26, <eager entry offset>] > d61f0200 br x16 > # lazy deoptimization entry jump. > f95b2b50 ldr x16, [x26, <lazy entry offset>] > d61f0200 br x16 > # the deopt exit. > 97fffffc bl <eager deoptimization entry jump offset> > > On ia32 the deopt exit size is reduced from 10 to 5 bytes. Before: > > bb00000000 mov ebx,<id> > e825f5372b call <entry> > > After: > > e8ea2256ba call <entry> > > On x64 the deopt exit size is reduced from 12 to 7 bytes. Before: > > 49c7c511000000 REX.W movq r13,<id> > e8ea2f0700 call <entry> > > After: > > 41ff9560360000 call [r13+<entry offset>] > > Bug: v8:8661,v8:8768 > Change-Id: I13e30aedc360474dc818fecc528ce87c3bfeed42 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2465834 > Commit-Queue: Jakob Gruber <jgruber@chromium.org> > Reviewed-by: Ross McIlroy <rmcilroy@chromium.org> > Reviewed-by: Tobias Tebbi <tebbi@chromium.org> > Reviewed-by: Ulan Degenbaev <ulan@chromium.org> > Cr-Commit-Position: refs/heads/master@{#70597} TBR=ulan@chromium.org,rmcilroy@chromium.org,jgruber@chromium.org,tebbi@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: v8:8661,v8:8768,chromium:1140165 Change-Id: I3df02ab42f6e02233d9f6fb80e8bb18f76870d91 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2485504Reviewed-by: Jakob Gruber <jgruber@chromium.org> Commit-Queue: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70649}
-
- 19 Oct, 2020 4 commits
-
-
Ng Zhi An authored
All these opcodes have a simple lowering into a single x64 instruction. When AVX is supported we can perform a similar optimization and not force dst == src1. Bug: v8:10116 Change-Id: I4ad2975b6f241d8209025682202b476c08b3491b Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2486383 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70636}
-
Ng Zhi An authored
We don't need separate Load32Zero and Load64Zero instructions, since the implementations are movss and movsd, which we already have. Bug: v8:10713 Change-Id: I5d02e946f3bf9fe08f943a811f2d3cc8aec81ea8 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2486233 Reviewed-by: Bill Budge <bbudge@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70635}
-
Ng Zhi An authored
Not sure why I originally chose to name it LoadMem32Zero instead of Load32Zero as in the proposal. This fixes it. Bug: v8:10713 Change-Id: If05603f743213bc6b7aea0ce22c80ae4b3023ccf Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2481824 Reviewed-by: Bill Budge <bbudge@chromium.org> Reviewed-by: Georg Neis <neis@chromium.org> Commit-Queue: Zhi An Ng <zhin@chromium.org> Cr-Commit-Position: refs/heads/master@{#70630}
-
Ng Zhi An authored
For splats, we can make use of vshufps to avoid a movss. Without AVX, specify dst to be the same as src in the instruction selector. For extract lane, we can use vshufps to extract a float into the dst xmm and leave junk in the higher bits. On the meshopt_decoder.js benchmark in the linked bug, this removes about 7 movss instructions that did nothing. Hardware can do register renaming, but let's not rely on that :) R=bbudge@chromium.org Bug: v8:10116 Change-Id: I4d68c10536a79659de673060d537d58113308477 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2481473 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#70628}
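A standalone intrinsics sketch of the splat (the values are made up): shufps with mask 0 broadcasts lane 0, and the AVX form vshufps can write any destination register without a preceding movss or movaps.
```
#include <immintrin.h>
#include <cstdio>

__m128 splat_lane0(__m128 v) {
  // SSE: shufps dst, v, 0 (dst must already hold v)
  // AVX: vshufps dst, v, v, 0 (any dst)
  return _mm_shuffle_ps(v, v, 0);
}

int main() {
  __m128 v = _mm_setr_ps(7.f, 1.f, 2.f, 3.f);
  float out[4];
  _mm_storeu_ps(out, splat_lane0(v));
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 7 7 7 7
  return 0;
}
```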
-