Commit 363ab5ae authored by Iain Ireland's avatar Iain Ireland Committed by V8 LUCI CQ

[regexp] Propagate eats_at_least for negative lookahead

In issue 11290, we disabled the propagation of EAL data out of
lookarounds, because it was incorrect for lookahead nodes in
loops. This caused performance regressions: for example,
`/^\P{Letter}+$/u` (matching only characters that are not in Unicode's
Letter category) uses negative lookahead when matching lone
surrogates, and became about 2x slower. I spent some time looking into
fixes, and this is what I've settled on.

Some background: the implementation of lookarounds in irregexp is
split between positive and negative lookaheads. (Lookbehinds aren't
relevant here, because backwards matches always have EAL=0.)  Positive
lookaheads are wrapped in BEGIN_SUBMATCH and POSITIVE_SUBMATCH_SUCCESS
ActionNodes. BEGIN_SUBMATCH saves the current state.
POSITIVE_SUBMATCH_SUCCESS restores the necessary state (while leaving
any captures that occurred during the lookaround intact).

Negative lookaheads also begin with a BEGIN_SUBMATCH node, but follow
it with a NegativeLookaroundChoiceNode. This node has two successors:
a lookaround node, and a continue node. It only executes the continue
node if the lookaround node backtracks, which automatically restores
the previous state. Negative lookarounds also can't update captures.

This affects EAL calculations. It turns out that negative lookaheads
are already doing the right thing: EatsAtLeastPropagator only
propagates information from the continue node, ignoring the lookaround
node. The same is true for quick checks (see the comment in
RegExpLookaround:Builder::ForMatch). A BEGIN_SUBMATCH for a negative
lookahead can simply propagate the EAL data from its successor like
any other ActionNode, and everything works.

Positive lookaheads are harder. I tried saving a pointer to the
successor in BEGIN_SUBMATCH, but ran into problems in FillInBMInfo,
because the EAL value corresponded to the nodes after the lookahead,
but the analysis was still looking at the nodes inside. I fell back
to a more modest approach: split BEGIN_SUBMATCH in two, and propagate
EAL info for BEGIN_NEGATIVE_SUBMATCH while keeping the current
behaviour for BEGIN_POSITIVE_SUBMATCH. This fixes the performance
regression at hand.

Two potential approaches for fixing EAL for positive lookahead are:
 1. Handling positive lookahead with its own dedicated choice node,
    like NegativeLookaroundChoiceNode.
 2. Adding an eats_at_least_inside_loop field to EatsAtLeastInfo,
    which is <= eats_at_least_from_possibly_start, and using that
    value in EatsAtLeastFromLoopEntry.

Both of those approaches are more complex than I want to tackle
right now, though.

Bug: v8:11844
Change-Id: I2a43509c2c21194b8c18f0a587fa21c194db76c2
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2934858Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
Commit-Queue: Jakob Gruber <jgruber@chromium.org>
Cr-Commit-Position: refs/heads/master@{#75031}
parent 4a78a7d0
......@@ -829,7 +829,7 @@ RegExpNode* RegExpAssertion::ToNode(RegExpCompiler* compiler,
-1, // Ignored if no captures.
on_success));
// Create an end-of-input matcher.
RegExpNode* end_of_line = ActionNode::BeginSubmatch(
RegExpNode* end_of_line = ActionNode::BeginPositiveSubmatch(
stack_pointer_register, position_register, newline_matcher);
// Add the two alternatives to the ChoiceNode.
GuardedAlternative eol_alternative(end_of_line);
......@@ -880,8 +880,8 @@ RegExpLookaround::Builder::Builder(bool is_positive, RegExpNode* on_success,
RegExpNode* RegExpLookaround::Builder::ForMatch(RegExpNode* match) {
if (is_positive_) {
return ActionNode::BeginSubmatch(stack_pointer_register_,
position_register_, match);
return ActionNode::BeginPositiveSubmatch(stack_pointer_register_,
position_register_, match);
} else {
Zone* zone = on_success_->zone();
// We use a ChoiceNode to represent the negative lookaround. The first
......@@ -891,8 +891,8 @@ RegExpNode* RegExpLookaround::Builder::ForMatch(RegExpNode* match) {
// first exit when calculating quick checks.
ChoiceNode* choice_node = zone->New<NegativeLookaroundChoiceNode>(
GuardedAlternative(match), GuardedAlternative(on_success_), zone);
return ActionNode::BeginSubmatch(stack_pointer_register_,
position_register_, choice_node);
return ActionNode::BeginNegativeSubmatch(stack_pointer_register_,
position_register_, choice_node);
}
}
......
......@@ -603,7 +603,7 @@ void NegativeSubmatchSuccess::Emit(RegExpCompiler* compiler, Trace* trace) {
assembler->ClearRegisters(clear_capture_start_, clear_capture_end);
}
// Now that we have unwound the stack we find at the top of the stack the
// backtrack that the BeginSubmatch node got.
// backtrack that the BeginNegativeSubmatch node got.
assembler->Backtrack();
}
......@@ -668,10 +668,19 @@ ActionNode* ActionNode::ClearCaptures(Interval range, RegExpNode* on_success) {
return result;
}
ActionNode* ActionNode::BeginSubmatch(int stack_reg, int position_reg,
RegExpNode* on_success) {
ActionNode* ActionNode::BeginPositiveSubmatch(int stack_reg, int position_reg,
RegExpNode* on_success) {
ActionNode* result =
on_success->zone()->New<ActionNode>(BEGIN_POSITIVE_SUBMATCH, on_success);
result->data_.u_submatch.stack_pointer_register = stack_reg;
result->data_.u_submatch.current_position_register = position_reg;
return result;
}
ActionNode* ActionNode::BeginNegativeSubmatch(int stack_reg, int position_reg,
RegExpNode* on_success) {
ActionNode* result =
on_success->zone()->New<ActionNode>(BEGIN_SUBMATCH, on_success);
on_success->zone()->New<ActionNode>(BEGIN_NEGATIVE_SUBMATCH, on_success);
result->data_.u_submatch.stack_pointer_register = stack_reg;
result->data_.u_submatch.current_position_register = position_reg;
return result;
......@@ -3340,7 +3349,8 @@ void ActionNode::Emit(RegExpCompiler* compiler, Trace* trace) {
on_success()->Emit(compiler, &new_trace);
break;
}
case BEGIN_SUBMATCH:
case BEGIN_POSITIVE_SUBMATCH:
case BEGIN_NEGATIVE_SUBMATCH:
if (!trace->is_trivial()) {
trace->Flush(compiler, this);
} else {
......@@ -3533,28 +3543,32 @@ class EatsAtLeastPropagator : public AllStatic {
}
static void VisitAction(ActionNode* that) {
// - BEGIN_SUBMATCH and POSITIVE_SUBMATCH_SUCCESS wrap lookarounds.
// Lookarounds rewind input, so their eats_at_least value must not
// propagate to surroundings.
// TODO(jgruber): Instead of resetting EAL to 0 at lookaround boundaries,
// analysis should instead skip over the lookaround and look at whatever
// follows the lookaround. A simple solution would be to store a pointer to
// the associated POSITIVE_SUBMATCH_SUCCESS node in the BEGIN_SUBMATCH
// node, and use that during analysis.
// - SET_REGISTER_FOR_LOOP indicates a loop entry point, which means the
// loop body will run at least the minimum number of times before the
// continuation case can run. Otherwise the current node eats at least as
// much as its successor.
switch (that->action_type()) {
case ActionNode::BEGIN_SUBMATCH:
case ActionNode::BEGIN_POSITIVE_SUBMATCH:
case ActionNode::POSITIVE_SUBMATCH_SUCCESS:
// We do not propagate eats_at_least data through positive lookarounds,
// because they rewind input.
// TODO(v8:11859) Potential approaches for fixing this include:
// 1. Add a dedicated choice node for positive lookaround, similar to
// NegativeLookaroundChoiceNode.
// 2. Add an eats_at_least_inside_loop field to EatsAtLeastInfo, which
// is <= eats_at_least_from_possibly_start, and use that value in
// EatsAtLeastFromLoopEntry.
DCHECK(that->eats_at_least_info()->IsZero());
break;
case ActionNode::SET_REGISTER_FOR_LOOP:
// SET_REGISTER_FOR_LOOP indicates a loop entry point, which means the
// loop body will run at least the minimum number of times before the
// continuation case can run.
that->set_eats_at_least_info(
that->on_success()->EatsAtLeastFromLoopEntry());
break;
case ActionNode::BEGIN_NEGATIVE_SUBMATCH:
default:
// Otherwise, the current node eats at least as much as its successor.
// Note: we can propagate eats_at_least data for BEGIN_NEGATIVE_SUBMATCH
// because NegativeLookaroundChoiceNode ignores its lookaround successor
// when computing eats-at-least and quick check information.
that->set_eats_at_least_info(*that->on_success()->eats_at_least_info());
break;
}
......
......@@ -209,9 +209,13 @@ void DotPrinterImpl::VisitAction(ActionNode* that) {
os_ << "label=\"$" << that->data_.u_position_register.reg
<< ":=$pos\", shape=octagon";
break;
case ActionNode::BEGIN_SUBMATCH:
case ActionNode::BEGIN_POSITIVE_SUBMATCH:
os_ << "label=\"$" << that->data_.u_submatch.current_position_register
<< ":=$pos,begin\", shape=septagon";
<< ":=$pos,begin-positive\", shape=septagon";
break;
case ActionNode::BEGIN_NEGATIVE_SUBMATCH:
os_ << "label=\"$" << that->data_.u_submatch.current_position_register
<< ":=$pos,begin-negative\", shape=septagon";
break;
case ActionNode::POSITIVE_SUBMATCH_SUCCESS:
os_ << "label=\"escape\", shape=septagon";
......
......@@ -314,7 +314,8 @@ class ActionNode : public SeqRegExpNode {
SET_REGISTER_FOR_LOOP,
INCREMENT_REGISTER,
STORE_POSITION,
BEGIN_SUBMATCH,
BEGIN_POSITIVE_SUBMATCH,
BEGIN_NEGATIVE_SUBMATCH,
POSITIVE_SUBMATCH_SUCCESS,
EMPTY_MATCH_CHECK,
CLEAR_CAPTURES
......@@ -325,8 +326,12 @@ class ActionNode : public SeqRegExpNode {
static ActionNode* StorePosition(int reg, bool is_capture,
RegExpNode* on_success);
static ActionNode* ClearCaptures(Interval range, RegExpNode* on_success);
static ActionNode* BeginSubmatch(int stack_pointer_reg, int position_reg,
RegExpNode* on_success);
static ActionNode* BeginPositiveSubmatch(int stack_pointer_reg,
int position_reg,
RegExpNode* on_success);
static ActionNode* BeginNegativeSubmatch(int stack_pointer_reg,
int position_reg,
RegExpNode* on_success);
static ActionNode* PositiveSubmatchSuccess(int stack_pointer_reg,
int restore_reg,
int clear_capture_count,
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment