• Iain Ireland's avatar
    [regexp] Propagate eats_at_least for negative lookahead · 363ab5ae
    Iain Ireland authored
    In issue 11290, we disabled the propagation of EAL data out of
    lookarounds, because it was incorrect for lookahead nodes in
    loops. This caused performance regressions: for example,
    `/^\P{Letter}+$/u` (matching only characters that are not in Unicode's
    Letter category) uses negative lookahead when matching lone
    surrogates, and became about 2x slower. I spent some time looking into
    fixes, and this is what I've settled on.
    
    Some background: the implementation of lookarounds in irregexp is
    split between positive and negative lookaheads. (Lookbehinds aren't
    relevant here, because backwards matches always have EAL=0.)  Positive
    lookaheads are wrapped in BEGIN_SUBMATCH and POSITIVE_SUBMATCH_SUCCESS
    ActionNodes. BEGIN_SUBMATCH saves the current state.
    POSITIVE_SUBMATCH_SUCCESS restores the necessary state (while leaving
    any captures that occurred during the lookaround intact).
    
    Negative lookaheads also begin with a BEGIN_SUBMATCH node, but follow
    it with a NegativeLookaroundChoiceNode. This node has two successors:
    a lookaround node, and a continue node. It only executes the continue
    node if the lookaround node backtracks, which automatically restores
    the previous state. Negative lookarounds also can't update captures.
    
    This affects EAL calculations. It turns out that negative lookaheads
    are already doing the right thing: EatsAtLeastPropagator only
    propagates information from the continue node, ignoring the lookaround
    node. The same is true for quick checks (see the comment in
    RegExpLookaround:Builder::ForMatch). A BEGIN_SUBMATCH for a negative
    lookahead can simply propagate the EAL data from its successor like
    any other ActionNode, and everything works.
    
    Positive lookaheads are harder. I tried saving a pointer to the
    successor in BEGIN_SUBMATCH, but ran into problems in FillInBMInfo,
    because the EAL value corresponded to the nodes after the lookahead,
    but the analysis was still looking at the nodes inside. I fell back
    to a more modest approach: split BEGIN_SUBMATCH in two, and propagate
    EAL info for BEGIN_NEGATIVE_SUBMATCH while keeping the current
    behaviour for BEGIN_POSITIVE_SUBMATCH. This fixes the performance
    regression at hand.
    
    Two potential approaches for fixing EAL for positive lookahead are:
     1. Handling positive lookahead with its own dedicated choice node,
        like NegativeLookaroundChoiceNode.
     2. Adding an eats_at_least_inside_loop field to EatsAtLeastInfo,
        which is <= eats_at_least_from_possibly_start, and using that
        value in EatsAtLeastFromLoopEntry.
    
    Both of those approaches are more complex than I want to tackle
    right now, though.
    
    Bug: v8:11844
    Change-Id: I2a43509c2c21194b8c18f0a587fa21c194db76c2
    Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2934858Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
    Commit-Queue: Jakob Gruber <jgruber@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#75031}
    363ab5ae
regexp-nodes.h 30.9 KB