1. 30 Oct, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Add 'l' flag to force experimental engine · 5720d205
      Martin Bidlingmaier authored
      This commit adds the 'l' (linear) RegExp flag (as in e.g. /asdf|123/l)
      that forces execution in linear time.  These regexps are handled by the
      experimental engine.  If the experimental engine cannot handle the
      pattern, an exception is thrown on creation of the regexp.
      
      The commit also adds a new global V8 flag and changes an existing one:
      * --enable-experimental-engine, which turns on recognition of the RegExp
        'l' flag.  Previously this flag also caused all supported regexps to
        be executed by the experimental engine; this is not the case anymore.
      * --default-to-experimental-regexp-engine takes over the previous
        semantics of --enable-experimental-regexp-engine:  We execute all
        supported regexps with the experimental engine.
      
      Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng
      Bug: v8:10765
      Change-Id: I5622a89b19404105e8be280d454e9fdd63c003b3
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2461244Reviewed-by: 's avatarUlan Degenbaev <ulan@chromium.org>
      Reviewed-by: 's avatarGeorg Neis <neis@chromium.org>
      Reviewed-by: 's avatarSimon Zünd <szuend@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Martin Bidlingmaier <mbid@google.com>
      Cr-Commit-Position: refs/heads/master@{#70892}
      5720d205
  2. 23 Sep, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Support the msy flags in experimental engine · e6e9cbac
      Martin Bidlingmaier authored
      The m (multiline) and s (dotall) flags just needed to be marked as
      allowed; the required logic was already in the regexp parser.
      
      A regexp /<x>/ without the y (sticky) flag is equivalent to the sticky
      regexp /.*?<x>/y.  The interpreter now assumes that every regexp is
      sticky, and the compiler appends a preamble corresponding to /.*?/
      before non-sticky regexps.  To reuse existing code for compiling this
      preamble, the logic for each kind of quantifier is now in a separate
      function and called from VisitQuantifier and for the preamble.
      
      The commit also includes some improvements/fixes for character ranges:
      - Empty character ranges/disjunctions should never match, but before
        this commit they would *always* match.
      - The check of the range bounds in CanBeHandledVisitor was unncessary;
        without the unicode flag this can't be a range that can't be specified
        in 2-byte codepoints, and once we support unicode we simply support
        all codepoints.
      - The capacity of the list containing the complementary intervals of a
        character range is now calculated more accurately.
      
      Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng
      Bug: v8:10765
      Change-Id: I71a0e07279b4e1140c0ed1651b3714200c801de9
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2404766
      Commit-Queue: Martin Bidlingmaier <mbid@google.com>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#70082}
      e6e9cbac
  3. 21 Sep, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Support assertions in experimental engine · e83511c2
      Martin Bidlingmaier authored
      Assertions are implemented with the new ASSERTION instruction.  The nfa
      interpreter evaluates the assertion based on the current context in the
      subject string every time a thread executes ASSERTION.  This is
      analogous to what re2 and rust/regex do.
      
      Alternatives to this approach:
      - The interpreter could calculate eagerly for all assertion types
        whether they are satisfied whenever the current input position is
        advanced.  This would make evaluating the ASSERTION instruction itself
        cheaper, but at the cost of making every advance in the input string
        more expensive.  I suspect this would be slower on average because
        assertions are not that common that we typically evaluate >= 2
        assertions at every input position.
      - Assertions in a regexp could be desugared into CONSUME_RANGE
        instructions, so that no new instruction would be necessary.  For
        example, the word boundary assertion \b is satisfied at a given
        position/state if we have just consumed a word character and will
        consume a non-word character next, or vice-versa.  The tricky part
        about this is that the assertion itself should not consume input, so
        we'd have to split (automaton) states according to whether we've
        arrived at them via a word character or not.  The current compiler is
        not really equipped for this kind of transformation.  For {start,end}
        of {line,file} assertions, we'd need to introduce dummy characters
        indicating start/end of input (say, 0x10000 and 0x10001) which we feed
        to the interpreter before respectively after the actual input.
        I suspect that this approach wouldn't make much of a difference for
        NFA execution. It would likely speed up (lazy) DFA execution though
        because assertions would be dealt with in the fast path.
      
      Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng
      Bug: v8:10765
      Change-Id: Ic2012c943e0ce54eb8662789fb3d4c1b6cd8d606
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2398644
      Commit-Queue: Martin Bidlingmaier <mbid@google.com>
      Reviewed-by: 's avatarLeszek Swirski <leszeks@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#70026}
      e83511c2
  4. 16 Sep, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Support capture groups in experimental engine · 98b8ca89
      Martin Bidlingmaier authored
      This commit adds support for capture groups (as in e.g. /x(123|abc)y/)
      in the experimental regexp engine.  Now every InterpreterThread owns a
      register array containing (sub)match boundaries. There is a new
      instruction to record the current input index in some register.
      
      Submatches in quantifier bodies should be reported only if they occur
      during the last repetition.  Thus we reset those registers before
      attempting to match the body of a quantifier.  This is implemented with
      another new instruction.
      
      Because of concerns for the growing sizeof the NfaInterpreter object
      (which is allocated on the stack), this commit replaces the
      `SmallVector` members of the NfaInterpreter with zone-allocated arrays.
      Register arrays, which for a fixed regexp are all the same size, are
      allocated with a RecyclingZoneAllocator for cheap memory reclamation via
      a linked list of equally-sized free blocks.
      
      Possible optimizations for management of register array memory:
      1. If there are few register per thread, then it is likely faster to
         store them inline in the InterpreterThread struct.
      2. re2 implements copy-on-write:  InterpreterThreads can share the same
         register array. If a thread attempts to write to shared register
         array, the register array is cloned first.
      3. The register at index 1 contains the end of the match; this is only
         written to right before an ACCEPT statement.  We could make ACCEPT
         equivalent to what's currently CAPTURE 1 followed by ACCEPT.  We
         could then save the memory for register 1 for threads that haven't
         finished yet.  This is particularly interesting if now optimization 1
         kicks in.
      
      Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng
      Bug: v8:10765
      Change-Id: I2c0503206ce331e13ac9912945bb66736d740197
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2390770
      Commit-Queue: Martin Bidlingmaier <mbid@google.com>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#69929}
      98b8ca89
  5. 09 Sep, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Support more quantifiers in experimental engine · f2a832ca
      Martin Bidlingmaier authored
      Previously to this commit only quantifiers of the form /<x>*/, i.e.
      arbitrarily often greedy repetition, were implemented.  Now a much
      larger class is supported, e.g. + and ? and their non-greedy variants.
      Because it came up repeatedly during the implementation, the commit also
      adds the Label and DeferredLabel classes to patch JMP and FORK target
      addresses more easily.
      
      Still not supported are the following quantifiers:
      - Possessive quantifiers, where I'm not entirely sure whether they could
        be implemented in principle. Re2 doesn't support them.
      - Quantifiers with large but finite numbers for min and max numbers of
        repetitions, as in e.g. /<x>{9000, 90000}/. These are currently
        limited to some small value. This is because the body of such
        repetitions is unrolled explicitly, so the size of the bytecode is
        linear in the number of repetitions.
      
      Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng
      Bug: v8:10765
      Change-Id: Id04d893252588abb0f80c3cb33cfc707f6601ea0
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2387575
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#69759}
      f2a832ca
  6. 02 Sep, 2020 1 commit
    • Martin Bidlingmaier's avatar
      [regexp] Fix usage of {Is,Mark}PcProcessed in NfaInterpreter · 943f78a4
      Martin Bidlingmaier authored
      Previously we checked whether a thread's pc IsPcProcessed before pushing
      to the stack of (postponed) active_threads_.  This commit moves the
      IsPcProcessed check and corresponding MarkPcProcessed call to when the
      thread is actually processed, i.e. when it is popped from the
      active_threads_ stack again.
      
      This fixes two issues:
      - Consider what used to happen in the following scenario:
      1. An active thread t is postponed (e.g. because it is a fork) and
       pushed on active_threads_.  IsPcProcessed(t.pc) is false, so t is
       not discarded and does actually end up on active_threads_.
      2. Some other thread s is executed, and at some point s.pc == t.pc,
       i.e. t.pc is marked as processed.
      3. t is popped from active_threads_ for processing.
      
      In 3 we don't want to continue execution of t: After all, its pc is
      already marked as processed.  But because previously we only checked
      for IsPcProcessed in step 1 before pushing to active_threads_, we used
      to continue execution in 3.  I don't think this is a correctness
      issue, but possibly a performance problem.  In any case, this commit
      moves the IsPcProcessed check from 1 to 3 and so fixes this.
      - After flushing blocked_threads_, we push them to active_threads_
      again.  While doing so, we used to mark these thread's pcs as processed.
      This meant that sometimes a (fork of a) high priority thread was
      cancelled by the IsPcProcessed check even though its pc was only
      marked as processed by a thread with lower priority during flushing.
      We need it to be the other way round:  The low priority thread should
      be cancelled after its pc is processed by a thread with higher
      priority.
      With this commit we don't MarkPcProcessed during flushing, it's
      postponed to when we're actually processing.  This was a correctness
      issue, and there's a new corresponding test case.
      
      
      Bug: v8:10765
      Change-Id: Ie12682cf3f8a04222d907edd8a3ad25baa69465a
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2388112
      Commit-Queue: Martin Bidlingmaier <mbid@google.com>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#69668}
      943f78a4
  7. 31 Aug, 2020 1 commit