1. 14 Jan, 2021 1 commit
  2. 24 Nov, 2020 1 commit
  3. 09 Nov, 2020 1 commit
  4. 10 Jul, 2020 1 commit
  5. 10 Jun, 2020 3 commits
  6. 03 Jun, 2020 1 commit
  7. 28 Apr, 2020 1 commit
    • Iain Ireland's avatar
      [regexp] Handlify RegExpCompileData::code · 6bb3f0c0
      Iain Ireland authored
      RegExpMacroAssembler::GetCode returns a Handle<Object>. However, that
      Handle is almost immediately dereferenced, and is stored as a bare
      Object in both RegExpCompiler::CompilationResult and RegExpCompileData.
      
      This makes SpiderMonkey's rooting hazard analysis somewhat
      antsy. While RegExpCompileData is alive on the stack, the hazard
      analysis will not allow any calls that might GC, because it isn't
      smart enough to prove that the code field can't be clobbered by a GC.
      
      As far as I can tell, there is no real hazard here, but storing a
      Handle in RegExpCompileData instead of a bare Object will simplify SM
      and prevent a future patch from accidentally breaking something.
      
      Bug: v8:10406
      Change-Id: I9642dd05c591bfd23b340a89df2f2bf5c9fcac2c
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2161578Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67441}
      6bb3f0c0
  8. 21 Apr, 2020 2 commits
    • Jakob Gruber's avatar
      [regexp] Consistent expectations for output registers · fe609139
      Jakob Gruber authored
      ... between the interpreter and generated code.
      
      Prior to this CL, pre- and post conditions on the output register
      array differed between the interpreter and generated code.
      
      Interpreter
      Pre: `output` fits captures and temporary registers.
      Post: None.
      
      Generated code
      Pre:  `output` fits capture registers.
      Post: `output` is modified if and only if the match succeeded.
      
      This CL changes the interpreter to match generated code pre- and
      post conditions by allocating space for temporary registers inside
      the interpreter.
      
      Drive-by: Add MaxRegisterCount, RegistersForCaptureCount helpers.
      
      Bug: chromium:1067270
      Change-Id: I2900ef2f31207d817ec7ead3e0e2215b23b398f0
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2135642
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarLeszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67268}
      fe609139
    • Iain Ireland's avatar
      [regexp] Factor out PreprocessRegExp · 58ac66b7
      Iain Ireland authored
      RegExpImpl::Compile does a number of transformations that require
      directly manipulating the internal representation of the regexp. For
      example, when matching a (non-sticky, non-anchored) regular
      expression, the pattern must be wrapped in .* so that it can match
      anywhere in the input.
      
      In the interest of moving towards a cleaner division between irregexp
      and the outside world, it makes sense to move this code into
      RegExpCompiler.
      
      R=jgruber@chromium.org
      
      Bug: v8:10406
      Change-Id: I6da251c91c0016914a51480f80bb46c337fd0b23
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2140246Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67262}
      58ac66b7
  9. 19 Mar, 2020 3 commits
    • Iain Ireland's avatar
      Reland "[regexp] Rewrite error handling" · 560f2d8b
      Iain Ireland authored
      This is a reland of e80ca24c
      
      Original change's description:
      > [regexp] Rewrite error handling
      >
      > This patch modifies irregexp's error handling. Instead of representing
      > errors as C strings, they are represented as an enumeration value
      > (RegExpError), and only converted to strings when throwing the error
      > object in regexp.cc. This makes it significantly easier to integrate
      > into SpiderMonkey. A few notes:
      >
      > 1. Depending on whether the stack overflows during parsing or
      >    analysis, the stack overflow message can vary ("Stack overflow" or
      >    "Maximum call stack size exceeded"). I kept that behaviour in this
      >    patch, under the assumption that stack overflow messages are
      >    (sadly) the sorts of things that real world code ends up depending
      >    on.
      >
      > 2. Depending on the point in code where the error was identified,
      >    invalid unicode escapes could be reported as "Invalid Unicode
      >    escape", "Invalid unicode escape", or "Invalid Unicode escape
      >    sequence". I fervently hope that nobody depends on the specific
      >    wording of a syntax error, so I standardized on the first one. (It
      >    was both the most common, and the most consistent with other
      >    "Invalid X escape" messages.)
      >
      > 3. In addition to changing the representation, this patch also adds an
      >    error_pos field to RegExpParser and RegExpCompileData, which stores
      >    the position at which an error occurred. This is used by
      >    SpiderMonkey to provide more helpful messages about where a syntax
      >    error occurred in large regular expressions.
      >
      > 4. This model is closer to V8's existing MessageTemplate
      >    infrastructure. I considered trying to integrate it more closely
      >    with MessageTemplate, but since one of our stated goals for this
      >    project was to make it easier to use irregexp outside of V8, I
      >    decided to hold off.
      >
      > R=jgruber@chromium.org
      >
      > Bug: v8:10303
      > Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      > Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#66784}
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: Iad1f11a0e0b9e525d7499aacb56c27eff9e7c7b5
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2109952Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66798}
      560f2d8b
    • Leszek Swirski's avatar
      Revert "[regexp] Rewrite error handling" · 2193f691
      Leszek Swirski authored
      This reverts commit e80ca24c.
      
      Reason for revert: Causes failures in the fast/regex/non-pattern-characters.html Blink web test (https://ci.chromium.org/p/v8/builders/ci/V8%20Blink%20Linux/3679)
      
      Original change's description:
      > [regexp] Rewrite error handling
      > 
      > This patch modifies irregexp's error handling. Instead of representing
      > errors as C strings, they are represented as an enumeration value
      > (RegExpError), and only converted to strings when throwing the error
      > object in regexp.cc. This makes it significantly easier to integrate
      > into SpiderMonkey. A few notes:
      > 
      > 1. Depending on whether the stack overflows during parsing or
      >    analysis, the stack overflow message can vary ("Stack overflow" or
      >    "Maximum call stack size exceeded"). I kept that behaviour in this
      >    patch, under the assumption that stack overflow messages are
      >    (sadly) the sorts of things that real world code ends up depending
      >    on.
      > 
      > 2. Depending on the point in code where the error was identified,
      >    invalid unicode escapes could be reported as "Invalid Unicode
      >    escape", "Invalid unicode escape", or "Invalid Unicode escape
      >    sequence". I fervently hope that nobody depends on the specific
      >    wording of a syntax error, so I standardized on the first one. (It
      >    was both the most common, and the most consistent with other
      >    "Invalid X escape" messages.)
      > 
      > 3. In addition to changing the representation, this patch also adds an
      >    error_pos field to RegExpParser and RegExpCompileData, which stores
      >    the position at which an error occurred. This is used by
      >    SpiderMonkey to provide more helpful messages about where a syntax
      >    error occurred in large regular expressions.
      > 
      > 4. This model is closer to V8's existing MessageTemplate
      >    infrastructure. I considered trying to integrate it more closely
      >    with MessageTemplate, but since one of our stated goals for this
      >    project was to make it easier to use irregexp outside of V8, I
      >    decided to hold off.
      > 
      > R=​jgruber@chromium.org
      > 
      > Bug: v8:10303
      > Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      > Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#66784}
      
      TBR=jgruber@chromium.org,iireland@mozilla.com
      
      Change-Id: I9247635f3c5b17c943b9c4abaf82ebe7b2de165e
      No-Presubmit: true
      No-Tree-Checks: true
      No-Try: true
      Bug: v8:10303
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2108550Reviewed-by: 's avatarLeszek Swirski <leszeks@chromium.org>
      Commit-Queue: Leszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66786}
      2193f691
    • Iain Ireland's avatar
      [regexp] Rewrite error handling · e80ca24c
      Iain Ireland authored
      This patch modifies irregexp's error handling. Instead of representing
      errors as C strings, they are represented as an enumeration value
      (RegExpError), and only converted to strings when throwing the error
      object in regexp.cc. This makes it significantly easier to integrate
      into SpiderMonkey. A few notes:
      
      1. Depending on whether the stack overflows during parsing or
         analysis, the stack overflow message can vary ("Stack overflow" or
         "Maximum call stack size exceeded"). I kept that behaviour in this
         patch, under the assumption that stack overflow messages are
         (sadly) the sorts of things that real world code ends up depending
         on.
      
      2. Depending on the point in code where the error was identified,
         invalid unicode escapes could be reported as "Invalid Unicode
         escape", "Invalid unicode escape", or "Invalid Unicode escape
         sequence". I fervently hope that nobody depends on the specific
         wording of a syntax error, so I standardized on the first one. (It
         was both the most common, and the most consistent with other
         "Invalid X escape" messages.)
      
      3. In addition to changing the representation, this patch also adds an
         error_pos field to RegExpParser and RegExpCompileData, which stores
         the position at which an error occurred. This is used by
         SpiderMonkey to provide more helpful messages about where a syntax
         error occurred in large regular expressions.
      
      4. This model is closer to V8's existing MessageTemplate
         infrastructure. I considered trying to integrate it more closely
         with MessageTemplate, but since one of our stated goals for this
         project was to make it easier to use irregexp outside of V8, I
         decided to hold off.
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66784}
      e80ca24c
  10. 17 Mar, 2020 1 commit
  11. 16 Mar, 2020 1 commit
  12. 12 Mar, 2020 3 commits
    • Iain Ireland's avatar
      [regexp] Upstream small changes · a2b17a72
      Iain Ireland authored
      This is a grab-bag of small compatibility fixes to make it easier to
      import irregexp into SpiderMonkey. For changes where the commit
      message was longer than the change itself, it didn't seem worth
      opening a separate review.
      
      [regexp] Use uc16 in FilterOneByte
      
      SpiderMonkey uses char16_t instead of uint16_t for its two-byte
      strings. (This matches ICU. It looks like V8 considered making the
      same change, but decided against it: see
      https://bugs.chromium.org/p/v8/issues/detail?id=6487.) Fortunately,
      irregexp is careful about only using uc16, so SpiderMonkey can just
      define uc16 = char16_t and *almost* everything works out. This patch
      fixes the single place in irregexp where that is not true.
      
      [regexp] Remove unreachable return
      
      The return statement at the end of
      RegExpParser::ParseClassCharacterEscape is unreachable, because every
      branch of the switch returns. This triggered static analysis errors in
      SpiderMonkey.
      
      [regexp] Remove trivial assertion
      
      The assertion in BytecodeSequenceNode::ArgumentMapping cannot fail,
      because size_t is an unsigned type. This triggered static analysis
      warnings in SpiderMonkey.
      
      [regexp] Make RegExpStack constructor public
      
      In V8, the RegExpStack's private constructor is called from Isolate,
      which is a friend class. In SpiderMonkey, we use a wrapper around new
      to control where memory is allocated, so we need the RegExpStack
      constructor to be visible outside of Isolate.
      
      [regexp] Refactor Isolate::IncreaseTotalRegexpCodeGenerated
      
      The call-site of Isolate::IncreaseTotalRegexpCodeGenerated is the only
      place inside irregexp where HeapObject::Size is called. SpiderMonkey's
      heap-allocated objects live in arenas, and don't have a standardized
      way of finding the size. In this particular case it would be safe to
      hardcode a size of 0, but leaving HeapObject::Size undefined will
      ensure that SpiderMonkey doesn't silently do the wrong thing if
      somebody in V8 adds a new, more meaningful call to HeapObject::Size.
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: I5b81e1a261fec8c85a63f71f34cd12d68f638334
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2090191
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66676}
      a2b17a72
    • Iain Ireland's avatar
      [regexp] Use ZoneVector in parser and compiler · 5b44c169
      Iain Ireland authored
      For a variety of reasons related to OOM handling and custom
      allocators, SpiderMonkey wants to be able to see all memory
      allocations. To enforce this, we have a static analysis that verifies
      that we don't link in malloc/new/etc in unexpected places. One
      consequence of this is that we can't use STL containers without a
      custom allocator, because they call operator new internally.
      
      This is mostly not an issue in irregexp, which makes heavy use of zone
      allocation. The main exceptions are a handful of uses of std::vector
      in regexp-compiler.* and regexp-parser.*. If these vectors are
      converted to ZoneVectors, then our static analysis is satisfied.
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: I8b14a2eb54d3b20959e3fbe878f77effae124a2c
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091402Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66674}
      5b44c169
    • Iain Ireland's avatar
      [regexp] Remove function pointer in TextEmitPass · 73da478c
      Iain Ireland authored
      TextEmitPass uses a function pointer to determine which pass to
      call. This function pointer is only assigned inside TextEmitPass, and
      can easily be eliminated by moving the calls to each possible target
      inside the switch statement that assigns the function pointer.
      
      I made this change because SpiderMonkey uses a static analysis pass to
      verify that everything is rooted properly across calls that might GC,
      and that analysis is conservative when calling function pointers. We
      can white-list function pointers that are known to be safe, but the
      code being called through this function pointer is complex enough
      (and the function pointer is unnecessary enough) that it seemed best
      to just remove the function pointer entirely.
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: I5fbb0df290a2288c4d3db6d43a563385337162ea
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091398
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66672}
      73da478c
  13. 10 Mar, 2020 1 commit
    • Iain Ireland's avatar
      [regexp] Fix and unify non-unicode case-folding algorithms · 3fab9d05
      Iain Ireland authored
      Non-unicode, case-insensitive regexps (e.g. /foo/i, not foo/iu) use a
      case-folding algorithm that doesn't quite match the Unicode
      definition. There are two places in irregexp that need to do
      case-folding. Prior to this patch, neither of them quite matched the
      spec (https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).
      
      This patch implements the "Canonicalize" algorithm in
      src/regexp/special-case.h, and uses it in the relevant places. It
      replaces special-case logic around upper-casing / ASCII characters
      with the following approach:
      
      1. For most characters, calling UnicodeSet::closeOver on a set
         containing that character will produce the correct set of
         case-insensitive matches.
      
      2. For a small handful of characters (like the sharp S that prompted
         this change), UnicodeSet::closeOver will include some characters
         that should be omitted. For example, although closeOver('ß') =
         "ßẞ", uppercase('ß') is "SS", so step 3.e means that 'ß'
         canonicalizes to itself, and should not match 'ẞ'. In these cases,
         we can skip the closeOver entirely, because it will never add an
         equivalent character. These characters are in the IgnoreSet.
      
      3. For an even smaller handful of characters, UnicodeSet::closeOver
         will produce some characters that should be omitted, but also some
         characters that should be included. For example, closeOver('k') =
         "kKK" (lowercase k, uppercase K, U+212A KELVIN SIGN), but KELVIN
         SIGN should not match either of the other two (step 3.g). To handle
         this, we put such characters in the SpecialAddSet. In these cases,
         we closeOver the original character, but filter out the results
         that do not have the same canonical value.
      
      The computation of IgnoreSet and SpecialAddSet happens at build time,
      using the pre-existing gen-regexp-special-case.cc step.
      
      R=jgruber@chromium.org
      
      Bug: v8:10248
      Change-Id: I00d48b180c83bb8e645cc59eda57b01eab134f0b
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2072858Reviewed-by: 's avatarFrank Tang <ftang@chromium.org>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66641}
      3fab9d05
  14. 02 Mar, 2020 2 commits
  15. 09 Oct, 2019 1 commit
  16. 27 Sep, 2019 2 commits
  17. 29 Aug, 2019 2 commits
    • Jakob Gruber's avatar
      [regexp] Add an offset argument CheckAtStart · 2e0bc516
      Jakob Gruber authored
      Similar to CheckNotAtStart, one can now apply an offset to the
      CheckAtStart operation. Due to a recent change, all callsites of
      CheckNotAtStart now need to pass an offset, whereas previously the
      offset was just assumed to be zero.
      
      Bug: chromium:996391
      Change-Id: Ia59a584e93e5384479f05abddef7859b420b023a
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1773272
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarPeter Marshall <petermarshall@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#63444}
      2e0bc516
    • Jakob Gruber's avatar
      [regexp] Add dedicated flags for printing regexp code and bytecode · eebb18d3
      Jakob Gruber authored
      Printing regexp code used to behind the generic --print-code flag, but
      there was no way to distinguish between irregexp-generated code; and
      printing regexp bytecode was not supported at all (the
      --trace-regexp-bytecodes flag *did* exist, but prints the execution
      trace at runtime and not the generated bytecode sequence).
      
      This CL adds two new flags:
      
      --print-regexp-code
      --print-regexp-bytecode
      
      Regexp code is no longer printed as part of --print-code.
      
      Example output for --print-regexp-bytecode:
      
      generated bytecode for regexp pattern: .(?<!^.)
      0x1ddcc614cbd0     0  PUSH_BT, 02, 00, 00, 00, c0, 00, 00, 00 .......
      0x1ddcc614cbd8     8  LOAD_CURRENT_CHAR, 11, 00, 00, 00, b0, 00, 00, 00 .......
      0x1ddcc614cbe0    10  CHECK_CHAR, 18, 0a, 00, 00, b0, 00, 00, 00 .......
      0x1ddcc614cbe8    18  CHECK_CHAR, 18, 0d, 00, 00, b0, 00, 00, 00 .......
      0x1ddcc614cbf0    20  PUSH_CP, 01, 00, 00, 00 ...
      
      Bug: chromium:996391
      Change-Id: I731defbd7cf9ed29753a39bb1d7205dc136ca950
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1773249
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarPeter Marshall <petermarshall@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#63442}
      eebb18d3
  18. 28 Aug, 2019 1 commit
    • Jakob Gruber's avatar
      [regexp] Dont attempt to match '^' before the start of the string · 1990b1e1
      Jakob Gruber authored
      This fixes an invalid assumption when emitting code for matching '^'
      (start of line) in multiline regexps and '\b', '\B' in general.
      
      What we used to do: if the current trace's cp_offset (the offset from
      the current position) was non-zero, we assumed that we were looking at
      subject string index 1 or greater (i.e.: not at the start of the string
      or before).
      
      This is no longer valid since cp_offsets can now be negative.
      
      This CL changes the logic to omit start- and bounds-checks only for
      strictly positive cp_offsets, where the above assumption still holds.
      
      Bug: chromium:996391
      Change-Id: I79be4fc295c6f0b63e41c13d1e91fdd00f2f2b42
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1771794
      Commit-Queue: Erik Corry <erikcorry@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarErik Corry <erikcorry@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#63424}
      1990b1e1
  19. 23 Aug, 2019 1 commit
  20. 31 Jul, 2019 1 commit
    • Seth Brenith's avatar
      Reland "[regexp] Better quick checks on loop entry nodes" · bea0ffd0
      Seth Brenith authored
      This is a reland of 4b15b984
      
      Updates since original: fix an arithmetic overflow bug, remove an invalid
      DCHECK, add a unit test that would trigger that DCHECK.
      
      Original change's description:
      > [regexp] Better quick checks on loop entry nodes
      >
      > Like the predecessor change https://crrev.com/c/v8/v8/+/1702125 , this
      > change is inspired by attempting to exit earlier from generated RegExp
      > code, when no further matches are possible because any match would be
      > too long. The motivating example this time is the following expression,
      > which tests whether a string of Unicode playing cards has five of the
      > same suit in a row:
      >
      > /([🂡-🂮]{5})|([🂱-🂾]{5})|([🃁-🃎]{5})|([🃑-🃞]{5})/u
      >
      > A human reading this expression can readily see that any match requires
      > at least 10 characters (5 surrogate pairs), but the LoopChoiceNode for
      > each repeated option reports its minimum distance to the end of a match
      > as zero. This is correct, because the LoopChoiceNode's behavior depends
      > on additional state (the loop counter). However, the preceding node, a
      > SET_REGISTER action that initializes the loop counter, could confidently
      > state that it consumes at least 10 characters. Furthermore, when we try
      > to emit a quick check for that action, we could follow only paths from
      > the LoopChoiceNode that are possible based on the minimum iteration
      > count. This change implements both of those "could"s.
      >
      > I expect this improvement to apply pretty broadly to expressions that
      > use minimum repetition counts and that don't meet the criteria for
      > unrolling. In this particular case, I get about 12% improvement on the
      > overall UniPoker test, due to reducing the execution time of this
      > expression by 85% and the execution time of another similar expression
      > that checks for n-of-a-kind by 20%.
      >
      > Bug: v8:9305
      >
      > Change-Id: I319e381743967bdf83324be75bae943fbb5dd496
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1704941
      > Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#62963}
      
      Bug: v8:9305
      Change-Id: I992070d383009013881bf778242254c27134b650
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1726674Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
      Cr-Commit-Position: refs/heads/master@{#63009}
      bea0ffd0
  21. 30 Jul, 2019 1 commit
    • Leszek Swirski's avatar
      Revert "[regexp] Better quick checks on loop entry nodes" · 51afbd1a
      Leszek Swirski authored
      This reverts commit 4b15b984.
      
      Reason for revert: UBSan failure (https://logs.chromium.org/logs/v8/buildbucket/cr-buildbucket.appspot.com/8906578530303352544/+/steps/Check/0/logs/regress-126412/0).
      
      Original change's description:
      > [regexp] Better quick checks on loop entry nodes
      > 
      > Like the predecessor change https://crrev.com/c/v8/v8/+/1702125 , this
      > change is inspired by attempting to exit earlier from generated RegExp
      > code, when no further matches are possible because any match would be
      > too long. The motivating example this time is the following expression,
      > which tests whether a string of Unicode playing cards has five of the
      > same suit in a row:
      > 
      > /([🂡-🂮]{5})|([🂱-🂾]{5})|([🃁-🃎]{5})|([🃑-🃞]{5})/u
      > 
      > A human reading this expression can readily see that any match requires
      > at least 10 characters (5 surrogate pairs), but the LoopChoiceNode for
      > each repeated option reports its minimum distance to the end of a match
      > as zero. This is correct, because the LoopChoiceNode's behavior depends
      > on additional state (the loop counter). However, the preceding node, a
      > SET_REGISTER action that initializes the loop counter, could confidently
      > state that it consumes at least 10 characters. Furthermore, when we try
      > to emit a quick check for that action, we could follow only paths from
      > the LoopChoiceNode that are possible based on the minimum iteration
      > count. This change implements both of those "could"s.
      > 
      > I expect this improvement to apply pretty broadly to expressions that
      > use minimum repetition counts and that don't meet the criteria for
      > unrolling. In this particular case, I get about 12% improvement on the
      > overall UniPoker test, due to reducing the execution time of this
      > expression by 85% and the execution time of another similar expression
      > that checks for n-of-a-kind by 20%.
      > 
      > Bug: v8:9305
      > 
      > Change-Id: I319e381743967bdf83324be75bae943fbb5dd496
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1704941
      > Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#62963}
      
      TBR=jgruber@chromium.org,seth.brenith@microsoft.com
      
      Change-Id: Iac085b75e054fdf0d218987cfe449be1f1630545
      No-Presubmit: true
      No-Tree-Checks: true
      No-Try: true
      Bug: v8:9305
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1725621Reviewed-by: 's avatarLeszek Swirski <leszeks@chromium.org>
      Commit-Queue: Leszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#62977}
      51afbd1a
  22. 29 Jul, 2019 1 commit
    • Seth Brenith's avatar
      [regexp] Better quick checks on loop entry nodes · 4b15b984
      Seth Brenith authored
      Like the predecessor change https://crrev.com/c/v8/v8/+/1702125 , this
      change is inspired by attempting to exit earlier from generated RegExp
      code, when no further matches are possible because any match would be
      too long. The motivating example this time is the following expression,
      which tests whether a string of Unicode playing cards has five of the
      same suit in a row:
      
      /([🂡-🂮]{5})|([🂱-🂾]{5})|([🃁-🃎]{5})|([🃑-🃞]{5})/u
      
      A human reading this expression can readily see that any match requires
      at least 10 characters (5 surrogate pairs), but the LoopChoiceNode for
      each repeated option reports its minimum distance to the end of a match
      as zero. This is correct, because the LoopChoiceNode's behavior depends
      on additional state (the loop counter). However, the preceding node, a
      SET_REGISTER action that initializes the loop counter, could confidently
      state that it consumes at least 10 characters. Furthermore, when we try
      to emit a quick check for that action, we could follow only paths from
      the LoopChoiceNode that are possible based on the minimum iteration
      count. This change implements both of those "could"s.
      
      I expect this improvement to apply pretty broadly to expressions that
      use minimum repetition counts and that don't meet the criteria for
      unrolling. In this particular case, I get about 12% improvement on the
      overall UniPoker test, due to reducing the execution time of this
      expression by 85% and the execution time of another similar expression
      that checks for n-of-a-kind by 20%.
      
      Bug: v8:9305
      
      Change-Id: I319e381743967bdf83324be75bae943fbb5dd496
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1704941
      Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#62963}
      4b15b984
  23. 25 Jul, 2019 1 commit
    • Seth Brenith's avatar
      [regexp] Use stricter bounds check to avoid additional iteration · f42b1a5d
      Seth Brenith authored
      The motivating example is JetStream 2's UniPoker test, which tests
      whether a sorted string of Unicode playing cards contains a five-card
      straight using a regular expression. In the top-level generated loop for
      this RegExp, we see this loop exit condition:
      
      00000350000C2067    27  83fffe         cmpl rdi,0xfe
      00000350000C206A    2a  0f8da8e40000   jge 00000350000D0518  <+0xe4d8>
      
      Meaning if the current position is pointing at the very last (16-bit)
      character, then we exit the loop. Otherwise we go on and try to find
      various matches starting at the current position. However, we can see
      in the original expression that any possible match is at least 10
      characters (5 astral-plane Unicode values), so we're wasting a lot of
      time attempting to find matches in cases where we're too close to the
      end of the string for any match to succeed.
      
      This example might be a bit contrived, but I expect that an improvement
      in this bounds check would help a larger family of regular expressions,
      where the minimum match length is large relative to the string being
      matched and we don't meet the other necessary criteria for fast Boyer-
      Moore lookahead.
      
      To get the desired bounds check in this case, this patch does the
      following:
      1. Compute accurate EatsAtLeast values for every node during the
         analysis phase. This could end up doing more work than the current
         implementation, but analysis already has to touch every node, so it
         seems like a cache-friendly time to compute these values. In some
         cases, this might be less total work than the current implementation,
         because the current implementation might recompute the same node
         multiple times.
      2. When emitting a quick check, use the EatsAtLeast value from the
         predecessor ChoiceNode for the bounds check.
      
      This improves the UniPoker score on my machine by about 4%, because it
      cuts the time spent checking for straights roughly in half, and checking
      for straights originally accounted for about 8% of the total time.
      
      Bug: v8:9305
      Change-Id: I110b190c2578f73b2263259d5aa5750e921b01be
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1702125
      Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
      Reviewed-by: 's avatarJakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#62919}
      f42b1a5d
  24. 01 Jul, 2019 1 commit
    • Jakob Gruber's avatar
      [regexp] Fix BoyerMooreLookahead behavior at submatches · bc4cbe92
      Jakob Gruber authored
      Since https://codereview.chromium.org/2777583003, the Boyer-Moore
      lookahead (used by the irregexp engine) also looks inside submatches
      to narrow down its range of accepted characters at specific offsets.
      
      But the end of a submatch, designated by a PositiveSubmatchSuccess
      action node, was not handled correctly. When a submatch terminates,
      we have no knowledge of what may follow, and thus must accept any
      character at following positions. This is done by the SetRest call
      added in this CL.
      
      An example, since this is fairly obscure:
      
      /^.*?Y(((?=B?).)*)Y$/s
      
      The initial non-greedy loop, together with the s flag,
      will trigger an attempted Boyer-Moore lookahead. After this follows
      an unconditional Y, a *-quantified loop matching any char and
      containing a lookahead that matches either 1 B or 0 B's, and an
      unconditional trailing Y.
      
      When the BM lookahead scans the subject string for the beginning of
      this pattern after the non-greedy loop, it should look for: a Y at
      offset 0, and either a B, a Y, or '.' (-> any character) at offset 1.
      
      Prior to this CL this was not the case:
      
      - The lookaround is internally generated as a submatch.
      - The optional 'B?' is unrolled into 'either B followed by submatch
        end' or 'submatch end'.
      - Filling in BM infos terminates when encountering a submatch end.
        Thus in the former case we added B to the set of accepted characters
        and terminated, while in the latter case we simply terminated.o
      
      This CL ensures that BM will accept any character at any offset at or
      exceeding the first encountered submatch end.
      
      Bug: v8:8770
      Change-Id: Iff998ba307cd9669203846a9182798b8cf6a85dc
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1679506
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarErik Corry <erikcorry@chromium.org>
      Reviewed-by: 's avatarYang Guo <yangguo@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#62460}
      bc4cbe92
  25. 27 Jun, 2019 1 commit
    • Jakob Gruber's avatar
      [regexp] Refactor BoyerMoorePositionInfo uses · ad68a376
      Jakob Gruber authored
      BoyerMoorePositionInfo is a simple wrapper around a bitset and an
      associated ContainedInLattice field. This CL refactors bitset-related
      operations that used to be implemented naively (e.g.: loop over all
      bits to find a single set bit, or to generate a union of two bitsets).
      
      Instead, use more suitable methods from base::bits and std::bitset.
      
      Drive-by: Remove dead class members.
      Drive-by: Zero the ByteArray with memset.
      
      Bug: v8:9359
      Change-Id: I33923c7d216320f4e3d7e4a6df2967f4aa86ab05
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1667407
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: 's avatarLeszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#62419}
      ad68a376
  26. 19 Jun, 2019 2 commits
  27. 18 Jun, 2019 3 commits