1. 13 May, 2022 2 commits
  2. 20 Jan, 2022 1 commit
  3. 18 Jan, 2022 1 commit
  4. 01 Dec, 2021 1 commit
    • [regexp] Fix CharacterRange limits again again again · 2e17aaca
      Jakob Gruber authored
      When emitting code, character ranges must only specify ranges which
      the actual subject string (one- or two-byte) may contain.
      
      This was not always the case, specifically for ranges with
      `from <= kMaxUint8` and `to > kMaxUint8`.
      
      The reason this is so tricky: 1. not all parts of the pipeline know
      whether we are compiling for one- or two-byte subjects; 2. for
      case-insensitive regexps, an out-of-bounds CharacterRange may have an
      in-bounds case equivalent (e.g. /[Ÿ]/i also matches 'ÿ' == \u{ff}),
      which only gets added somewhere in the middle of the pipeline.
      
      Our current solution is to clamp immediately before code emission. We
      also keep the existing handling/dchecks of the 0x10ffff marker value
      which may occur in the two-byte subject case.
      
      Bug: v8:11069
      Change-Id: Ic7b34a13a900ea2aa3df032daac9236bf5682a42
      Fixed: chromium:1275096
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3306569
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: Leszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/main@{#78186}
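The clamping step described in this commit can be sketched as follows. This is a minimal illustration, not V8's code: the `Range` struct and the `ClampRanges` function are hypothetical names, and the sketch ignores the 0x10ffff marker value that the commit message says is still handled separately for two-byte subjects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a character range; `to` is inclusive.
struct Range {
  uint32_t from;
  uint32_t to;
};

constexpr uint32_t kMaxUint8 = 0xFF;  // largest code unit in a one-byte subject

// Returns the part of `ranges` that a subject with code units in
// [0, max_char] can actually contain. Ranges entirely above max_char are
// dropped; ranges that straddle max_char are cut at max_char.
std::vector<Range> ClampRanges(const std::vector<Range>& ranges,
                               uint32_t max_char) {
  std::vector<Range> out;
  for (const Range& r : ranges) {
    assert(r.from <= r.to);
    if (r.from > max_char) continue;  // fully out of bounds
    out.push_back({r.from, std::min(r.to, max_char)});
  }
  return out;
}
```

With `max_char = kMaxUint8`, a range such as [0xF0, 0x178] (the problematic `from <= kMaxUint8`, `to > kMaxUint8` case) becomes [0xF0, 0xFF]. Doing this only just before emission leaves earlier stages free to add in-bounds case equivalents first.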
  5. 09 Nov, 2021 1 commit
  6. 26 Oct, 2021 1 commit
  7. 25 Oct, 2021 1 commit
    • [regexp] Only emit valid ranges in MakeRangeArray · b7dc9915
      Jakob Gruber authored
      Character class handling in the irregexp pipeline is quite complex;
      codepoints outside the BMP (basic multilingual plane) are only
      translated into surrogate pairs when needed, e.g. when the subject
      string is two-byte. If not needed, the codepoints simply stay part of
      the list of CharacterRanges.
      
      In EmitCharClass, we determine the valid subset of ranges through
      ranges_length; until this CL, we forgot to pass that information on to
      MakeRangeArray. Do that now by truncating the list of CharacterRanges.
      
      Fixed: chromium:1262423
      Change-Id: I5bb5b839e9935890ca2d10908ad66d72c3217178
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3240782
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: Mathias Bynens <mathias@chromium.org>
      Cr-Commit-Position: refs/heads/main@{#77514}
  8. 19 Oct, 2021 1 commit
    • [regexp] Compact codegen for large character classes · 8bbb44e5
      Jakob Gruber authored
      Large character classes may easily be created when unicode
      properties (e.g.: /\p{L}/u and /\P{L}/u) are used - these are
      expanded internally into character classes that consist of hundreds
      of character ranges. Prior to this CL, we'd emit branching code
      for each of these ranges, leading to very large regexp code objects.
      
      This CL adds a new codegen mode for large character classes (where
      'large' currently means > 16 ranges). Instead of emitting branching
      code inline, the ranges are written into a ByteArray and we call into
      the C function IsCharacterInRangeArray for the actual branching logic.
      The ByteArray is smaller than emitted code and is deduplicated if the
      same character class is matched repeatedly in the same pattern.
      
      Note this mode is *not* implemented for the interpreter, since we
      currently don't have a constant pool for irregexp bytecode, and thus
      cannot reference ByteArrays.
      
      Bug: v8:11069
      Change-Id: I2d728e42d85114b796c637f791848731a104cd54
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3229377
      Reviewed-by: Patrick Thier <pthier@chromium.org>
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/main@{#77463}
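One way to picture the table-driven mode from this commit: flatten the ranges into a sorted array of boundaries and answer membership with a binary search instead of one branch per range. The sketch below assumes that layout for illustration only. `MakeBoundaryArray` and `IsInBoundaryArray` are hypothetical names, a `std::vector` stands in for the ByteArray, and the actual encoding used by `IsCharacterInRangeArray` may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a character range; `to` is inclusive.
struct Range {
  uint16_t from;
  uint16_t to;
};

// Flattens sorted, non-overlapping ranges into boundaries
// [from0, to0 + 1, from1, to1 + 1, ...]. Boundaries are widened to 32 bits
// so that to == 0xFFFF does not overflow.
std::vector<uint32_t> MakeBoundaryArray(const std::vector<Range>& ranges) {
  std::vector<uint32_t> boundaries;
  for (const Range& r : ranges) {
    assert(r.from <= r.to);
    boundaries.push_back(r.from);
    boundaries.push_back(uint32_t{r.to} + 1);
  }
  return boundaries;
}

// A character is inside the class iff an odd number of boundaries are <= c,
// i.e. the last boundary at or below c opened a range rather than closed one.
bool IsInBoundaryArray(const std::vector<uint32_t>& boundaries, uint32_t c) {
  auto it = std::upper_bound(boundaries.begin(), boundaries.end(), c);
  std::size_t count = static_cast<std::size_t>(it - boundaries.begin());
  return count % 2 == 1;
}
```

The lookup costs O(log n) comparisons for n ranges, and the array is plain data, which is what makes it cheap to deduplicate when the same class appears several times in a pattern.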
  9. 14 Oct, 2021 1 commit
  10. 12 Oct, 2021 1 commit
  11. 19 Aug, 2021 2 commits
  12. 10 Aug, 2021 1 commit
    • [regexp] Handle another regexp-too-big path for fuzzer suppressions · 3e21b6d0
      Jakob Gruber authored
      The behavior here depends on the platform and may also differ between
      fast and slow paths [0]. Crash to let the fuzzer know there's nothing
      interesting here.
      
      [0] The reason for the fast-slow-path difference is that sometimes we
      may trigger different compile jobs on these paths. One example is
      `split`, which creates a new regexp instance on the slow path, but
      reuses an existing instance on the fast path.
      
      Bug: chromium:1236845
      Change-Id: I87d9eb2601b235440014530d98df0e938b717650
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3080577
      Auto-Submit: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Michael Achenbach <machenbach@chromium.org>
      Reviewed-by: Michael Achenbach <machenbach@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#76197}
  13. 27 Jul, 2021 1 commit
  14. 01 Jul, 2021 1 commit
  15. 24 Jun, 2021 3 commits
  16. 18 Jun, 2021 1 commit
  17. 09 Jun, 2021 1 commit
    • [regexp] Propagate eats_at_least for negative lookahead · 363ab5ae
      Iain Ireland authored
      In issue 11290, we disabled the propagation of EAL data out of
      lookarounds, because it was incorrect for lookahead nodes in
      loops. This caused performance regressions: for example,
      `/^\P{Letter}+$/u` (matching only characters that are not in Unicode's
      Letter category) uses negative lookahead when matching lone
      surrogates, and became about 2x slower. I spent some time looking into
      fixes, and this is what I've settled on.
      
      Some background: the implementation of lookarounds in irregexp is
      split between positive and negative lookaheads. (Lookbehinds aren't
      relevant here, because backwards matches always have EAL=0.)  Positive
      lookaheads are wrapped in BEGIN_SUBMATCH and POSITIVE_SUBMATCH_SUCCESS
      ActionNodes. BEGIN_SUBMATCH saves the current state.
      POSITIVE_SUBMATCH_SUCCESS restores the necessary state (while leaving
      any captures that occurred during the lookaround intact).
      
      Negative lookaheads also begin with a BEGIN_SUBMATCH node, but follow
      it with a NegativeLookaroundChoiceNode. This node has two successors:
      a lookaround node, and a continue node. It only executes the continue
      node if the lookaround node backtracks, which automatically restores
      the previous state. Negative lookarounds also can't update captures.
      
      This affects EAL calculations. It turns out that negative lookaheads
      are already doing the right thing: EatsAtLeastPropagator only
      propagates information from the continue node, ignoring the lookaround
      node. The same is true for quick checks (see the comment in
      RegExpLookaround::Builder::ForMatch). A BEGIN_SUBMATCH for a negative
      lookahead can simply propagate the EAL data from its successor like
      any other ActionNode, and everything works.
      
      Positive lookaheads are harder. I tried saving a pointer to the
      successor in BEGIN_SUBMATCH, but ran into problems in FillInBMInfo,
      because the EAL value corresponded to the nodes after the lookahead,
      but the analysis was still looking at the nodes inside. I fell back
      to a more modest approach: split BEGIN_SUBMATCH in two, and propagate
      EAL info for BEGIN_NEGATIVE_SUBMATCH while keeping the current
      behaviour for BEGIN_POSITIVE_SUBMATCH. This fixes the performance
      regression at hand.
      
      Two potential approaches for fixing EAL for positive lookahead are:
       1. Handling positive lookahead with its own dedicated choice node,
          like NegativeLookaroundChoiceNode.
       2. Adding an eats_at_least_inside_loop field to EatsAtLeastInfo,
          which is <= eats_at_least_from_possibly_start, and using that
          value in EatsAtLeastFromLoopEntry.
      
      Both of those approaches are more complex than I want to tackle
      right now, though.
      
      Bug: v8:11844
      Change-Id: I2a43509c2c21194b8c18f0a587fa21c194db76c2
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2934858
      Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#75031}
  18. 08 Apr, 2021 2 commits
  19. 08 Feb, 2021 1 commit
  20. 14 Jan, 2021 1 commit
  21. 24 Nov, 2020 1 commit
  22. 09 Nov, 2020 1 commit
  23. 10 Jul, 2020 1 commit
  24. 10 Jun, 2020 3 commits
  25. 03 Jun, 2020 1 commit
  26. 28 Apr, 2020 1 commit
    • [regexp] Handlify RegExpCompileData::code · 6bb3f0c0
      Iain Ireland authored
      RegExpMacroAssembler::GetCode returns a Handle<Object>. However, that
      Handle is almost immediately dereferenced, and is stored as a bare
      Object in both RegExpCompiler::CompilationResult and RegExpCompileData.
      
      This makes SpiderMonkey's rooting hazard analysis somewhat
      antsy. While RegExpCompileData is alive on the stack, the hazard
      analysis will not allow any calls that might GC, because it isn't
      smart enough to prove that the code field can't be clobbered by a GC.
      
      As far as I can tell, there is no real hazard here, but storing a
      Handle in RegExpCompileData instead of a bare Object will simplify SM
      and prevent a future patch from accidentally breaking something.
      
      Bug: v8:10406
      Change-Id: I9642dd05c591bfd23b340a89df2f2bf5c9fcac2c
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2161578
      Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67441}
  27. 21 Apr, 2020 2 commits
    • [regexp] Consistent expectations for output registers · fe609139
      Jakob Gruber authored
      ... between the interpreter and generated code.
      
      Prior to this CL, pre- and postconditions on the output register
      array differed between the interpreter and generated code.
      
      Interpreter
      Pre: `output` fits captures and temporary registers.
      Post: None.
      
      Generated code
      Pre:  `output` fits capture registers.
      Post: `output` is modified if and only if the match succeeded.
      
      This CL changes the interpreter to match the pre- and postconditions
      of generated code by allocating space for temporary registers inside
      the interpreter.
      
      Drive-by: Add MaxRegisterCount, RegistersForCaptureCount helpers.
      
      Bug: chromium:1067270
      Change-Id: I2900ef2f31207d817ec7ead3e0e2215b23b398f0
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2135642
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: Leszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67268}
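The contract this commit settles on can be sketched as below: the caller's `output` only needs room for capture registers, the matcher keeps its temporaries in a private register file, and `output` is written only when the match succeeds. Everything here is a hypothetical illustration. `InterpretWithInternalRegisters` and the `body` callback (a stand-in for bytecode execution) are invented names, and the `(capture_count + 1) * 2` formula is an assumption about what the `RegistersForCaptureCount` helper computes: a start and end register per capture plus a pair for the whole match.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed formula: two registers per capture group, plus two for the
// whole match.
int RegistersForCaptureCount(int capture_count) {
  return (capture_count + 1) * 2;
}

// `total_registers` counts capture registers plus temporaries. `body`
// receives the full register file and returns whether the match succeeded.
template <typename Body>
bool InterpretWithInternalRegisters(int capture_count, int total_registers,
                                    int* output, int output_length,
                                    Body body) {
  const int capture_registers = RegistersForCaptureCount(capture_count);
  assert(output_length >= capture_registers);  // Pre: fits capture registers.
  assert(total_registers >= capture_registers);

  // Temporaries live here, not in the caller's array.
  std::vector<int> registers(total_registers, -1);
  if (!body(registers)) return false;  // Post: `output` left untouched.

  std::copy_n(registers.begin(), capture_registers, output);
  return true;
}
```

Because the copy happens only on success, a failed match cannot leave partial capture data in the caller's array, which mirrors the "modified if and only if the match succeeded" postcondition listed for generated code.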
    • [regexp] Factor out PreprocessRegExp · 58ac66b7
      Iain Ireland authored
      RegExpImpl::Compile does a number of transformations that require
      directly manipulating the internal representation of the regexp. For
      example, when matching a (non-sticky, non-anchored) regular
      expression, the pattern must be wrapped in .* so that it can match
      anywhere in the input.
      
      In the interest of moving towards a cleaner division between irregexp
      and the outside world, it makes sense to move this code into
      RegExpCompiler.
      
      R=jgruber@chromium.org
      
      Bug: v8:10406
      Change-Id: I6da251c91c0016914a51480f80bb46c337fd0b23
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2140246
      Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#67262}
  28. 19 Mar, 2020 3 commits
    • Reland "[regexp] Rewrite error handling" · 560f2d8b
      Iain Ireland authored
      This is a reland of e80ca24c
      
      Original change's description:
      > [regexp] Rewrite error handling
      >
      > This patch modifies irregexp's error handling. Instead of representing
      > errors as C strings, they are represented as an enumeration value
      > (RegExpError), and only converted to strings when throwing the error
      > object in regexp.cc. This makes it significantly easier to integrate
      > into SpiderMonkey. A few notes:
      >
      > 1. Depending on whether the stack overflows during parsing or
      >    analysis, the stack overflow message can vary ("Stack overflow" or
      >    "Maximum call stack size exceeded"). I kept that behaviour in this
      >    patch, under the assumption that stack overflow messages are
      >    (sadly) the sorts of things that real world code ends up depending
      >    on.
      >
      > 2. Depending on the point in code where the error was identified,
      >    invalid unicode escapes could be reported as "Invalid Unicode
      >    escape", "Invalid unicode escape", or "Invalid Unicode escape
      >    sequence". I fervently hope that nobody depends on the specific
      >    wording of a syntax error, so I standardized on the first one. (It
      >    was both the most common, and the most consistent with other
      >    "Invalid X escape" messages.)
      >
      > 3. In addition to changing the representation, this patch also adds an
      >    error_pos field to RegExpParser and RegExpCompileData, which stores
      >    the position at which an error occurred. This is used by
      >    SpiderMonkey to provide more helpful messages about where a syntax
      >    error occurred in large regular expressions.
      >
      > 4. This model is closer to V8's existing MessageTemplate
      >    infrastructure. I considered trying to integrate it more closely
      >    with MessageTemplate, but since one of our stated goals for this
      >    project was to make it easier to use irregexp outside of V8, I
      >    decided to hold off.
      >
      > R=jgruber@chromium.org
      >
      > Bug: v8:10303
      > Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      > Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#66784}
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: Iad1f11a0e0b9e525d7499aacb56c27eff9e7c7b5
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2109952
      Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66798}
    • Revert "[regexp] Rewrite error handling" · 2193f691
      Leszek Swirski authored
      This reverts commit e80ca24c.
      
      Reason for revert: Causes failures in the fast/regex/non-pattern-characters.html Blink web test (https://ci.chromium.org/p/v8/builders/ci/V8%20Blink%20Linux/3679)
      
      Original change's description:
      > [regexp] Rewrite error handling
      > 
      > This patch modifies irregexp's error handling. Instead of representing
      > errors as C strings, they are represented as an enumeration value
      > (RegExpError), and only converted to strings when throwing the error
      > object in regexp.cc. This makes it significantly easier to integrate
      > into SpiderMonkey. A few notes:
      > 
      > 1. Depending on whether the stack overflows during parsing or
      >    analysis, the stack overflow message can vary ("Stack overflow" or
      >    "Maximum call stack size exceeded"). I kept that behaviour in this
      >    patch, under the assumption that stack overflow messages are
      >    (sadly) the sorts of things that real world code ends up depending
      >    on.
      > 
      > 2. Depending on the point in code where the error was identified,
      >    invalid unicode escapes could be reported as "Invalid Unicode
      >    escape", "Invalid unicode escape", or "Invalid Unicode escape
      >    sequence". I fervently hope that nobody depends on the specific
      >    wording of a syntax error, so I standardized on the first one. (It
      >    was both the most common, and the most consistent with other
      >    "Invalid X escape" messages.)
      > 
      > 3. In addition to changing the representation, this patch also adds an
      >    error_pos field to RegExpParser and RegExpCompileData, which stores
      >    the position at which an error occurred. This is used by
      >    SpiderMonkey to provide more helpful messages about where a syntax
      >    error occurred in large regular expressions.
      > 
      > 4. This model is closer to V8's existing MessageTemplate
      >    infrastructure. I considered trying to integrate it more closely
      >    with MessageTemplate, but since one of our stated goals for this
      >    project was to make it easier to use irregexp outside of V8, I
      >    decided to hold off.
      > 
      > R=jgruber@chromium.org
      > 
      > Bug: v8:10303
      > Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      > Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      > Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      > Cr-Commit-Position: refs/heads/master@{#66784}
      
      TBR=jgruber@chromium.org,iireland@mozilla.com
      
      Change-Id: I9247635f3c5b17c943b9c4abaf82ebe7b2de165e
      No-Presubmit: true
      No-Tree-Checks: true
      No-Try: true
      Bug: v8:10303
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2108550
      Reviewed-by: Leszek Swirski <leszeks@chromium.org>
      Commit-Queue: Leszek Swirski <leszeks@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66786}
    • [regexp] Rewrite error handling · e80ca24c
      Iain Ireland authored
      This patch modifies irregexp's error handling. Instead of representing
      errors as C strings, they are represented as an enumeration value
      (RegExpError), and only converted to strings when throwing the error
      object in regexp.cc. This makes it significantly easier to integrate
      into SpiderMonkey. A few notes:
      
      1. Depending on whether the stack overflows during parsing or
         analysis, the stack overflow message can vary ("Stack overflow" or
         "Maximum call stack size exceeded"). I kept that behaviour in this
         patch, under the assumption that stack overflow messages are
         (sadly) the sorts of things that real world code ends up depending
         on.
      
      2. Depending on the point in code where the error was identified,
         invalid unicode escapes could be reported as "Invalid Unicode
         escape", "Invalid unicode escape", or "Invalid Unicode escape
         sequence". I fervently hope that nobody depends on the specific
         wording of a syntax error, so I standardized on the first one. (It
         was both the most common, and the most consistent with other
         "Invalid X escape" messages.)
      
      3. In addition to changing the representation, this patch also adds an
         error_pos field to RegExpParser and RegExpCompileData, which stores
         the position at which an error occurred. This is used by
         SpiderMonkey to provide more helpful messages about where a syntax
         error occurred in large regular expressions.
      
      4. This model is closer to V8's existing MessageTemplate
         infrastructure. I considered trying to integrate it more closely
         with MessageTemplate, but since one of our stated goals for this
         project was to make it easier to use irregexp outside of V8, I
         decided to hold off.
      
      R=jgruber@chromium.org
      
      Bug: v8:10303
      Change-Id: I62605fd2def2fc539f38a7e0eefa04d36e14bbde
      Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2091863
      Commit-Queue: Jakob Gruber <jgruber@chromium.org>
      Reviewed-by: Jakob Gruber <jgruber@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#66784}
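The shape of the error-handling change described above (an enumeration carried through the pipeline, converted to text only at the reporting boundary, plus an error position) might look roughly like this. It is a reduced sketch: the enumerators shown are an illustrative subset, and `CompileResult` and `DescribeError` are hypothetical names standing in for whatever an embedder does with the `error` and `error_pos` fields.

```cpp
#include <cassert>
#include <string>

// Illustrative subset of error kinds; kNumErrors is a sentinel.
enum class RegExpError {
  kNone,
  kStackOverflow,
  kInvalidUnicodeEscape,
  kNumErrors,
};

// The only place where an error becomes a string. The messages are taken
// from the commit message above.
const char* RegExpErrorString(RegExpError error) {
  assert(error < RegExpError::kNumErrors);
  switch (error) {
    case RegExpError::kNone:
      return "";
    case RegExpError::kStackOverflow:
      return "Maximum call stack size exceeded";
    case RegExpError::kInvalidUnicodeEscape:
      return "Invalid Unicode escape";
    case RegExpError::kNumErrors:
      break;
  }
  return "";
}

// Hypothetical result record carrying the error kind and its position.
struct CompileResult {
  RegExpError error = RegExpError::kNone;
  int error_pos = 0;
};

// Hypothetical embedder-side formatting that uses the position.
std::string DescribeError(const CompileResult& result) {
  if (result.error == RegExpError::kNone) return "";
  return std::string(RegExpErrorString(result.error)) + " at position " +
         std::to_string(result.error_pos);
}
```

Keeping the enum-to-string mapping in one function is what lets a second embedder substitute its own messages or localization without touching the parser or compiler.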
  29. 17 Mar, 2020 1 commit
  30. 16 Mar, 2020 1 commit