1. 11 Dec, 2017 1 commit
    • Implement DFA Unicode Decoder · cedec225
      Justin Ridgewell authored
      This is a separation of the DFA Unicode Decoder from
      https://chromium-review.googlesource.com/c/v8/v8/+/789560
      
      I attempted to make the DFA's table a bit more explicit in this CL. Still, the
      linter prevents me from presenting the array as a "table" in source
      code. For a better representation, please refer to
      https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA
      
      - - - - -
      
      Now for a big copy-paste from 789560:
      
      Essentially, this reworks a standard FSM (imagine an
      array of structs) and flattens it out into a single-dimension array.
      Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
      http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
      map all bytes into one of 12 character classes:
      
      00. 0x00-0x7F
      01. 0x80-0x8F (split from general continuation because this range is not
          valid after a 0xF0 leading byte)
      02. 0x90-0x9F (split from general continuation because this range is not
          valid after a 0xE0 nor a 0xF4 leading byte)
      03. 0xA0-0xBF (the rest of the continuation range)
      04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
          includes 255 which we use as a known bad byte during hex-to-int
          decoding)
      05. 0xC2-0xDF (leading bytes which require any continuation byte
          afterwards)
      06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
          continuation byte after that)
      07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which require any continuation
          byte afterwards then any continuation byte after that)
      08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
          continuation byte after that)
      09. 0xF1-0xF3 (leading bytes which require any continuation byte
          afterwards then any continuation byte then any continuation byte)
      10. 0xF0 (leading byte which requires a 0x90-0xBF afterwards then any
          continuation byte then any continuation byte)
      11. 0xF4 (leading byte which requires a 0x80-0x8F afterwards then any
          continuation byte then any continuation byte)
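
The class list above can be sketched as a plain range check. This is only an illustration of the mapping, not the table from the CL: the names `ByteClass`, `BuildClassTable` and `kNumClasses` are hypothetical, and a real decoder would more likely index a precomputed 256-entry array than branch on each byte.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kNumClasses = 12;  // hypothetical name for the class count

// Maps a byte to one of the 12 character classes, numbered as in the list.
inline uint8_t ByteClass(uint8_t b) {
  if (b <= 0x7F) return 0;    // ASCII
  if (b <= 0x8F) return 1;    // continuation, low range
  if (b <= 0x9F) return 2;    // continuation, middle range
  if (b <= 0xBF) return 3;    // continuation, high range
  if (b <= 0xC1) return 4;    // invalid: 0xC0-0xC1
  if (b <= 0xDF) return 5;    // two-byte lead
  if (b == 0xE0) return 6;    // three-byte lead, restricted second byte
  if (b == 0xED) return 8;    // three-byte lead, excludes surrogates
  if (b <= 0xEF) return 7;    // remaining three-byte leads
  if (b == 0xF0) return 10;   // four-byte lead, restricted second byte
  if (b <= 0xF3) return 9;    // four-byte leads, any continuation
  if (b == 0xF4) return 11;   // four-byte lead, capped at U+10FFFF
  return 4;                   // invalid: 0xF5-0xFF
}

// Precomputes the mapping so that a lookup needs no branches.
inline std::array<uint8_t, 256> BuildClassTable() {
  std::array<uint8_t, 256> table{};
  for (int b = 0; b < 256; ++b) {
    table[b] = ByteClass(static_cast<uint8_t>(b));
    assert(table[b] < kNumClasses);
  }
  return table;
}
```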
      
      Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
      needed to represent the transition state ("9, 10, 10, 10" vs.
      "10, 9, 9, 9").
      
      Using these 12 classes as "transitions", we can map from one state to
      the next. Each state is defined as some multiple of 12, so that we're
      always starting at the 0th column of each row of the FSM. From each
      state, we add the transition and get an index of the new row the FSM is
      entering.
      
      If at any point we encounter a bad byte, the state + bad-byte-transition
      is guaranteed to map us into the first row of the FSM (which contains no
      valid exiting transitions).
      
      The key difference from Björn's original (or his self-modified) DFA is that
      the "bad" state is now mapped to 0 (or the first row of the FSM) instead
      of 12 (the second row). This saves ~50 bytes when gzipping, and also
      speeds up determining if a string is properly encoded (see his sample
      code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).
      
      Finally, I've replaced his ternary check with an array access, to make
      the algorithm branchless. This places a requirement on the caller to 0
      out the code point between successful decodings, which it could always
      have done because it's already branching.
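
A minimal sketch of the scheme described above, assuming a row layout of my own choosing. The state names, the row order, and the `kStates` and `kMasks` tables are reconstructed from the prose for illustration and are not copied from the CL. `ByteClass` stands in for what would normally be a 256-entry lookup, so only the state transition and the code point update are branch-free here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Each state is the offset of its row in the flattened table (a multiple of 12).
enum State : uint8_t {
  kReject = 0,      // first row: no way out
  kAccept = 12,     // start state, and the state after a complete sequence
  kOneMore = 24,    // expects one continuation byte
  kTwoMore = 36,    // expects two continuation bytes
  kAfterE0 = 48,    // expects 0xA0-0xBF, then one continuation byte
  kAfterED = 60,    // expects 0x80-0x9F, then one continuation byte
  kThreeMore = 72,  // expects three continuation bytes
  kAfterF0 = 84,    // expects 0x90-0xBF, then two continuation bytes
  kAfterF4 = 96,    // expects 0x80-0x8F, then two continuation bytes
};

// Byte -> character class (0..11), following the list in the message.
inline uint8_t ByteClass(uint8_t b) {
  if (b <= 0x7F) return 0;
  if (b <= 0x8F) return 1;
  if (b <= 0x9F) return 2;
  if (b <= 0xBF) return 3;
  if (b <= 0xC1) return 4;
  if (b <= 0xDF) return 5;
  if (b == 0xE0) return 6;
  if (b == 0xED) return 8;
  if (b <= 0xEF) return 7;
  if (b == 0xF0) return 10;
  if (b <= 0xF3) return 9;
  if (b == 0xF4) return 11;
  return 4;
}

// Flattened FSM: index = state + class, value = next state.
constexpr uint8_t kStates[9 * 12] = {
    // 0   1   2   3   4   5   6   7   8   9  10  11
    0,  0,  0,  0,  0, 0,  0,  0,  0,  0,  0,  0,   // kReject
    12, 0,  0,  0,  0, 24, 48, 36, 60, 72, 84, 96,  // kAccept
    0,  12, 12, 12, 0, 0,  0,  0,  0,  0,  0,  0,   // kOneMore
    0,  24, 24, 24, 0, 0,  0,  0,  0,  0,  0,  0,   // kTwoMore
    0,  0,  0,  24, 0, 0,  0,  0,  0,  0,  0,  0,   // kAfterE0
    0,  24, 24, 0,  0, 0,  0,  0,  0,  0,  0,  0,   // kAfterED
    0,  36, 36, 36, 0, 0,  0,  0,  0,  0,  0,  0,   // kThreeMore
    0,  0,  36, 36, 0, 0,  0,  0,  0,  0,  0,  0,   // kAfterF0
    0,  36, 0,  0,  0, 0,  0,  0,  0,  0,  0,  0,   // kAfterF4
};

// Payload bits contributed by a byte of each class (array access, no ternary).
constexpr uint8_t kMasks[12] = {0x7F, 0x3F, 0x3F, 0x3F, 0x00, 0x1F,
                                0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07};

// Feeds one byte. The caller must reset *code_point to 0 after each
// completed code point, because the update always shifts in the old value.
inline void Decode(uint8_t byte, uint8_t* state, uint32_t* code_point) {
  assert(*state % 12 == 0 && *state <= kAfterF4);
  uint8_t type = ByteClass(byte);
  *state = kStates[*state + type];
  *code_point = (*code_point << 6) | (byte & kMasks[type]);
}

// Validation only: no per-byte error check is needed, since kReject is sticky.
inline bool IsValidUtf8(const uint8_t* bytes, size_t length) {
  uint8_t state = kAccept;
  uint32_t ignored = 0;
  for (size_t i = 0; i < length; ++i) Decode(bytes[i], &state, &ignored);
  return state == kAccept;
}
```

Because every entry in the reject row is 0, a validator can run to the end of the input and compare the final state against `kAccept` once, which is the speed-up for checking strings that the message refers to.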
      
      R=marja@google.com
      
      Bug: 
      Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
      Reviewed-on: https://chromium-review.googlesource.com/805357
      Commit-Queue: Justin Ridgewell <jridgewell@google.com>
      Reviewed-by: Marja Hölttä <marja@chromium.org>
      Reviewed-by: Mathias Bynens <mathias@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#50012}
  2. 02 Oct, 2017 1 commit
    • [parser] Add use counter for U+2028 & U+2029 · d3c98121
      Mathias Bynens authored
      The context is the following proposal to make JSON a subset of
      JavaScript: https://github.com/tc39/proposal-json-superset
      
      There’s interest in performing a side investigation to answer the
      question of what would happen if we stopped treating U+2028 and U+2029
      as `LineTerminator`s *entirely*. (Note that this is separate from the
      proposal, which just changes how these characters are handled in
      ECMAScript strings.) This is technically a breaking change, and IMHO it
      would be wonderful if we could get away with it, but no one really has
      any data on whether or not we could. Adding this use counter lets us get
      that data.
      
      BUG=v8:6827
      
      Cq-Include-Trybots: master.tryserver.chromium.linux:linux_chromium_rel_ng
      Change-Id: Ia22e8db1634df4d3f965bec8e1cfa11cc7b5e9aa
      Reviewed-on: https://chromium-review.googlesource.com/693155
      Commit-Queue: Mathias Bynens <mathias@chromium.org>
      Reviewed-by: Marja Hölttä <marja@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#48260}
  3. 29 Sep, 2017 1 commit
  4. 05 Sep, 2017 1 commit
  5. 29 Jun, 2017 1 commit
  6. 14 Jun, 2017 1 commit
    • Use ICU for ID_START, ID_CONTINUE and WhiteSpace check · 4aeb94a4
      jshin authored
      Use ICU to check ID_Start, ID_Continue and WhiteSpace even for BMP
      when V8_INTL_SUPPORT is on (which is default).
      
      Change LineTerminator::Is() to check 4 code points from
      ES#sec-line-terminators instead of using tables and Lookup function.
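
For reference, the four code points that ECMAScript defines as line terminators are LF, CR, LS and PS, so a table-free check could look roughly like the sketch below. The free function name is illustrative and is not claimed to match the code in this change.

```cpp
#include <cassert>
#include <cstdint>

// True for the four ECMAScript LineTerminator code points.
inline bool IsLineTerminator(uint32_t c) {
  assert(c <= 0x10FFFF);  // expects a Unicode code point
  return c == 0x000A ||   // LINE FEED
         c == 0x000D ||   // CARRIAGE RETURN
         c == 0x2028 ||   // LINE SEPARATOR
         c == 0x2029;     // PARAGRAPH SEPARATOR
}
```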
      
      Remove Lowercase::Is(). It's not used anywhere.
      
      Update webkit/{ToNumber,parseFloat}.js to have the correct expectation
      for U+180E and the corresponding expected files. This is a follow-up to
      an earlier change ( https://codereview.chromium.org/2720953003 ).
      
      CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_win_dbg,v8_mac_dbg;master.tryserver.chromium.android:android_arm64_dbg_recipe
      CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_linux_noi18n_rel_ng
      
      BUG=v8:5370,v8:5155
      TEST=unittests --gtest_filter=CharP*
      TEST=webkit: ToNumber, parseFloat
      TEST=test262: built-ins/Number/S9.3*, built-ins/parse{Int,Float}/S15*
      TEST=test262: language/white-space/mong*
      TEST=test262: built-ins/String/prototype/trim/u180e
      TEST=mjsunit: whitespaces
      
      Review-Url: https://codereview.chromium.org/2331303002
      Cr-Commit-Position: refs/heads/master@{#45957}
  7. 23 May, 2017 1 commit
    • [wasm] Also kBadChar is a valid utf8 character · 8e0daf78
      Andreas Haas authored
      The validation of utf8 strings in WebAssembly modules used the character
      kBadChar = 0xFFFD to indicate a validation error. However, this
      character can appear in a valid utf8 string. This CL fixes this problem
      by duplicating some of the code in {Utf8::CalculateValue} and inlining
      it directly into Utf8::Validate. Note that Utf8::Validate is used only
      for WebAssembly.
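
The ambiguity can be reproduced with a small stand-alone decoder. Everything below is a hypothetical sketch (`DecodeOne`, `ValidateBySentinel` and `ValidateByStatus` are not V8 functions); it only shows why comparing the decoded value against 0xFFFD rejects the valid byte sequence EF BF BD, while an explicit status flag does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kBadChar = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

struct Decoded {
  uint32_t value;  // decoded code point, or kBadChar on error
  bool ok;         // explicit success flag
  size_t length;   // bytes consumed (at least 1 when input is non-empty)
};

// Decodes one UTF-8 sequence from p[0..n). Illustrative, not optimized.
inline Decoded DecodeOne(const uint8_t* p, size_t n) {
  const Decoded bad{kBadChar, false, 1};
  if (n == 0) return {kBadChar, false, 0};
  uint8_t b0 = p[0];
  if (b0 < 0x80) return {b0, true, 1};
  size_t len;
  uint32_t value, min;
  if (b0 >= 0xC2 && b0 <= 0xDF) {
    len = 2; value = b0 & 0x1F; min = 0x80;
  } else if (b0 >= 0xE0 && b0 <= 0xEF) {
    len = 3; value = b0 & 0x0F; min = 0x800;
  } else if (b0 >= 0xF0 && b0 <= 0xF4) {
    len = 4; value = b0 & 0x07; min = 0x10000;
  } else {
    return bad;
  }
  if (n < len) return bad;
  for (size_t i = 1; i < len; ++i) {
    if ((p[i] & 0xC0) != 0x80) return bad;
    value = (value << 6) | (p[i] & 0x3F);
  }
  // Reject overlong forms, surrogates and values above U+10FFFF.
  if (value < min || value > 0x10FFFF ||
      (value >= 0xD800 && value <= 0xDFFF)) {
    return bad;
  }
  return {value, true, len};
}

// Flawed: treats a legitimately encoded U+FFFD as an error.
inline bool ValidateBySentinel(const uint8_t* p, size_t n) {
  for (size_t i = 0; i < n;) {
    Decoded d = DecodeOne(p + i, n - i);
    if (d.value == kBadChar) return false;
    assert(d.length > 0);
    i += d.length;
  }
  return true;
}

// Fixed: relies on the status flag instead of the decoded value.
inline bool ValidateByStatus(const uint8_t* p, size_t n) {
  for (size_t i = 0; i < n;) {
    Decoded d = DecodeOne(p + i, n - i);
    if (!d.ok) return false;
    assert(d.length > 0);
    i += d.length;
  }
  return true;
}
```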
      
      Tests for this change are in the WebAssembly spec tests, which I will
      update in a separate CL.
      
      R=vogelheim@chromium.org
      
      Change-Id: I8697b9299f3e98a8eafdf193bff8bdff90efd7dc
      Reviewed-on: https://chromium-review.googlesource.com/509534
      Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org>
      Commit-Queue: Andreas Haas <ahaas@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#45476}
  8. 17 Oct, 2016 1 commit
  9. 05 Oct, 2016 1 commit
  10. 16 Sep, 2016 1 commit
    • Rework scanner-character-streams. · 642d6d31
      vogelheim authored
      - Smaller, more consistent streams API (Advance, Back, pos, Seek)
      - Remove implementations from the header, in favor of creation functions.
      
      Observe:
      - Performance:
        - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
          body of only a few instructions. I expect most calls to end up there.
        - There used to be performance problems w/ bookmarking, particularly
          with copying too much data on SetBookmark w/ UTF-8 streaming streams.
          All those copies are gone.
        - The old streaming streams implementation used to copy data even for
          2-byte input. It no longer does.
        - The only remaining 'slow' method is the Seek(.) slow case for UTF-8
          streaming streams. I don't expect this to be called a lot; and even if
          it is, I expect it to be offset by the gains in the (vastly more
          frequent) calls to the other methods or the 'fast path'.
        - If it still bothers us, there are several ways to speed it up.
      - API & code cleanliness:
        - I want to remove the 'old' API in a follow-up CL, which should mostly
          delete code, or replace it 1:1.
        - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
          for streaming streams.
        - The "bookmark" is now always implemented (and mostly very fast), so we
          should be able to use it for more things.
      - Testing & correctness:
        - The unit tests now cover all stream implementations,
          and are pretty good at triggering all the edge cases.
        - Vastly more DCHECKs of the invariants.
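
To make the four-method API shape concrete, here is a toy in-memory stream. It is an assumption-laden sketch: the class name, the `kEndOfInput` value and the fully buffered storage are mine, and none of the block refilling or UTF-8 handling that the real streams need is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stream over a fully buffered UTF-16 string.
class SimpleUtf16Stream {
 public:
  static constexpr int32_t kEndOfInput = -1;

  explicit SimpleUtf16Stream(std::u16string data) : data_(std::move(data)) {}

  // Returns the next code unit and moves forward. At the end of the input it
  // returns kEndOfInput but still advances, so Back() undoes exactly one call.
  int32_t Advance() {
    if (pos_ < data_.size()) return data_[pos_++];
    ++pos_;
    return kEndOfInput;
  }

  // Steps back by one position.
  void Back() {
    assert(pos_ > 0);
    --pos_;
  }

  // Current position, counted in code units from the start.
  size_t pos() const { return pos_; }

  // Jumps to an absolute position; trivial here, the costly case for streams
  // that have to re-decode UTF-8 input.
  void Seek(size_t pos) { pos_ = pos; }

 private:
  std::u16string data_;
  size_t pos_ = 0;
};
```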
      
      BUG=v8:4947
      
      Review-Url: https://codereview.chromium.org/2314663002
      Cr-Commit-Position: refs/heads/master@{#39464}
  11. 12 May, 2016 1 commit
    • [wasm] Add UTF-8 validation · f0523e30
      clemensh authored
      Names passed for imports and exports are checked during decoding,
      leading to errors if they are not valid UTF-8. Function names are not
      checked during decoding; instead, undefined is returned at
      runtime if they are not valid UTF-8.
      
      We need to do these checks on the Wasm side, since the factory
      methods assume that they receive valid UTF-8 strings.
      
      R=titzer@chromium.org, yangguo@chromium.org
      
      Review-Url: https://codereview.chromium.org/1967023004
      Cr-Commit-Position: refs/heads/master@{#36208}
  12. 02 Feb, 2016 2 commits
  13. 05 Feb, 2015 1 commit
  14. 05 Nov, 2014 1 commit
  15. 08 Oct, 2014 1 commit
  16. 03 Jun, 2014 1 commit
  17. 29 Apr, 2014 1 commit
  18. 10 Feb, 2014 1 commit
  19. 20 Jan, 2014 1 commit
  20. 07 Nov, 2013 1 commit
  21. 04 Oct, 2013 1 commit
  22. 13 Mar, 2013 2 commits
  23. 21 Jan, 2013 1 commit
  24. 16 Jan, 2013 1 commit
  25. 09 Jan, 2013 1 commit
  26. 03 Jan, 2013 1 commit
  27. 20 Dec, 2012 2 commits
  28. 19 Dec, 2012 1 commit
  29. 28 Aug, 2012 1 commit
  30. 12 Mar, 2012 2 commits
  31. 29 Nov, 2011 1 commit
  32. 18 Mar, 2011 3 commits
  33. 14 Sep, 2010 1 commit
  34. 30 Jul, 2010 1 commit