  1. 06 Feb, 2019 1 commit
  2. 05 Feb, 2019 1 commit
  3. 04 Feb, 2019 1 commit
  4. 11 Dec, 2017 1 commit
    • Implement DFA Unicode Decoder · cedec225
      Justin Ridgewell authored
      This is a separation of the DFA Unicode Decoder from
      https://chromium-review.googlesource.com/c/v8/v8/+/789560
      
      I attempted to make the DFA's table a bit more explicit in this CL. Still, the
      linter prevents me from presenting the array as a "table" in source code. For
      a better representation, please refer to
      https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA
      
      - - - - -
      
      Now for a big copy-paste from 789560:
      
      Essentially, this reworks a standard FSM (imagine an
      array of structs) and flattens it out into a single-dimension array.
      Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
      http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
      map all bytes into one of 12 character classes:
      
      00. 0x00-0x7F
      01. 0x80-0x8F (split from general continuation because this range is not
          valid after a 0xF0 leading byte)
      02. 0x90-0x9F (split from general continuation because this range is not
          valid after a 0xE0 nor a 0xF4 leading byte)
      03. 0xA0-0xBF (the rest of the continuation range)
      04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
          includes 255 which we use as a known bad byte during hex-to-int
          decoding)
      05. 0xC2-0xDF (leading bytes which require any continuation byte
          afterwards)
      06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
          continuation byte after that)
      07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which require any continuation byte
          afterwards then any continuation byte after that)
      08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
          continuation byte after that)
      09. 0xF1-0xF3 (leading bytes which require any continuation byte
          afterwards then any continuation byte then any continuation byte)
      10. 0xF0 (leading byte which requires a 0x90-0xBF afterwards then any
          continuation byte then any continuation byte)
      11. 0xF4 (leading byte which requires a 0x80-0x8F afterwards then any
          continuation byte then any continuation byte)
      
      Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
      needed to represent the transition state ("9, 10, 10, 10" vs.
      "10, 9, 9, 9").
      
      Using these 12 classes as "transitions", we can map from one state to
      the next. Each state is defined as some multiple of 12, so that we're
      always starting at the 0th column of each row of the FSM. From each
      state, we add the transition and get an index of the new row the FSM is
      entering.
      
      If at any point we encounter a bad byte, the state + bad-byte-transition
      is guaranteed to map us into the first row of the FSM (which contains no
      valid exiting transitions).
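
      As an illustration of the flattened-table idea, here is a minimal sketch
      restricted to ASCII plus two-byte sequences so the table stays tiny. It is
      not V8's decoder: the real CL uses the 12 byte classes above and a
      correspondingly wider table, and the classes, states, and transitions
      below are illustrative only.

        #include <cstddef>
        #include <cstdint>
        #include <cstdio>

        constexpr int kNumClasses = 4;   // 0: ASCII, 1: continuation, 2: two-byte lead, 3: invalid
        constexpr uint8_t kBad = 0;      // row 0: the bad state, with no valid exits
        constexpr uint8_t kAccept = 4;   // row 1: accepting / start state
        constexpr uint8_t kTwoByte = 8;  // row 2: one continuation byte still expected

        // Each state is a multiple of kNumClasses, so state + class indexes
        // straight into the flat table; every invalid transition lands in row 0.
        constexpr uint8_t kStates[3 * kNumClasses] = {
            kBad,    kBad,    kBad,     kBad,  // from kBad: stuck
            kAccept, kBad,    kTwoByte, kBad,  // from kAccept
            kBad,    kAccept, kBad,     kBad,  // from kTwoByte
        };

        int ByteClass(uint8_t byte) {
          if (byte <= 0x7F) return 0;                  // ASCII
          if (byte <= 0xBF) return 1;                  // continuation 0x80-0xBF
          if (byte >= 0xC2 && byte <= 0xDF) return 2;  // two-byte lead
          return 3;                                    // invalid in this toy subset
        }

        bool Validate(const uint8_t* bytes, size_t length) {
          uint8_t state = kAccept;
          for (size_t i = 0; i < length; i++) {
            state = kStates[state + ByteClass(bytes[i])];
          }
          return state == kAccept;
        }

        int main() {
          const uint8_t good[] = {'A', 0xC3, 0xA9};  // "A" followed by U+00E9
          const uint8_t bad[] = {0xC3, 'A'};         // truncated two-byte sequence
          std::printf("%d %d\n", Validate(good, sizeof good), Validate(bad, sizeof bad));
        }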
      
      The key difference from Björn's original (or his self-modified) DFA is
      that the "bad" state is now mapped to 0 (or the first row of the FSM) instead
      of 12 (the second row). This saves ~50 bytes when gzipping, and also
      speeds up determining if a string is properly encoded (see his sample
      code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).
      
      Finally, I've replaced his ternary check with an array access, to make
      the algorithm branchless. This places a requirement on the caller to 0
      out the code point between successful decodings, which it could always
      have done because it's already branching.
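
      For illustration, here is the same idea in the toy four-class setting from
      the sketch above (the masks are illustrative, not V8's table): because the
      caller zeroes the buffer after every completed code point, shifting an
      empty buffer is a no-op and the lead-byte case needs no branch.

        #include <cstdint>
        #include <cstdio>

        // Payload mask per byte class (ASCII, continuation, two-byte lead, invalid).
        constexpr uint32_t kPayloadMask[4] = {0x7F, 0x3F, 0x1F, 0x00};

        // Branchless accumulation: correct for lead bytes only because *buffer is 0.
        inline void Accumulate(uint32_t* buffer, int byte_class, uint8_t byte) {
          *buffer = (*buffer << 6) | (byte & kPayloadMask[byte_class]);
        }

        int main() {
          uint32_t buffer = 0;           // caller-maintained, zeroed between characters
          Accumulate(&buffer, 2, 0xC3);  // two-byte lead
          Accumulate(&buffer, 1, 0xA9);  // continuation
          std::printf("U+%04X\n", buffer);  // prints U+00E9
        }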
      
      R=marja@google.com
      
      Bug: 
      Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
      Reviewed-on: https://chromium-review.googlesource.com/805357
      Commit-Queue: Justin Ridgewell <jridgewell@google.com>
      Reviewed-by: Marja Hölttä <marja@chromium.org>
      Reviewed-by: Mathias Bynens <mathias@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#50012}
  5. 02 Dec, 2017 1 commit
    • Normalize casing of hexadecimal digits · 822be9b2
      Mathias Bynens authored
      This patch normalizes the casing of hexadecimal digits in escape
      sequences of the form `\xNN` and integer literals of the form
      `0xNNNN`.
      
      Previously, the V8 code base used an inconsistent mixture of uppercase
      and lowercase.
      
      Google’s C++ style guide uses uppercase in its examples:
      https://google.github.io/styleguide/cppguide.html#Non-ASCII_Characters
      
      Moreover, uppercase letters more clearly stand out from the lowercase
      `x` (or `u`) characters at the start, as well as from lowercase letters
      elsewhere in strings.
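
      For illustration (these are not specific lines from the patch), the
      normalized style uses uppercase hex digits after the lowercase `\x` or
      `0x` prefix:

        #include <cstdint>

        const char kEscaped[] = "\xC2\xA0";  // rather than "\xc2\xa0"
        const uint32_t kReplacement = 0xFFFD;  // rather than 0xfffd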
      
      BUG=v8:7109
      TBR=marja@chromium.org,titzer@chromium.org,mtrofin@chromium.org,mstarzinger@chromium.org,rossberg@chromium.org,yangguo@chromium.org,mlippautz@chromium.org
      NOPRESUBMIT=true
      
      Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.linux:linux_chromium_rel_ng
      Change-Id: I790e21c25d96ad5d95c8229724eb45d2aa9e22d6
      Reviewed-on: https://chromium-review.googlesource.com/804294
      Commit-Queue: Mathias Bynens <mathias@chromium.org>
      Reviewed-by: Jakob Kummerow <jkummerow@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#49810}
  6. 01 Dec, 2017 1 commit
  7. 29 Sep, 2017 1 commit
  8. 26 Sep, 2017 1 commit
  9. 21 Sep, 2017 1 commit
    • [unicode] Return (the correct) errors for overlong / surrogate sequences. · 6389b7e6
      Marja Hölttä authored
      This fix is two-fold:
      
      1) Incremental UTF-8 decoding: Unify the handling of invalid UTF-8 between
      V8 and Blink.
      
      Incremental UTF-8 decoding used to allow some overlong sequences / invalid code
      points which Blink treated as errors. This caused V8's decoder and the Blink
      UTF-8 decoder to produce a different number of bytes, resulting in random
      failures when scripts were streamed (in particular, this was detected by the
      skipping inner functions feature, which adds CHECKs against expected function
      positions).
      
      2) Non-incremental UTF-8 decoding: return the correct number of invalid characters.
      
      According to the encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder
      ), the first byte of an overlong sequence / invalid code point generates an
      invalid character, and the rest of the bytes are not processed (i.e., pushed
      back to the byte stream). When they're handled, they will look like lonely
      continuation bytes, and will generate an invalid character each.
      
      As a result, an overlong 4-byte sequence should generate 4 invalid characters
      (not 1).
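
      As a rough sketch of that counting rule (an illustrative counter based on
      the well-formed byte ranges of Unicode Table 3-7, not V8's decoder), the
      overlong 4-byte example works out as follows:

        #include <cstdint>
        #include <cstdio>
        #include <vector>

        struct Range { uint8_t lo, hi; };

        // For a lead byte, returns how many trailing bytes are required and the
        // allowed range for the first of them; later trailing bytes must be
        // 0x80-0xBF. Returns false for bytes that can never start a sequence.
        static bool LeadInfo(uint8_t lead, int* trail_count, Range* first_trail) {
          if (lead >= 0xC2 && lead <= 0xDF) { *trail_count = 1; *first_trail = {0x80, 0xBF}; return true; }
          if (lead == 0xE0)                 { *trail_count = 2; *first_trail = {0xA0, 0xBF}; return true; }
          if (lead >= 0xE1 && lead <= 0xEC) { *trail_count = 2; *first_trail = {0x80, 0xBF}; return true; }
          if (lead == 0xED)                 { *trail_count = 2; *first_trail = {0x80, 0x9F}; return true; }
          if (lead >= 0xEE && lead <= 0xEF) { *trail_count = 2; *first_trail = {0x80, 0xBF}; return true; }
          if (lead == 0xF0)                 { *trail_count = 3; *first_trail = {0x90, 0xBF}; return true; }
          if (lead >= 0xF1 && lead <= 0xF3) { *trail_count = 3; *first_trail = {0x80, 0xBF}; return true; }
          if (lead == 0xF4)                 { *trail_count = 3; *first_trail = {0x80, 0x8F}; return true; }
          return false;
        }

        // Counts how many U+FFFD a spec-conformant decoder emits for |bytes|.
        static int CountReplacements(const std::vector<uint8_t>& bytes) {
          int replacements = 0;
          for (size_t i = 0; i < bytes.size();) {
            uint8_t b = bytes[i];
            if (b <= 0x7F) { i++; continue; }  // ASCII, never an error
            int trail;
            Range allowed;
            if (!LeadInfo(b, &trail, &allowed)) { replacements++; i++; continue; }
            size_t j = i + 1;
            int matched = 0;
            while (matched < trail && j < bytes.size() &&
                   bytes[j] >= allowed.lo && bytes[j] <= allowed.hi) {
              matched++; j++; allowed = {0x80, 0xBF};
            }
            if (matched == trail) { i = j; continue; }  // complete, valid sequence
            replacements++;  // maximal subpart -> a single U+FFFD
            i = j;           // resume right after the consumed subpart
          }
          return replacements;
        }

        int main() {
          // Overlong 4-byte encoding of U+20AC: F0 82 82 AC. The lead F0 requires
          // 0x90-0xBF next, so its maximal subpart is just "F0" (one U+FFFD), and
          // the three remaining bytes are each a lonely continuation (three more).
          std::printf("%d\n", CountReplacements({0xF0, 0x82, 0x82, 0xAC}));  // 4
        }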
      
      This is a potentially breaking change, since the (non-incremental) UTF-8
      decoding is exposed via the API (String::NewFromUtf8). The behavioral difference
      happens when the client is passing in invalid UTF-8 (containing overlong /
      surrogate sequences).
      
      However, afaict, this doesn't change the semantics of any JavaScript program:
      according to the ECMAScript spec, the program is a sequence of Unicode code
      points, and there's no way to invoke the UTF-8 decoding functionalities from
      inside JavaScript. It does, however, change the behavior of d8 when decoding
      source files which are invalid UTF-8.
      
      This doesn't change anything related to URI decoding (it already throws
      exceptions for overlong sequences / invalid code points).
      
      BUG: chromium:765608, chromium:758236, v8:5516
      Bug: 
      Change-Id: Ib029f6a8e87186794b092e4e8af32d01cee3ada0
      Reviewed-on: https://chromium-review.googlesource.com/671020
      Commit-Queue: Marja Hölttä <marja@chromium.org>
      Reviewed-by: Franziska Hinkelmann <franzih@chromium.org>
      Reviewed-by: Camillo Bruni <cbruni@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#48105}
  10. 05 Sep, 2017 1 commit
  11. 29 Jun, 2017 1 commit
  12. 14 Jun, 2017 1 commit
    • Use ICU for ID_START, ID_CONTINUE and WhiteSpace check · 4aeb94a4
      jshin authored
      Use ICU to check ID_Start, ID_Continue and WhiteSpace even for BMP
      when V8_INTL_SUPPORT is on (which is default).
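
      For illustration, ICU exposes these properties via u_hasBinaryProperty
      (a minimal sketch, not necessarily the exact calls V8's wrappers use):

        #include <unicode/uchar.h>

        bool IsIdStart(UChar32 c)    { return u_hasBinaryProperty(c, UCHAR_ID_START); }
        bool IsIdContinue(UChar32 c) { return u_hasBinaryProperty(c, UCHAR_ID_CONTINUE); }
        bool IsWhiteSpace(UChar32 c) { return u_hasBinaryProperty(c, UCHAR_WHITE_SPACE); }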
      
      Change LineTerminator::Is() to check 4 code points from
      ES#sec-line-terminators instead of using tables and Lookup function.
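
      A minimal sketch of the direct check (a hypothetical helper, not V8's
      actual LineTerminator::Is signature); the four code points come from
      ES#sec-line-terminators:

        #include <cstdint>

        inline bool IsLineTerminator(uint32_t c) {
          return c == 0x000A ||  // LINE FEED
                 c == 0x000D ||  // CARRIAGE RETURN
                 c == 0x2028 ||  // LINE SEPARATOR
                 c == 0x2029;    // PARAGRAPH SEPARATOR
        }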
      
      Remove Lowercase::Is(). It's not used anywhere.
      
      Update webkit/{ToNumber,parseFloat}.js to have the correct expectation
      for U+180E and the corresponding expected files. This is a follow-up to
      an earlier change ( https://codereview.chromium.org/2720953003 ).
      
      CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_win_dbg,v8_mac_dbg;master.tryserver.chromium.android:android_arm64_dbg_recipe
      CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_linux_noi18n_rel_ng
      
      BUG=v8:5370,v8:5155
      TEST=unittests --gtest_filter=CharP*
      TEST=webkit: ToNumber, parseFloat
      TEST=test262: built-ins/Number/S9.3*, built-ins/parse{Int,Float}/S15*
      TEST=test262: language/white-space/mong*
      TEST=test262: built-ins/String/prototype/trim/u180e
      TEST=mjsunit: whitespaces
      
      Review-Url: https://codereview.chromium.org/2331303002
      Cr-Commit-Position: refs/heads/master@{#45957}
  13. 23 May, 2017 1 commit
    • [wasm] Also kBadChar is a valid utf8 character · 8e0daf78
      Andreas Haas authored
      The validation of utf8 strings in WebAssembly modules used the character
      kBadChar = 0xFFFD to indicate a validation error. However, this
      character can appear in a valid utf8 string. This CL fixes this problem
      by duplicating some of the code in {Utf8::CalculateValue} and inlining
      it directly into Utf8::Validate. Note that Utf8::Validate is used only
      for WebAssembly.
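
      To illustrate why the sentinel was ambiguous (a standalone example, not
      V8 code): U+FFFD itself has a perfectly valid encoding, so a check of the
      form "decoded value == kBadChar" cannot distinguish valid input from a
      decoding failure.

        #include <cstdint>
        #include <cstdio>

        int main() {
          // EF BF BD is the valid UTF-8 encoding of U+FFFD (kBadChar), so a
          // validator that reports errors by returning 0xFFFD would misreport
          // this well-formed input as invalid.
          const uint8_t replacement_char_utf8[] = {0xEF, 0xBF, 0xBD};
          std::printf("%02X %02X %02X encodes U+FFFD\n", replacement_char_utf8[0],
                      replacement_char_utf8[1], replacement_char_utf8[2]);
        }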
      
      Tests for this change are in the WebAssembly spec tests, which I will
      update in a separate CL.
      
      R=vogelheim@chromium.org
      
      Change-Id: I8697b9299f3e98a8eafdf193bff8bdff90efd7dc
      Reviewed-on: https://chromium-review.googlesource.com/509534
      Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org>
      Commit-Queue: Andreas Haas <ahaas@chromium.org>
      Cr-Commit-Position: refs/heads/master@{#45476}
  14. 22 May, 2017 1 commit
  15. 28 Feb, 2017 1 commit
    • [unibrow] remove mongolian vowel separator as white space. · a5dfa062
      yangguo authored
      Unibrow is currently at Unicode version 7.0.0, which does not
      include the Mongolian vowel separator (\u180E) as white space. In
      order to appease test262 at the time, however, we kept it as
      white space.
      
      Test262 has since been updated, and while this is not an update
      of unibrow, we are removing \u180E as white space here.
      
      R=jshin@chromium.org, littledan@chromium.org
      BUG=v8:5155
      
      Review-Url: https://codereview.chromium.org/2720953003
      Cr-Commit-Position: refs/heads/master@{#43485}
  16. 22 Nov, 2016 1 commit
  17. 16 Nov, 2016 1 commit
    • Return kBadChar for longest subpart of incomplete utf-8 character. · fd40ebb1
      vogelheim authored
      This brings the two utf-8 decoders (bulk + incremental) in line.
      Technically, either behaviour was correct, since the utf-8 spec
      demands incomplete utf-8 be handled, but does not specify how.
      Unicode recommends that "the maximal subpart at that offset
      should be replaced by a single U+FFFD," and with this change we
      consistently do that. More details + spec references in the bug.
      
      BUG=chromium:662822
      
      Review-Url: https://codereview.chromium.org/2493143003
      Cr-Commit-Position: refs/heads/master@{#41025}
  18. 11 Nov, 2016 1 commit
  19. 05 Oct, 2016 1 commit
  20. 16 Sep, 2016 1 commit
    • Rework scanner-character-streams. · 642d6d31
      vogelheim authored
      - Smaller, more consistent streams API (Advance, Back, pos, Seek); a rough
        sketch of this interface follows below
      - Remove implementations from the header, in favor of creation functions.
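
      A rough sketch of the shape of that interface (the method names come from
      this CL; the class name and signatures are assumptions, not V8's actual
      Utf16CharacterStream declaration):

        #include <cstddef>
        #include <cstdint>

        class CharacterStreamSketch {
         public:
          virtual ~CharacterStreamSketch() = default;
          // Returns the next UTF-16 code unit and advances, or a negative
          // value at the end of the input.
          virtual int32_t Advance() = 0;
          // Steps back over the most recently returned code unit.
          virtual void Back() = 0;
          // Current position, in code units from the start of the input.
          virtual size_t pos() const = 0;
          // Repositions the stream; may be slow for streaming UTF-8 sources.
          virtual void Seek(size_t new_pos) = 0;
        };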
      
      Observe:
      - Performance:
        - All Utf16CharacterStream methods have an inlinable V8_LIKELY fast path
          with a body of only a few instructions. I expect most calls to end up there.
        - There used to be performance problems with bookmarking, particularly
          with copying too much data on SetBookmark with UTF-8 streaming streams.
          All those copies are gone.
        - The old streaming streams implementation used to copy data even for
          2-byte input. It no longer does.
        - The only remaining 'slow' method is the Seek(.) slow case for utf-8
          streaming streams. I don't expect this to be called a lot; and even if,
          I expect it to be offset by the gains in the (vastly more frequent)
          calls to the other methods or the 'fast path'.
        - If it still bothers us, there are several ways to speed it up.
      - API & code cleanliness:
        - I want to remove the 'old' API in a follow-up CL, which should mostly
          delete code, or replace it 1:1.
        - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
          for streaming streams.
        - The "bookmark" is now always implemented (and mostly very fast), so we
          should be able to use it for more things.
      - Testing & correctness:
        - The unit tests now cover all stream implementations,
          and are pretty good at triggering all the edge cases.
        - Vastly more DCHECKs of the invariants.
      
      BUG=v8:4947
      
      Review-Url: https://codereview.chromium.org/2314663002
      Cr-Commit-Position: refs/heads/master@{#39464}
  21. 12 May, 2016 1 commit
    • [wasm] Add UTF-8 validation · f0523e30
      clemensh authored
      Names passed for imports and exports are checked during decoding,
      leading to errors if they are not valid UTF-8. Function names are not
      checked during decoding, but instead lead to undefined being returned at
      runtime if they are not valid UTF-8.
      
      We need to do these checks on the Wasm side, since the factory
      methods assume they receive valid UTF-8 strings.
      
      R=titzer@chromium.org, yangguo@chromium.org
      
      Review-Url: https://codereview.chromium.org/1967023004
      Cr-Commit-Position: refs/heads/master@{#36208}
  22. 03 Sep, 2015 1 commit
  23. 22 May, 2015 1 commit
  24. 05 Feb, 2015 1 commit
  25. 08 Oct, 2014 1 commit
  26. 04 Aug, 2014 1 commit
  27. 20 Jun, 2014 1 commit
  28. 03 Jun, 2014 1 commit
  29. 29 Apr, 2014 1 commit
  30. 10 Feb, 2014 1 commit
  31. 05 Jul, 2013 1 commit
  32. 03 Jan, 2013 1 commit
  33. 20 Dec, 2012 1 commit
  34. 12 Mar, 2012 1 commit
  35. 06 Mar, 2012 1 commit
  36. 16 Jan, 2012 1 commit
  37. 18 Mar, 2011 3 commits
  38. 03 Jan, 2011 1 commit