    Implement DFA Unicode Decoder · cedec225
    Justin Ridgewell authored
    This is a separation of the DFA Unicode Decoder from
    https://chromium-review.googlesource.com/c/v8/v8/+/789560
    
    I attempted to make the DFA's table a bit more explicit in this CL. Still, the
    linter prevents me from presenting the array as a "table" in source
    code. For a better representation, please refer to
    https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA
    
    - - - - -
    
    Now for a big copy-paste from 789560:
    
    Essentially, this reworks a standard FSM (imagine an
    array of structs) and flattens it out into a single-dimension array.
    Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
    http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
    map all bytes into one of 12 character classes:
    
    00. 0x00-0x7F
    01. 0x80-0x8F (split from general continuation because this range is not
        valid after a 0xF0 leading byte)
    02. 0x90-0x9F (split from general continuation because this range is not
        valid after a 0xE0 nor a 0xF4 leading byte)
    03. 0xA0-0xBF (the rest of the continuation range)
    04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes; notice this
        includes 255, which we use as a known bad byte during hex-to-int
        decoding)
    05. 0xC2-0xDF (leading bytes which require any continuation byte
        afterwards)
    06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
        continuation byte after that)
    07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which require any continuation
        byte afterwards then any continuation byte after that)
    08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
        continuation byte after that)
    09. 0xF1-0xF3 (leading bytes which require any continuation byte
        afterwards then any continuation byte then any continuation byte)
    10. 0xF0 (leading byte which requires a 0x90-0xBF afterwards then any
        continuation byte then any continuation byte)
    11. 0xF4 (leading byte which requires a 0x80-0x8F afterwards then any
        continuation byte then any continuation byte)
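
The class list above can be sketched as a small classifier. This is only an illustration: `ClassOf` is a hypothetical name, not a function from the CL, and a table-driven decoder would more likely look the class up in a 256-entry array than run a chain of comparisons.

```cpp
#include <cstdint>

// Hypothetical helper (not from the CL): returns the character class, 0-11,
// of a single byte, using the numbering from the list above.
inline uint8_t ClassOf(uint8_t byte) {
  if (byte <= 0x7F) return 0;   // ASCII
  if (byte <= 0x8F) return 1;   // continuation, 0x80-0x8F
  if (byte <= 0x9F) return 2;   // continuation, 0x90-0x9F
  if (byte <= 0xBF) return 3;   // continuation, 0xA0-0xBF
  if (byte <= 0xC1) return 4;   // invalid: 0xC0-0xC1
  if (byte <= 0xDF) return 5;   // two-byte lead, 0xC2-0xDF
  if (byte == 0xE0) return 6;   // three-byte lead, restricted second byte
  if (byte == 0xED) return 8;   // three-byte lead, restricted second byte
  if (byte <= 0xEF) return 7;   // three-byte lead, 0xE1-0xEC and 0xEE-0xEF
  if (byte == 0xF0) return 10;  // four-byte lead, restricted second byte
  if (byte <= 0xF3) return 9;   // four-byte lead, 0xF1-0xF3
  if (byte == 0xF4) return 11;  // four-byte lead, restricted second byte
  return 4;                     // invalid: 0xF5-0xFF
}
```

For example, `ClassOf(0xED)` yields 8 and `ClassOf(0xFF)` yields 4, the shared class for invalid bytes.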
    
    Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
    needed to represent the transition state ("9, 10, 10, 10" vs.
    "10, 9, 9, 9").
    
    Using these 12 classes as "transitions", we can map from one state to
    the next. Each state is defined as some multiple of 12, so that we're
    always starting at the 0th column of each row of the FSM. From each
    state, we add the transition and get an index of the new row the FSM is
    entering.
    
    If at any point we encounter a bad byte, the state + bad-byte-transition
    is guaranteed to map us into the first row of the FSM (which contains no
    valid exiting transitions).
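
A minimal sketch of the flattened layout follows, assuming nine rows of 12 columns. The state names, `ClassOf`, `kStates` and `IsValidUtf8` are all hypothetical and were written for this illustration; the table in the CL may be laid out differently. The first row is the "bad" row: every entry is 0, so once the FSM lands there it never leaves.

```cpp
#include <cstdint>
#include <string>

// Hypothetical state names. Each value is a multiple of 12, so it doubles as
// the index of column 0 of its row in kStates.
enum State : uint8_t {
  kReject = 0,      // sink for every malformed sequence
  kAccept = 12,     // between code points
  kOneMore = 24,    // one continuation byte left
  kTwoMore = 36,    // two continuation bytes left
  kAfterE0 = 48,    // next byte must be 0xA0-0xBF
  kAfterED = 60,    // next byte must be 0x80-0x9F
  kThreeMore = 72,  // three continuation bytes left
  kAfterF0 = 84,    // next byte must be 0x90-0xBF
  kAfterF4 = 96,    // next byte must be 0x80-0x8F
};

// Comparison-based stand-in for a 256-entry class table (assumption).
inline uint8_t ClassOf(uint8_t b) {
  if (b <= 0x7F) return 0;
  if (b <= 0x8F) return 1;
  if (b <= 0x9F) return 2;
  if (b <= 0xBF) return 3;
  if (b <= 0xC1) return 4;
  if (b <= 0xDF) return 5;
  if (b == 0xE0) return 6;
  if (b == 0xED) return 8;
  if (b <= 0xEF) return 7;
  if (b == 0xF0) return 10;
  if (b <= 0xF3) return 9;
  if (b == 0xF4) return 11;
  return 4;
}

// Flattened FSM: row = current state, column = character class.
static const uint8_t kStates[9 * 12] = {
    // 0   1   2   3   4   5   6   7   8   9  10  11
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // kReject
    12, 0,  0,  0,  0,  24, 48, 36, 60, 72, 84, 96,  // kAccept
    0,  12, 12, 12, 0,  0,  0,  0,  0,  0,  0,  0,   // kOneMore
    0,  24, 24, 24, 0,  0,  0,  0,  0,  0,  0,  0,   // kTwoMore
    0,  0,  0,  24, 0,  0,  0,  0,  0,  0,  0,  0,   // kAfterE0
    0,  24, 24, 0,  0,  0,  0,  0,  0,  0,  0,  0,   // kAfterED
    0,  36, 36, 36, 0,  0,  0,  0,  0,  0,  0,  0,   // kThreeMore
    0,  0,  36, 36, 0,  0,  0,  0,  0,  0,  0,  0,   // kAfterF0
    0,  36, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // kAfterF4
};

// Returns true if `bytes` is a sequence of complete, well-formed code points.
bool IsValidUtf8(const std::string& bytes) {
  uint8_t state = kAccept;
  for (char c : bytes) {
    // Row offset plus class column gives the offset of the next row.
    state = kStates[state + ClassOf(static_cast<uint8_t>(c))];
  }
  return state == kAccept;
}
```

With this table, `IsValidUtf8("\xE2\x82\xAC")` walks kAccept, kTwoMore, kOneMore, kAccept, while `"\xED\xA0\x80"` (an encoded surrogate) drops into kReject on the second byte and stays there.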
    
    The key difference from Björn's original (or his self-modified) DFA is
    that the "bad" state is now mapped to 0 (the first row of the FSM) instead
    of 12 (the second row). This saves ~50 bytes when gzipping, and also
    speeds up determining if a string is properly encoded (see his sample
    code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).
    
    Finally, I've replaced his ternary check with an array access, to make
    the algorithm branchless. This places a requirement on the caller to 0
    out the code point between successful decodings, which it could always
    have done because it's already branching.
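
The array access could look roughly like the sketch below, which is one plausible reading of the paragraph above and not the code from the CL. `kMasks`, `Decode`, `DecodeAll`, the state names and the table layout are assumptions. `ClassOf` is written with comparisons only to keep the example short; a 256-entry lookup would be needed for the classification itself to be branch-free.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum State : uint8_t { kReject = 0, kAccept = 12 };  // hypothetical names

// Comparison-based stand-in for a 256-entry class table (assumption).
inline uint8_t ClassOf(uint8_t b) {
  if (b <= 0x7F) return 0;
  if (b <= 0x8F) return 1;
  if (b <= 0x9F) return 2;
  if (b <= 0xBF) return 3;
  if (b <= 0xC1) return 4;
  if (b <= 0xDF) return 5;
  if (b == 0xE0) return 6;
  if (b == 0xED) return 8;
  if (b <= 0xEF) return 7;
  if (b == 0xF0) return 10;
  if (b <= 0xF3) return 9;
  if (b == 0xF4) return 11;
  return 4;
}

// Flattened FSM, 9 rows of 12 columns; row 0 is the reject row.
static const uint8_t kStates[9 * 12] = {
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // reject
    12, 0,  0,  0,  0,  24, 48, 36, 60, 72, 84, 96,  // accept
    0,  12, 12, 12, 0,  0,  0,  0,  0,  0,  0,  0,   // one more
    0,  24, 24, 24, 0,  0,  0,  0,  0,  0,  0,  0,   // two more
    0,  0,  0,  24, 0,  0,  0,  0,  0,  0,  0,  0,   // after 0xE0
    0,  24, 24, 0,  0,  0,  0,  0,  0,  0,  0,  0,   // after 0xED
    0,  36, 36, 36, 0,  0,  0,  0,  0,  0,  0,  0,   // three more
    0,  0,  36, 36, 0,  0,  0,  0,  0,  0,  0,  0,   // after 0xF0
    0,  36, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // after 0xF4
};

// Payload bits contributed by a byte of each class (hypothetical table).
static const uint8_t kMasks[12] = {0x7F, 0x3F, 0x3F, 0x3F, 0x00, 0x1F,
                                   0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07};

// One step: no ternary, the mask lookup selects the payload bits. Because the
// code point is always shifted and OR-ed, it must be 0 when a sequence starts.
inline void Decode(uint8_t byte, uint8_t* state, uint32_t* code_point) {
  const uint8_t type = ClassOf(byte);
  *state = kStates[*state + type];
  *code_point = (*code_point << 6) | (byte & kMasks[type]);
}

// Decodes `input` into `out`; returns false on malformed or truncated input.
bool DecodeAll(const std::string& input, std::vector<uint32_t>* out) {
  out->clear();
  uint8_t state = kAccept;
  uint32_t code_point = 0;
  for (char c : input) {
    Decode(static_cast<uint8_t>(c), &state, &code_point);
    if (state == kAccept) {
      out->push_back(code_point);
      code_point = 0;  // the caller zeroes it between successful decodings
    } else if (state == kReject) {
      return false;
    }
  }
  return state == kAccept;
}
```

The reset happens inside the branch that the caller already takes when a code point completes, which is the trade-off the paragraph above describes.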
    
    R=marja@google.com
    
    Bug: 
    Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
    Reviewed-on: https://chromium-review.googlesource.com/805357
    Commit-Queue: Justin Ridgewell <jridgewell@google.com>
    Reviewed-by: Marja Hölttä <marja@chromium.org>
    Reviewed-by: Mathias Bynens <mathias@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#50012}
unicode.cc 158 KB