Commit cedec225 authored by Justin Ridgewell's avatar Justin Ridgewell Committed by Commit Bot

Implement DFA Unicode Decoder

This is a separation of the DFA Unicode Decoder from
https://chromium-review.googlesource.com/c/v8/v8/+/789560

I attempted to make the DFA's table a bit more explicit in this CL. Still, the
linter prevents me from letting me present the array as a "table" in source
code. For a better representation, please refer to
https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA

- - - - -

Now for a big copy-paste from 789560:

Essentially, reworks a standard FSM (imagine an
array of structs) and flattens it out into a single-dimension array.
Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
map all bytes into one of 12 character classes:

00. 0x00-0x7F
01. 0x80-0x8F (split from general continuation because this range is not
    valid after a 0xF0 leading byte)
02. 0x90-0x9F (split from general continuation because this range is not
    valid after a 0xE0 nor a 0xF4 leading byte)
03. 0xA0-0xBF (the rest of the continuation range)
04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
    includes 255 which we use as a known bad byte during hex-to-int
        decoding)
05. 0xC2-0xDF (leading bytes which require any continuation byte
    afterwards)
06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
    continuation byte after that)
07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which requires any continuation
    afterwards then any continuation byte after that)
08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
    continuation byte after that)
09. 0xF1-F3 (leading bytes which requires any continuation byte
    afterwards then any continuation byte then any continuation byte)
10. 0xF0 (leading bytes which requires a 0x90-0xBF afterwards then any
    continuation byte then any continuation byte)
11. 0xF4 (leading bytes which requires a 0x80-0x8F afterwards then any
    continuation byte then any continuation byte)

Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
needed to represent the transition state ("9, 10, 10, 10" vs.
"10, 9, 9, 9").

Using these 12 classes as "transitions", we can map from one state to
the next. Each state is defined as some multiple of 12, so that we're
always starting at the 0th column of each row of the FSM. From each
state, we add the transition and get a index of the new row the FSM is
entering.

If at any point we encounter a bad byte, the state + bad-byte-transition
is guaranteed to map us into the first row of the FSM (which contains no
valid exiting transitions).

The key differences from Björn's original (or his self-modified) DFA is
the "bad" state is now mapped to 0 (or the first row of the FSM) instead
of 12 (the second row). This saves ~50 bytes when gzipping, and also
speeds up determining if a string is properly encoded (see his sample
code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).

Finally, I've replace his ternary check with an array access, to make
the algorithm branchless. This places a requirement on the caller to 0
out the code point between successful decodings, which it could always
have done because it's already branching.

R=marja@google.com

Bug: 
Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
Reviewed-on: https://chromium-review.googlesource.com/805357
Commit-Queue: Justin Ridgewell <jridgewell@google.com>
Reviewed-by: 's avatarMarja Hölttä <marja@chromium.org>
Reviewed-by: 's avatarMathias Bynens <mathias@chromium.org>
Cr-Commit-Position: refs/heads/master@{#50012}
parent a91d043d
......@@ -2052,6 +2052,7 @@ v8_source_set("v8_base") {
"src/string-stream.h",
"src/strtod.cc",
"src/strtod.h",
"src/third_party/utf8-decoder/utf8-decoder.h",
"src/tracing/trace-event.cc",
"src/tracing/trace-event.h",
"src/tracing/traced-value.cc",
......
......@@ -203,7 +203,7 @@ class Utf8ExternalStreamingStream : public BufferedUtf16CharacterStream {
Utf8ExternalStreamingStream(
ScriptCompiler::ExternalSourceStream* source_stream,
RuntimeCallStats* stats)
: current_({0, {0, 0, unibrow::Utf8::Utf8IncrementalBuffer(0)}}),
: current_({0, {0, 0, 0, unibrow::Utf8::State::kAccept}}),
source_stream_(source_stream),
stats_(stats) {}
~Utf8ExternalStreamingStream() override {
......@@ -223,7 +223,8 @@ class Utf8ExternalStreamingStream : public BufferedUtf16CharacterStream {
struct StreamPosition {
size_t bytes;
size_t chars;
unibrow::Utf8::Utf8IncrementalBuffer incomplete_char;
uint32_t incomplete_char;
unibrow::Utf8::State state;
};
// Position contains a StreamPosition and the index of the chunk the position
......@@ -268,25 +269,25 @@ bool Utf8ExternalStreamingStream::SkipToPosition(size_t position) {
const Chunk& chunk = chunks_[current_.chunk_no];
DCHECK(current_.pos.bytes >= chunk.start.bytes);
unibrow::Utf8::Utf8IncrementalBuffer incomplete_char =
chunk.start.incomplete_char;
unibrow::Utf8::State state = chunk.start.state;
uint32_t incomplete_char = chunk.start.incomplete_char;
size_t it = current_.pos.bytes - chunk.start.bytes;
size_t chars = chunk.start.chars;
while (it < chunk.length && chars < position) {
unibrow::uchar t =
unibrow::Utf8::ValueOfIncremental(chunk.data[it], &incomplete_char);
unibrow::uchar t = unibrow::Utf8::ValueOfIncremental(
chunk.data[it], &it, &state, &incomplete_char);
if (t == kUtf8Bom && current_.pos.chars == 0) {
// BOM detected at beginning of the stream. Don't copy it.
} else if (t != unibrow::Utf8::kIncomplete) {
chars++;
if (t > unibrow::Utf16::kMaxNonSurrogateCharCode) chars++;
}
it++;
}
current_.pos.bytes += it;
current_.pos.chars = chars;
current_.pos.incomplete_char = incomplete_char;
current_.pos.state = state;
current_.chunk_no += (it == chunk.length);
return current_.pos.chars == position;
......@@ -304,31 +305,33 @@ void Utf8ExternalStreamingStream::FillBufferFromCurrentChunk() {
uint16_t* cursor = buffer_ + (buffer_end_ - buffer_start_);
DCHECK_EQ(cursor, buffer_end_);
unibrow::Utf8::State state = current_.pos.state;
uint32_t incomplete_char = current_.pos.incomplete_char;
// If the current chunk is the last (empty) chunk we'll have to process
// any left-over, partial characters.
if (chunk.length == 0) {
unibrow::uchar t =
unibrow::Utf8::ValueOfIncrementalFinish(&current_.pos.incomplete_char);
unibrow::uchar t = unibrow::Utf8::ValueOfIncrementalFinish(&state);
if (t != unibrow::Utf8::kBufferEmpty) {
DCHECK_LT(t, unibrow::Utf16::kMaxNonSurrogateCharCode);
DCHECK_EQ(t, unibrow::Utf8::kBadChar);
*cursor = static_cast<uc16>(t);
buffer_end_++;
current_.pos.chars++;
current_.pos.incomplete_char = 0;
current_.pos.state = state;
}
return;
}
unibrow::Utf8::Utf8IncrementalBuffer incomplete_char =
current_.pos.incomplete_char;
size_t it;
for (it = current_.pos.bytes - chunk.start.bytes;
it < chunk.length && cursor + 1 < buffer_start_ + kBufferSize; it++) {
unibrow::uchar t =
unibrow::Utf8::ValueOfIncremental(chunk.data[it], &incomplete_char);
if (t == unibrow::Utf8::kIncomplete) continue;
size_t it = current_.pos.bytes - chunk.start.bytes;
while (it < chunk.length && cursor + 1 < buffer_start_ + kBufferSize) {
unibrow::uchar t = unibrow::Utf8::ValueOfIncremental(
chunk.data[it], &it, &state, &incomplete_char);
if (V8_LIKELY(t < kUtf8Bom)) {
*(cursor++) = static_cast<uc16>(t); // The by most frequent case.
} else if (t == kUtf8Bom && current_.pos.bytes + it == 2) {
} else if (t == unibrow::Utf8::kIncomplete) {
continue;
} else if (t == kUtf8Bom && current_.pos.bytes + it == 3) {
// BOM detected at beginning of the stream. Don't copy it.
} else if (t <= unibrow::Utf16::kMaxNonSurrogateCharCode) {
*(cursor++) = static_cast<uc16>(t);
......@@ -341,6 +344,7 @@ void Utf8ExternalStreamingStream::FillBufferFromCurrentChunk() {
current_.pos.bytes = chunk.start.bytes + it;
current_.pos.chars += (cursor - buffer_end_);
current_.pos.incomplete_char = incomplete_char;
current_.pos.state = state;
current_.chunk_no += (it == chunk.length);
buffer_end_ = cursor;
......@@ -396,16 +400,15 @@ void Utf8ExternalStreamingStream::SearchPosition(size_t position) {
// checking whether the # bytes in a chunk are equal to the # chars, and if
// so avoid the expensive SkipToPosition.)
bool ascii_only_chunk =
chunks_[chunk_no].start.incomplete_char ==
unibrow::Utf8::Utf8IncrementalBuffer(0) &&
chunks_[chunk_no].start.incomplete_char == 0 &&
(chunks_[chunk_no + 1].start.bytes - chunks_[chunk_no].start.bytes) ==
(chunks_[chunk_no + 1].start.chars - chunks_[chunk_no].start.chars);
if (ascii_only_chunk) {
size_t skip = position - chunks_[chunk_no].start.chars;
current_ = {chunk_no,
{chunks_[chunk_no].start.bytes + skip,
chunks_[chunk_no].start.chars + skip,
unibrow::Utf8::Utf8IncrementalBuffer(0)}};
chunks_[chunk_no].start.chars + skip, 0,
unibrow::Utf8::State::kAccept}};
} else {
current_ = {chunk_no, chunks_[chunk_no].start};
SkipToPosition(position);
......
Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Name: DFA UTF-8 Decoder
Short Name: utf8-decoder
URL: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Version: 0
License: MIT
License File: NOT_SHIPPED
Security Critical: no
Description:
Decodes UTF-8 bytes using a fast and simple definite finite automata.
Local modifications:
- Rejection state has been mapped to row 0 (instead of row 1) of the DFA,
saving some 50 bytes and making the table easier to reason about.
- The transitions have been remapped to represent both a state transition and a
bit mask for the incoming byte.
- The caller must now zero out the code point buffer after successful or
unsuccessful state transitions.
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
// The remapped transition table is justified at
// https://docs.google.com/spreadsheets/d/1AZcQwuEL93HmNCljJWUwFMGqf7JAQ0puawZaUgP0E14
#include <stdint.h>
#ifndef __UTF8_DFA_DECODER_H
#define __UTF8_DFA_DECODER_H
namespace Utf8DfaDecoder {
enum State : uint8_t {
kReject = 0,
kAccept = 12,
kTwoByte = 24,
kThreeByte = 36,
kThreeByteLowMid = 48,
kFourByte = 60,
kFourByteLow = 72,
kThreeByteHigh = 84,
kFourByteMidHigh = 96,
};
static inline void Decode(uint8_t byte, State* state, uint32_t* buffer) {
// This first table maps bytes to character to a transition.
static constexpr uint8_t transitions[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 00-0F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 10-1F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 20-2F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 30-3F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 40-4F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 50-5F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 60-6F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 70-7F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 80-8F
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 90-9F
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // A0-AF
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // B0-BF
9, 9, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // C0-CF
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // D0-DF
10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, // E0-EF
11, 7, 7, 7, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, // F0-FF
};
// This second table maps a state to a new state when adding a transition.
// 00-7F
// | 80-8F
// | | 90-9F
// | | | A0-BF
// | | | | C2-DF
// | | | | | E1-EC, EE, EF
// | | | | | | ED
// | | | | | | | F1-F3
// | | | | | | | | F4
// | | | | | | | | | C0, C1, F5-FF
// | | | | | | | | | | E0
// | | | | | | | | | | | F0
static constexpr uint8_t states[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // REJECT = 0
12, 0, 0, 0, 24, 36, 48, 60, 72, 0, 84, 96, // ACCEPT = 12
0, 12, 12, 12, 0, 0, 0, 0, 0, 0, 0, 0, // 2-byte = 24
0, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, // 3-byte = 36
0, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 3-byte low/mid = 48
0, 36, 36, 36, 0, 0, 0, 0, 0, 0, 0, 0, // 4-byte = 60
0, 36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 4-byte low = 72
0, 0, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, // 3-byte high = 84
0, 0, 36, 36, 0, 0, 0, 0, 0, 0, 0, 0, // 4-byte mid/high = 96
};
DCHECK_NE(*state, State::kReject);
uint8_t type = transitions[byte];
*state = static_cast<State>(states[*state + type]);
*buffer = (*buffer << 6) | (byte & (0x7F >> (type >> 1)));
}
} // namespace Utf8DfaDecoder
#endif /* __UTF8_DFA_DECODER_H */
......@@ -113,8 +113,8 @@ unsigned Utf8::Encode(char* str,
uchar Utf8::ValueOf(const byte* bytes, size_t length, size_t* cursor) {
if (length <= 0) return kBadChar;
byte first = bytes[0];
// Characters between 0000 and 0007F are encoded as a single character
if (first <= kMaxOneByteChar) {
// Characters between 0000 and 007F are encoded as a single character
if (V8_LIKELY(first <= kMaxOneByteChar)) {
*cursor += 1;
return first;
}
......
This diff is collapsed.
......@@ -7,6 +7,7 @@
#include <sys/types.h>
#include "src/globals.h"
#include "src/third_party/utf8-decoder/utf8-decoder.h"
#include "src/utils.h"
/**
* \file
......@@ -129,6 +130,8 @@ class Utf16 {
class V8_EXPORT_PRIVATE Utf8 {
public:
using State = Utf8DfaDecoder::State;
static inline uchar Length(uchar chr, int previous);
static inline unsigned EncodeOneByte(char* out, uint8_t c);
static inline unsigned Encode(char* out,
......@@ -158,9 +161,9 @@ class V8_EXPORT_PRIVATE Utf8 {
static inline uchar ValueOf(const byte* str, size_t length, size_t* cursor);
typedef uint32_t Utf8IncrementalBuffer;
static uchar ValueOfIncremental(byte next_byte,
static uchar ValueOfIncremental(byte next_byte, size_t* cursor, State* state,
Utf8IncrementalBuffer* buffer);
static uchar ValueOfIncrementalFinish(Utf8IncrementalBuffer* buffer);
static uchar ValueOfIncrementalFinish(State* state);
// Excludes non-characters from the set of valid code points.
static inline bool IsValidCharacter(uchar c);
......
......@@ -1406,6 +1406,7 @@
'strtod.h',
'ic/stub-cache.cc',
'ic/stub-cache.h',
'third_party/utf8-decoder/utf8-decoder.h',
'tracing/trace-event.cc',
'tracing/trace-event.h',
'tracing/traced-value.cc',
......
......@@ -19,12 +19,16 @@ static int Ucs2CharLength(unibrow::uchar c) {
static int Utf8LengthHelper(const char* s) {
unibrow::Utf8::Utf8IncrementalBuffer buffer(unibrow::Utf8::kBufferEmpty);
unibrow::Utf8::State state = unibrow::Utf8::State::kAccept;
int length = 0;
for (; *s != '\0'; s++) {
unibrow::uchar tmp = unibrow::Utf8::ValueOfIncremental(*s, &buffer);
size_t i = 0;
while (s[i] != '\0') {
unibrow::uchar tmp =
unibrow::Utf8::ValueOfIncremental(s[i], &i, &state, &buffer);
length += Ucs2CharLength(tmp);
}
unibrow::uchar tmp = unibrow::Utf8::ValueOfIncrementalFinish(&buffer);
unibrow::uchar tmp = unibrow::Utf8::ValueOfIncrementalFinish(&state);
length += Ucs2CharLength(tmp);
return length;
}
......
......@@ -37,13 +37,15 @@ void DecodeNormally(const std::vector<byte>& bytes,
void DecodeIncrementally(const std::vector<byte>& bytes,
std::vector<unibrow::uchar>* output) {
unibrow::Utf8::Utf8IncrementalBuffer buffer = 0;
for (auto b : bytes) {
unibrow::uchar result = unibrow::Utf8::ValueOfIncremental(b, &buffer);
unibrow::Utf8::State state = unibrow::Utf8::State::kAccept;
for (size_t i = 0; i < bytes.size();) {
unibrow::uchar result =
unibrow::Utf8::ValueOfIncremental(bytes[i], &i, &state, &buffer);
if (result != unibrow::Utf8::kIncomplete) {
output->push_back(result);
}
}
unibrow::uchar result = unibrow::Utf8::ValueOfIncrementalFinish(&buffer);
unibrow::uchar result = unibrow::Utf8::ValueOfIncrementalFinish(&state);
if (result != unibrow::Utf8::kBufferEmpty) {
output->push_back(result);
}
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment