    [regexp] Fix and unify non-unicode case-folding algorithms · 3fab9d05
    Iain Ireland authored
    Non-unicode, case-insensitive regexps (e.g. /foo/i, not /foo/iu) use a
    case-folding algorithm that doesn't quite match the Unicode
    definition. There are two places in irregexp that need to do
    case-folding. Prior to this patch, neither of them quite matched the
    spec (https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).
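
A minimal Python sketch of the non-unicode branch of Canonicalize may help make the spec steps concrete. It is an illustration only, not the code in this patch: the function name `canonicalize` is hypothetical, Python's `str.upper()` is assumed to be a close stand-in for `String.prototype.toUpperCase` (the two may differ slightly by Unicode version), and the input is assumed to be a single BMP character.

```python
def canonicalize(ch: str) -> str:
    """Sketch of non-unicode, ignoreCase Canonicalize for one BMP character.

    Hypothetical helper for illustration; str.upper() stands in for
    String.prototype.toUpperCase.
    """
    u = ch.upper()  # full uppercase mapping, e.g. 'ß' -> 'SS'

    # Step 3.e: if the uppercase form is not a single code unit, keep ch.
    if len(u) != 1 or ord(u) > 0xFFFF:
        return ch

    # Step 3.g: a non-ASCII character must not canonicalize to an ASCII one.
    if ord(ch) >= 128 and ord(u) < 128:
        return ch

    return u


if __name__ == "__main__":
    for ch in ["a", "k", "\u212a", "ß", "\u017f"]:
        print(f"U+{ord(ch):04X} -> U+{ord(canonicalize(ch)):04X}")
```

Under these assumptions, 'ß' canonicalizes to itself because its uppercase form is two code units, and U+017F LATIN SMALL LETTER LONG S canonicalizes to itself because its uppercase form is ASCII 'S'.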
    
    This patch implements the "Canonicalize" algorithm in
    src/regexp/special-case.h, and uses it in the relevant places. It
    replaces special-case logic around upper-casing / ASCII characters
    with the following approach:
    
    1. For most characters, calling UnicodeSet::closeOver on a set
       containing that character will produce the correct set of
       case-insensitive matches.
    
    2. For a small handful of characters (like the sharp S that prompted
       this change), UnicodeSet::closeOver will include some characters
       that should be omitted. For example, although closeOver('ß') =
       "ßẞ", uppercase('ß') is "SS", so step 3.e means that 'ß'
       canonicalizes to itself, and should not match 'ẞ'. In these cases,
       we can skip the closeOver entirely, because it will never add an
       equivalent character. These characters are in the IgnoreSet.
    
    3. For an even smaller handful of characters, UnicodeSet::closeOver
       will produce some characters that should be omitted, but also some
       characters that should be included. For example, closeOver('k') =
       "kKK" (lowercase k, uppercase K, U+212A KELVIN SIGN), but KELVIN
       SIGN should not match either of the other two (step 3.g). To handle
       this, we put such characters in the SpecialAddSet. In these cases,
       we closeOver the original character, but filter out the results
       that do not have the same canonical value.
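
The three cases above can be sketched in Python as "close over, then filter by canonical value". This is a rough illustration under stated assumptions, not the irregexp implementation: `close_over_stand_in` is a hypothetical substitute for ICU's UnicodeSet::closeOver that groups BMP characters by `str.casefold()` and ignores multi-character strings, `canonicalize` is the same hypothetical helper based on `str.upper()`, and the `classify` labels only mirror the roles the message gives to the IgnoreSet and SpecialAddSet.

```python
from collections import defaultdict


def canonicalize(ch: str) -> str:
    """Sketch of non-unicode Canonicalize (hypothetical helper)."""
    u = ch.upper()
    if len(u) != 1 or ord(u) > 0xFFFF:  # step 3.e
        return ch
    if ord(ch) >= 128 and ord(u) < 128:  # step 3.g
        return ch
    return u


# Stand-in for UnicodeSet::closeOver: bucket BMP characters by casefold().
# This is an assumption for illustration, not the ICU algorithm.
_BY_FOLD = defaultdict(set)
for cp in range(0x10000):
    if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
        continue
    c = chr(cp)
    _BY_FOLD[c.casefold()].add(c)


def close_over_stand_in(ch: str) -> set:
    """Single characters that are case variants of ch (approximation)."""
    return set(_BY_FOLD.get(ch.casefold(), {ch})) | {ch}


def case_insensitive_matches(ch: str) -> set:
    """Close over ch, then keep only results with the same canonical value."""
    target = canonicalize(ch)
    return {c for c in close_over_stand_in(ch) if canonicalize(c) == target}


def classify(ch: str) -> str:
    """Label ch by which of the three cases it falls into (hypothetical)."""
    closed = close_over_stand_in(ch)
    kept = case_insensitive_matches(ch)
    if kept == closed:
        return "plain"        # case 1: closeOver is already correct
    if kept == {ch}:
        return "ignore"       # case 2: closeOver never adds an equivalent
    return "special-add"      # case 3: closeOver results must be filtered


if __name__ == "__main__":
    for ch in ["a", "ß", "k"]:
        print(repr(ch), classify(ch), sorted(case_insensitive_matches(ch)))
```

With this stand-in, 'ß' lands in the "ignore" case because 'ẞ' has a different canonical value, while 'k' lands in the "special-add" case because uppercase K is kept and KELVIN SIGN is filtered out.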
    
    The computation of IgnoreSet and SpecialAddSet happens at build time,
    using the pre-existing gen-regexp-special-case.cc step.
    
    R=jgruber@chromium.org
    
    Bug: v8:10248
    Change-Id: I00d48b180c83bb8e645cc59eda57b01eab134f0b
    Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2072858
    Reviewed-by: Frank Tang <ftang@chromium.org>
    Reviewed-by: Jakob Gruber <jgruber@chromium.org>
    Commit-Queue: Jakob Gruber <jgruber@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#66641}
Name
build_overrides
custom_deps
docs
gni
include
infra
samples
src
test
testing
third_party
tools
.clang-format
.clang-tidy
.editorconfig
.flake8
.git-blame-ignore-revs
.gitattributes
.gitignore
.gn
.vpython
.ycm_extra_conf.py
AUTHORS
BUILD.gn
CODE_OF_CONDUCT.md
COMMON_OWNERS
DEPS
ENG_REVIEW_OWNERS
INFRA_OWNERS
INTL_OWNERS
LICENSE
LICENSE.fdlibm
LICENSE.strongtalk
LICENSE.v8
LICENSE.valgrind
MIPS_OWNERS
OWNERS
PPC_OWNERS
PRESUBMIT.py
README.md
S390_OWNERS
WATCHLISTS
codereview.settings