Unicode Regular Expressions, Surrogate Points and UTF-8

Philippe Verdy verdy_p at wanadoo.fr
Mon Jun 2 06:44:04 CDT 2014

Your example would have been better explained by just saying that in Java,
the regexp represented in source code as "\\uD808\\uDF45" means matching
two successive 16-bit code units, and that "\\uD808" or "\\uDF45" alone
just matches a single 16-bit code unit.

The "\\uNNNN" regexp notation (in source code, equivalent to "\uNNNN" in a
string at runtime) does not necessarily designate a full code point.

This is unlike the "\\x{NNNN}" and "." regexps, which will necessarily
match a full code point in the target (even if it is an isolated surrogate).
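As a minimal sketch of this difference (the class name is illustrative;
the behavior is the one shown in Mark's quoted output, run with `java -ea`):

```java
public class CodePointMatch {
    public static void main(String[] args) {
        String pair = "\uD808\uDF45"; // U+12345 as a surrogate pair
        // "\\x{12345}" designates the full code point:
        assert pair.replaceAll("\\x{12345}", "*").equals("*");
        // "." also consumes the whole code point, not a single 16-bit unit:
        assert pair.replaceAll(".", "*").equals("*");
        // An isolated surrogate in the target is still matched by "." as one code point:
        assert "\uD808".replaceAll(".", "*").equals("*");
    }
}
```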

But there is no way in Java to represent a target string that can store
arbitrary sequences of code points if you use the String type (this is not
specific to Java but applies as well to any language or runtime library
handling streams of 16-bit code units, including C, C++, Python,
JavaScript, PHP...).

The problem is then not in the way you write regexps, but in the way the
target string is encoded: it is not technically possible with 16-bit
streams to represent arbitrary sequences of code points, only arbitrary
sequences of 16-bit code units (even if they are not valid UTF-16 text).
But there is no problem at all in processing valid UTF-16 streams.
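A short sketch of this distinction (class name is illustrative): a Java
String happily stores an unpaired surrogate, i.e. an arbitrary 16-bit
stream rather than a valid UTF-16 text:

```java
public class CodeUnitStream {
    public static void main(String[] args) {
        String lone = "\uD808";       // unpaired lead surrogate: a legal String value
        String pair = "\uD808\uDF45"; // valid surrogate pair encoding U+12345
        assert lone.length() == 1;
        assert Character.isHighSurrogate(lone.charAt(0)); // not valid UTF-16 text on its own
        assert pair.length() == 2;                         // two 16-bit code units...
        assert pair.codePointCount(0, pair.length()) == 1; // ...but a single code point
    }
}
```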

Your "lead", "trail" and "one+" are representable in Java as arbitrary
16-bit streams, but they do not represent valid Unicode texts. On the
contrary, all your "tests[]" strings are valid Unicode texts, but their
interpretations as regexps are not necessarily valid regexps.

Each time you use single backslashes in a Java source-code string, there is
no guarantee it will be a valid Unicode text, even though it will compile
without problem as a valid 16-bit stream (and the same is true in other
languages).

If you want to represent arbitrary sequences of code points in a target
text, you cannot use any UTF alone (it may be technically possible with
UTF-8 or UTF-32, but such sequences are also invalid for these standard
encodings) without using an escaping mechanism such as the double
backslashes used in the notation of regexps. This escaping mechanism is
then independent of the actual runtime encoding used to transport the
escaped streams within valid Unicode texts.
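Such an escaping mechanism can be sketched in a few lines (the `escape`
helper is hypothetical, not a standard API): any 16-bit stream, valid or
not, becomes plain ASCII that survives every UTF.

```java
public class UnitEscaper {
    /** Escapes every 16-bit code unit of s into its 6-character \\uXXXX form (pure ASCII). */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A lone surrogate is not valid Unicode text, but its escaped form is:
        assert escape("\uD808").equals("\\uD808");
        assert escape("\uD808\uDF45").equals("\\uD808\\uDF45");
    }
}
```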

In summary, arbitrary sequences of code points in a valid Unicode text
require an escaping mechanism on top of the actual text encoding used for
storage or transport. There are other ways to escape arbitrary streams into
valid texts, including the U+NNNNNN notation, Base64, hex or octal
representations of UTF-32, or Punycode, and many other techniques used to
embed binary objects (uuencode, PostScript streams). In HTTP a few of them
are supported as standard "transport syntaxes". Terminal protocols (like
the VT220 and related, or Videotex) have long used escape sequences (plus
controls like SI/SO encapsulation and isolated DLE escapes for transporting
8-bit data over a 7-bit stream).
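One of the techniques above, Base64 over fixed-width 32-bit units, can be
sketched like this (illustrative only, not a standard transport syntax):

```java
import java.nio.ByteBuffer;
import java.util.Base64;

public class Utf32Base64 {
    public static void main(String[] args) {
        // Arbitrary 32-bit values, including a lone surrogate, round-trip safely:
        int[] units = {0x12345, 0xD808};
        ByteBuffer bb = ByteBuffer.allocate(units.length * 4);
        for (int u : units) bb.putInt(u); // fixed-width "UTF-32-like" storage
        String ascii = Base64.getEncoder().encodeToString(bb.array());
        byte[] back = Base64.getDecoder().decode(ascii);
        assert back.length == units.length * 4;
        assert ByteBuffer.wrap(back).getInt() == 0x12345; // first unit restored
    }
}
```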

Technically, Java strings at runtime are not plain text: they are binary
objects (and this is also true of C/C++ standard strings, PHP strings, or
the content transported by an HTTP session or a terminal protocol, which
also defines its own escaping mechanism where needed). They are plain text
only if they are checked on input and the validity conditions are not
broken by some text transforms, like extraction of substrings at arbitrary
absolute positions, or error recovery with resynchronization after a
failure or missing data; such errors are likely to occur because we have no
guarantee that validity is preserved during the exchange by matching
preconditions and postconditions.

If you develop a general-purpose library in any language that can be reused
in arbitrary code, you cannot assume on input that all preconditions are
satisfied, so you need to check the input. And you also have to be careful
about the design of your library to make sure that it respects the
postconditions. Some library APIs are technically unsafe, notably those
extracting substrings, and blocking I/O using fixed-size buffers, such as
file I/O on filesystems that do not discriminate between text files and
binary files (text files would need buffers of variable length, broken only
at code point positions and not at arbitrary code unit positions).
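Such an input check can be sketched in a few lines (the `isWellFormed`
helper is hypothetical, not a standard API); note how a substring taken at
an arbitrary code unit position breaks the postcondition:

```java
public class Utf16Check {
    /** Returns true if s contains no unpaired surrogates (well-formed UTF-16). */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // a lead surrogate must be immediately followed by a trail surrogate
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the trail unit of the valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate without a preceding lead
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String pair = "\uD808\uDF45";
        assert isWellFormed(pair);
        assert !isWellFormed(pair.substring(0, 1)); // the substring broke the pair
    }
}
```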

As far as I know, there does not exist any filesystem that enforces code
point positions, unless it uses non-space-efficient encodings with code
units wider than 20 bits. Storage devices are optimized for code units
whose size is a power of 2 in bytes, so you would finally use only files
whose sizes in bytes are a multiple of 4, with all random-access file
positions also a multiple of 4 bytes.

You could also use 24-bit storage code units with blocks limited to sectors
of 256 bytes, with the extra byte used only as a filler or as a length
indicator for that sector (255 bytes would store 85 arbitrary code units of
24 bits). But you will still need to check the value range of these code
units if you want to restrict them to the U+0000..U+10FFFF code point
space, unless your application code handles all of the extra code units
like non-character code points.
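The arithmetic of that hypothetical sector layout is simple, and a range
check per stored unit remains necessary:

```java
public class Sector24 {
    public static void main(String[] args) {
        assert 255 / 3 == 85;      // 85 three-byte code units per 255 payload bytes
        assert 255 % 3 == 0;       // no wasted payload bytes

        // A 24-bit unit spans 0..0xFFFFFF, but only 0..0x10FFFF are code points:
        int unit = 0x20FF45;       // hypothetical stored value, out of range
        boolean inCodePointSpace = unit >= 0 && unit <= 0x10FFFF;
        assert !inCodePointSpace;  // must be rejected or treated like a non-character
    }
}
```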

However, the filesystem may perform this check when writing text files, so
that it could mark files that are valid Unicode strings by updating some
metadata (that metadata could be stored in the spare byte of the first
256-byte sector, or in a separate indexing database of compatible files).
You could do that also for in-memory temporary buffers by keeping this info
(this would allow discriminating very early those streams that are not
plain text, without having to process them up to the end).

A relational database could also perform this check when creating indexes
on table keys, so that it will know that it can only return valid text for
any subselection in a table. In all cases, this will still be more
efficient for small storage than using any transport syntax or escaping;
but generally, for moderate and large volumes, the transport syntax or
escaping mechanism often wins in terms of performance by minimizing the
volume of I/O, notably if these I/O operations are very costly compared to
data in working memory or even in CPU data caches (but only if this data is
very frequently reaccessed in that cache).

However, if your I/O is very slow compared to the CPU and the data volume
is sufficiently large, it is always better to use UTF-32 in memory but
store that data in a compressed stream (you can safely use a generic binary
compressor, which will generally work better with UTF-32 than with UTF-8 or
UTF-16 for moderate and large volumes).
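A sketch of that approach with the JDK's built-in deflate (the text and the
class name are illustrative; the exact compression ratio will vary):

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class Utf32Deflate {
    public static void main(String[] args) {
        // Expand text to fixed-width 32-bit units in memory...
        String text = "cuneiform \uD808\uDF45 ".repeat(200);
        int[] cps = text.codePoints().toArray();
        ByteBuffer bb = ByteBuffer.allocate(cps.length * 4);
        for (int cp : cps) bb.putInt(cp);

        // ...then compress the UTF-32 bytes for storage or transport.
        Deflater deflater = new Deflater();
        deflater.setInput(bb.array());
        deflater.finish();
        byte[] out = new byte[bb.capacity()];
        int compressedSize = deflater.deflate(out);
        deflater.end();
        assert compressedSize > 0 && compressedSize < bb.capacity();
    }
}
```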

Simple algorithms like deflate, or even basic Huffman encoding, will
deliver excellent throughput with a very modest CPU cost compared to the
huge I/O costs that such compression saves. And in that case, even a range
check on input will have insignificant cost, thanks to branch prediction in
your code using the fast pipelined path only for valid texts and the slower
non-pipelined path only for exceptions and error handling (most processors
and CPU caches now use excellent branch predictors, even if compilers can
help them).

In summary, range checking should no longer be only a debugging option in
code: even for production code, and even for internal libraries, its cost
rapidly becomes insignificant for large data volumes. Just design your
algorithms to minimize the number of state variables and minimize table
lookups in order to improve data locality, rather than minimizing local
data sizes for just one or a few code points or code units: select runtime
code units that fit in a single CPU register (almost all processors today
have registers at least 32 bits wide, so UTF-32 is not a problem for local
processing in native code).

2014-06-02 11:29 GMT+02:00 Mark Davis ☕️ <mark at macchiato.com>:

> > \uD808\uDF45 specifies a sequence of two codepoints.
> ​That is simply incorrect.​
> In Java (and similar environments), \uXXXX means a char (a UTF16 code
> unit), not a code point. Here is the difference. If you are not used to
> Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x
> with the replacement y in string. Backslashes in literals need escaping, so
> \x needs to be written in literals as \\x.
>     String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45",
> "«.»"};
>     String target =
>      "one: «\uD808\uDF45»\t\t" +
>     "two: «\uD808\uDF45\uD808\uDF45»\t\t" +
>     "lead: «\uD808»\t\t" +
>     "trail: «\uDF45»\t\t" +
>     "one+: «\uD808\uDF45\uD808»";
>     System.out.println("pattern" + "\t→\t" + target + "\n");
>     for (String test : tests) {
>       System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
>     }
> *​Output:*
> pattern → one: «𒍅» two: «𒍅𒍅» lead: «?» trail: «?» one+: «𒍅?»
> \x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
> \uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
> 𒍅 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
> «.» → one: §︎ two: «𒍅𒍅» lead: §︎ trail: §︎ one+: «𒍅?»
> The target has various combinations of code units, to see what happens.
> Notice that Java treats a pair of lead+trail as a single code point for
> matching (eg .), but also an isolated surrogate char as a single code point
> (last line of output). Note that Java's regex in addition allows \x{hex}
> for specifying a code point explicitly. It also has the syntax \uXXXX (in a
> literal the \ needs escaping) to specify a code unit; that is slightly
> different than the Java preprocessing. Thus the first two are equivalent,
> and replace "{" by "x". The last two are also equivalent—and fail—because a
> single "{" is a broken regex pattern.
>     System.out.println("{".replaceAll("\\u007B", "x"));
>     System.out.println("{".replaceAll("\\x{7B}", "x"));
>     System.out.println("{".replaceAll("\u007B", "x"));
>     System.out.println("{".replaceAll("{", "x"));
> Mark <https://google.com/+MarkDavis>
>  *— Il meglio è l’inimico del bene —*
> On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>> On Sun, 1 Jun 2014 08:58:26 -0700
>> Markus Scherer <markus.icu at gmail.com> wrote:
>> > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
>> > supplementary code point, but as long as you have a surrogate pair,
>> > it is treated as a code point in APIs that support them.
>> Wasn't obvious that in the following paragraph \uD808\uDF45 was a
>> pattern?
>> "Bear in mind that a pattern \uD808 shall not match anything in a
>> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
>> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
>> string and before Unicode 5.2 could readily be taken to occur in an
>> ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
>> expression engine, the codepoint sequence <U+D808, U+DF45> cannot
>> occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
>> DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
>> KI>."
>> (It might have been clearer to you if I'd said '8-bit' and '16-bit'
>> instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
>> 16-bit encoding of arbitrary *codepoint* sequences.)
>> Richard.
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
