Unicode Regular Expressions, Surrogate Points and UTF-8
richard.wordingham at ntlworld.com
Sat May 31 07:11:03 CDT 2014
On Sat, 31 May 2014 13:21:23 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> However CESU-8 can be detected by the initial encoding of another byte
> order mark U+1FFFE (which is a non-character that MUST be stripped
> once detected from the parsed stream of code points) However,
> documents starting by this non-characters are supposed to be
> non-interoperable by definition even though the presence of that
> special byte order mark would be very safe to secure CESU-8 and
> discriminate it from UTF-8.
Where is this tagging defined?
It is in general not true that non-characters must be stripped on
input. That would be highly inappropriate in a conversion program that
transformed between UTFs. Also, the collations defined in CLDR Version
23 file collation/zh.xml would be severely damaged if the
non-characters were stripped out. In version 24 and later the file
uses a different syntax and doesn't contain non-characters.
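
To illustrate the conversion point, here is a minimal Python sketch,
not a definitive implementation. It assumes CPython's built-in codecs,
which pass non-characters through unchanged. The helper names
cesu8_encode and transcode_utf8_to_utf16 are hypothetical, introduced
only for this example; Python has no standard CESU-8 codec. It shows
U+1FFFE surviving a UTF-8 to UTF-16 conversion, and how its CESU-8
bytes would differ from its UTF-8 bytes.

```python
# Illustrative sketch only. cesu8_encode and transcode_utf8_to_utf16 are
# hypothetical helpers written for this example, not standard library codecs.

NONCHAR = "\U0001FFFE"  # supplementary-plane non-character U+1FFFE


def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit, including each half
    of a surrogate pair, is written as its own UTF-8-style sequence."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets a lone surrogate code unit through the encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def transcode_utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes to big-endian UTF-16 without filtering anything."""
    return data.decode("utf-8").encode("utf-16-be")


utf8 = NONCHAR.encode("utf-8")
utf16 = transcode_utf8_to_utf16(utf8)
roundtrip = utf16.decode("utf-16-be")

print(utf8.hex(" "))                   # UTF-8 form: one four-byte sequence
print(utf16.hex(" "))                  # UTF-16 form: a surrogate pair
print(cesu8_encode(NONCHAR).hex(" "))  # CESU-8 form: two three-byte sequences
print(roundtrip == NONCHAR)            # the non-character survives conversion
```

A converter that stripped non-characters would make the final
comparison fail, since the output text would no longer match the input.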