Corrigendum #9

Philippe Verdy verdy_p at
Mon Jun 2 17:55:31 CDT 2014

"reserved for CLDR" would be wrong in TUS: you have reached a borderline
where you are no longer handling plain text (a stream of scalar values
assigned to code points), but binary data via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound to the ICU implementation of CLDR for its own definitions and
syntax for its tailorings).

CLDR defines its own interface and protocol. It can reserve these code
points for itself, but not in TUS, and no other conforming plain-text
application is expected to honor these reservations: such applications can
**freely** mark them as errors, replace them, filter them out, or
interpret them differently for their own usage, using their own
specification, encapsulation mechanisms and specific **non-plain-text**
data types.
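
To illustrate the three options named above (mark in error, replace, filter
out), here is a minimal sketch in Python. The function names and the
"policy" parameter are hypothetical, chosen only for this example; they do
not come from TUS, CLDR or ICU. The only fact relied upon is the set of 66
noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def handle_noncharacters(text: str, policy: str = "replace") -> str:
    """Apply one of three hypothetical policies to noncharacters in text.

    "error"   -> raise ValueError on the first noncharacter
    "replace" -> substitute U+FFFD REPLACEMENT CHARACTER
    "filter"  -> drop the noncharacter silently
    """
    if policy not in ("error", "replace", "filter"):
        raise ValueError(f"unknown policy: {policy!r}")
    out = []
    for ch in text:
        if not is_noncharacter(ord(ch)):
            out.append(ch)
        elif policy == "error":
            raise ValueError(f"noncharacter U+{ord(ch):04X} in plain text")
        elif policy == "replace":
            out.append("\uFFFD")
        # policy == "filter": append nothing
    return "".join(out)
```

An application that chooses any of these policies for its own plain-text
handling is not bound by what CLDR or ICU do with the same code points
internally.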

CLDR data transmitted in binary form that embeds these code points is not
transporting plain text; it is still a binary datatype specific to this
application. CLDR data must remain isolated in its scope without forcing
other protocols or TUS to follow its practices.

Other applications may develop "gateway" interfaces to convert them to be
interoperable with ICU, but they are not required to do that. If they do,
they will follow the ICU specifications, not TUS, and this should not
influence their own way of handling what TUS describes as plain text.

To make it clear, it is preferable to just say in TUS that the behavior of
applications with non-characters is completely undefined and unpredictable
without an external specification, and that these entities should not even
be considered as encodable in any standard UTF (any of which can freely be
replaced by another one without causing any loss or modification of the
represented plain text). It should be possible to define other
(non-standard) conforming UTFs which are completely unable to represent
these non-characters (as well as any unpaired surrogate). A conforming UTF
just needs to be able to represent streams of scalar values in their full
standard range (even without knowing whether they are assigned or not, and
without knowing their character properties).
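
For reference, the categories involved can be written down in a few lines
of Python. This is only a sketch with hypothetical function names; it
reflects the current definitions, under which surrogates are excluded from
the scalar values while the 66 non-characters are still scalar values
(which is precisely the status the paragraph above proposes to restrict).

```python
def is_scalar_value(cp: int) -> bool:
    """Scalar values: U+0000..U+10FFFF minus the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count both sets over the whole code space (17 planes of 65536 code points).
scalar_count = sum(1 for cp in range(0x110000) if is_scalar_value(cp))
nonchar_count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))

# A hypothetical UTF that refused non-characters would cover this many values.
restricted_count = scalar_count - nonchar_count
```

No property lookup is needed for either test: both are pure range checks,
independent of whether a code point is assigned.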

You can/should even design CLDR to completely avoid the use of
non-characters: it's up to it to define an encapsulation/escaping mechanism
that clearly separates what is standard plain text in the content from what
is not and is used for specific purposes in CLDR or ICU implementations.
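
As an illustration of what such an escaping mechanism could look like, here
is a minimal sketch in Python. The `\u{...}` notation and the function names
are assumptions made up for this example; this is not the syntax CLDR or ICU
actually define. The point is only that the escaped form contains no
non-characters and can be reversed without loss.

```python
import re


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def escape(text: str) -> str:
    """Rewrite non-characters as \\u{HEX}; double any literal backslash."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif is_noncharacter(ord(ch)):
            out.append("\\u{%X}" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


# Matches either a doubled backslash or a \u{HEX} escape (hypothetical syntax).
_ESCAPE = re.compile(r"\\(?:\\|u\{([0-9A-Fa-f]+)\})")


def unescape(text: str) -> str:
    """Inverse of escape(): restore backslashes and escaped code points."""
    def repl(match: re.Match) -> str:
        hex_digits = match.group(1)
        return chr(int(hex_digits, 16)) if hex_digits else "\\"
    return _ESCAPE.sub(repl, text)
```

With such a scheme the transported data is ordinary plain text, and only the
CLDR-side parser needs to know what the escaped values mean.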

2014-06-03 0:07 GMT+02:00 Shawn Steele <Shawn.Steele at>:

>  Except that, particularly the max-weight ones, mean that developers can
> be expected to use this as sentinels in code using ICU, which would
> preclude their use for other things?
> Which makes them more like “reserved for use in CLDR” than “noncharacters”?
> -Shawn
> *From:* Unicode [mailto:unicode-bounces at] *On Behalf Of *Markus
> Scherer
> *Sent:* Monday, June 2, 2014 2:53 PM
> *To:* David Starner
> *Cc:* Unicode Mailing List
> *Subject:* Re: Corrigendum #9
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfilaes at>
> wrote:
>  I would especially discourage any web browser from handling
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
> markus