Corrigendum #9

Markus Scherer
Thu Jun 12 03:37:49 CDT 2014

On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson wrote:

> I have something like a library that was written a long time ago (not by
> me) assuming that noncharacters were illegal in open interchange. Programs
> that use the library were guaranteed that they would not receive
> noncharacters in their input.  They thus are free to use any noncharacter
> internally as they wish.  Now that Corrigendum #9 has come out, I'm getting
> requests to update the library to not reject noncharacters.  The library
> itself does not use noncharacters.  If I (or someone else) make the
> requested change, it may silently cause security holes in those programs
> that were depending on it doing the rejection, and who upgrade to use the
> new version.

If your library makes an explicit promise to remove noncharacters, then it
should continue to do so.
However, if your library is understood to pass through any strings, except
for the advertised processing, then noncharacters should probably be

> I don't see anything in the FAQ that really addresses this situation.  I
> think there should be an answer that addresses code written before the
> Corrigendum, and that goes into detail about the security issues. My guess
> is that the UTC did not really consider the potential for security holes
> when making this Corrigendum.

There is nothing really new in the corrigendum. The UTC felt that some
implementers had misinterpreted inconsistent and misleading statements in
and around the standard, and clarified the situation.

Any process that requires certain characters or sequences to not occur in
the input must explicitly check for those, regardless of whether they are
noncharacters, private use characters, unassigned code points, control
codes, deprecated language tag characters, discouraged stateful formatting
controls, stacks of hundreds of diacritics, or whatever.
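
As an illustration of such an explicit check, here is a minimal Python sketch
for a process that cannot tolerate noncharacters in its input. The function
names are hypothetical and not taken from any library; the code point ranges
are the 66 noncharacters defined by the standard (U+FDD0..U+FDEF plus the last
two code points of each of the 17 planes).

```python
def is_noncharacter(cp):
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, or a code point whose low 16 bits are FFFE or FFFF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return the string offsets at which noncharacters occur."""
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


def require_no_noncharacters(text):
    """Hypothetical input gate: reject text that contains noncharacters."""
    offsets = find_noncharacters(text)
    if offsets:
        raise ValueError("noncharacter in input at offsets %r" % offsets)
    return text
```

A program that reserves noncharacters for internal use would call something
like require_no_noncharacters() at its own input boundary rather than relying
on a pass-through library to do it.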

In a sense, noncharacters are much like the old control codes. Some
terminals say "beep" when they see U+0007, or go into strange modes when
they see U+001B; on Windows, when you read a text file that contains
U+001A, it is interpreted as an end-of-file marker. If your process
depended on those things not happening, then you would have to strip those
control codes on input. But a pass-through-style library will be
universally expected not to do anything special with them.
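
A rough Python sketch of that division of labor, assuming a process that must
never see the three control codes mentioned above. The deny-list and the
function name are hypothetical; a real process would choose its own set.

```python
# Hypothetical deny-list: U+0007 (bell), U+001A (end-of-file marker on
# Windows) and U+001B (escape). Mapping a code point to None deletes it.
UNWANTED_CONTROLS = {0x0007: None, 0x001A: None, 0x001B: None}


def strip_unwanted_controls(text):
    """Remove the deny-listed control codes; pass everything else through."""
    return text.translate(UNWANTED_CONTROLS)
```

Everything not on the deny-list, including noncharacters and private use
characters, passes through unchanged, which is the behavior expected of a
pass-through-style library.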

> I agree that CLDR should be able to use noncharacters for internal
> processing, and that they should be able to be stored in files and edited.
>  But I believe that version control systems and editors have just as much
> right to use noncharacters for their internal purposes.

I disagree. If svn or git choked on noncharacters or control codes or
private use characters or unassigned code points etc., I would complain.
Likewise, I expect to be able to use plain text or programming editors
(gedit, kate, vi, emacs, Visual Studio) to handle files with such
characters just fine.

I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle
all of these.

> Is CLDR constructed so there is no potential for conflicts here?  That is,
> does it reserve certain noncharacters for its own use?

I believe that CLDR only uses noncharacters for special purposes in
collation. In CLDR data files, there are at most contraction mappings that
start with noncharacters for purposes of building alphabetic-index tables.
(And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.)
There is no mechanism to remove them from any input, but the worst thing
that would happen is that you get a sequence of code points to sort

> The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not
> realize that that was considered representable in any UTF.  Likewise -1.

No, and that's the point of using those. Integer values that are not code
points make for great sentinels in API functions, such as a next() iterator
returning -1 when there is no next character.
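
A small Python sketch of that pattern. The class name and the END constant
are made up for illustration and are not from any particular API.

```python
END = -1  # not a code point (valid range is 0..0x10FFFF), so no text can produce it


class CodePointIterator:
    """Iterate over the code points of a string, returning END when exhausted."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def next(self):
        if self._pos >= len(self._text):
            return END  # sentinel: there is no next character
        cp = ord(self._text[self._pos])
        self._pos += 1
        return cp


def collect_code_points(text):
    """Usage example: drain the iterator until the sentinel appears."""
    it = CodePointIterator(text)
    result = []
    cp = it.next()
    while cp != END:
        result.append(cp)
        cp = it.next()
    return result
```

Because -1 lies outside the code point range, the iterator can return every
code point, noncharacters included, without any of them being mistaken for
the end of the text.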
