Re: Corrigendum #9 from Markus Scherer on 2014-06-12 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 12 Jun 2014 01:37:49 -0700

On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson <public_at_khwilliamson.com>
wrote:

> I have something like a library that was written a long time ago (not by
> me) assuming that noncharacters were illegal in open interchange. Programs
> that use the library were guaranteed that they would not receive
> noncharacters in their input. They thus are free to use any noncharacter
> internally as they wish. Now that Corrigendum #9 has come out, I'm getting
> requests to update the library to not reject noncharacters. The library
> itself does not use noncharacters. If I (or someone else) makes the
> requested change, it may silently cause security holes in those programs
> that were depending on it doing the rejection, and who upgrade to use the
> new version.
>

If your library makes an explicit promise to remove noncharacters, then it
should continue to do so.
However, if your library is understood to pass through any strings, except
for the advertised processing, then noncharacters should probably be
preserved.

> I don't see anything in the FAQ that really addresses this situation. I
> think there should be an answer that addresses code written before the
> Corrigendum, and that goes into detail about the security issues. My guess
> is that the UTC did not really consider the potential for security holes
> when making this Corrigendum.
>

There is nothing really new in the corrigendum. The UTC felt that some
implementers had misinterpreted inconsistent and misleading statements in
and around the standard, and clarified the situation.

Any process that requires certain characters or sequences to not occur in
the input must explicitly check for those, regardless of whether they are
noncharacters, private use characters, unassigned code points, control
codes, deprecated language tag characters, discouraged stateful formatting
controls, stacks of hundreds of diacritics, or whatever.
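
A minimal sketch of such an explicit check, in Python. The function names
are hypothetical and not taken from any particular library; the set of
noncharacters is U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes (66 code points in total).

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a code point at all
    # U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF at the end of any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def reject_noncharacters(text: str) -> str:
    """Hypothetical input gate: raise if text contains a noncharacter."""
    for index, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise ValueError(f"noncharacter U+{ord(ch):04X} at index {index}")
    return text
```

A process that reserves noncharacters for internal use could call
something like reject_noncharacters() on its own input instead of relying
on an upstream library to filter them.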

In a sense, noncharacters are much like the old control codes. Some
terminals say "beep" when they see U+0007, or go into strange modes when
they see U+001B; on Windows, when you read a text file that contains
U+001A, it is interpreted as an end-of-file marker. If your process
depended on those things not happening, then you would have to strip those
control codes on input. But a pass-through-style library will be
universally expected not to do anything special with them.
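
For a process that does depend on control codes being absent, stripping
them on input might look roughly like the following Python sketch. The
function name and the choice to keep tab, line feed, and carriage return
are assumptions made for illustration, not a recommended policy.

```python
import unicodedata


def strip_controls(text: str, keep: str = "\t\n\r") -> str:
    """Drop control codes (general category Cc) except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )
```

With this, strip_controls("a\x07b\x1bc\x1ad\n") yields "abcd\n": U+0007,
U+001B, and U+001A are removed, and the line feed is kept.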

> I agree that CLDR should be able to use noncharacters for internal
> processing, and that they should be able to be stored in files and edited.
> But I believe that version control systems and editors have just as much
> right to use noncharacters for their internal purposes.

I disagree. If svn or git choked on noncharacters or control codes or
private use characters or unassigned code points etc., I would complain.
Likewise, I expect to be able to use plain text or programming editors
(gedit, kate, vi, emacs, Visual Studio) to handle files with such
characters just fine.

I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle
all of these.

> Is CLDR constructed so there is no potential for conflicts here? That is,
> does it reserve certain noncharacters for its own use?
>

I believe that CLDR only uses noncharacters for special purposes in
collation. In CLDR data files, there are at most contraction mappings that
start with noncharacters for purposes of building alphabetic-index tables.
(And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.)
There is no mechanism to remove them from any input, but the worst thing
that would happen is that you get a sequence of code points to sort
interestingly.

> The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
> realize that that was considered representable in any UTF. Likewise -1.
>

No, and that's the point of using those. Integer values that are not code
points make for great sentinels in API functions, such as a next() iterator
returning -1 when there is no next character.
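
A small Python sketch of that sentinel pattern follows. The class name
CodePointIterator and the constant DONE are hypothetical, chosen only for
illustration. Because -1 is not a code point, it cannot collide with any
value in the text, including a noncharacter such as U+FFFF.

```python
class CodePointIterator:
    """Hypothetical iterator whose next() returns DONE (-1) at the end."""

    DONE = -1  # not a code point, so it never appears as data

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0

    def next(self) -> int:
        """Return the next code point, or DONE when none is left."""
        if self._pos >= len(self._text):
            return self.DONE
        cp = ord(self._text[self._pos])
        self._pos += 1
        return cp
```

Iterating over "a\uffff" returns 0x61, then 0xFFFF as ordinary data, and
only then -1.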

markus

Received on Thu Jun 12 2014 - 03:39:31 CDT