Re: UTF-8 validation rules

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Sep 10 2001 - 15:48:06 EDT


Carl,

>
> \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.

In current parlance (see Unicode 3.1, UAX #27), these are
"noncharacters", and you must account for the fact that
U+1FFFE..U+1FFFF
U+2FFFE..U+2FFFF
...
U+10FFFE..U+10FFFF

all have the same status as noncharacters.

With Unicode 3.2 (in the works), the 32 additional code points
at U+FDD0..U+FDEF go from unallocated status to noncharacters
as well.
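
For illustration, the full set described above can be tested with a
few lines of Python. This is only a sketch: the function name
is_noncharacter is made up for this example, and treating
U+FDD0..U+FDEF as noncharacters assumes the Unicode 3.2 designation
mentioned above.

```python
def is_noncharacter(cp):
    """Return True if the code point cp is a noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # The last two code points of each of the 17 planes:
    # U+FFFE..U+FFFF, U+1FFFE..U+1FFFF, ..., U+10FFFE..U+10FFFF.
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    # The contiguous block of 32 code points in the BMP.
    return 0xFDD0 <= cp <= 0xFDEF


# Count the noncharacters across the whole code space:
# 2 per plane times 17 planes, plus the block of 32.
count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
```

The mask works because U+nFFFE and U+nFFFF differ only in the lowest
bit, so clearing that bit leaves 0xFFFE in the low 16 bits for both.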

UTF-8 (and UTF-16 and UTF-32) converters must allow the conversion
of noncharacter code points, but may then allow the detection of
their noncharacter status. Noncharacters should not appear in
open interchange of Unicode textual data, but can have internal
usage unspecified by the standard.

Detecting whether a code point is a noncharacter (allocated, but
not assigned to a character) or an ordinary unassigned code point
(not allocated) is conceptually distinct from validating the
UTF-8 conversion per se.
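
One way to keep the two steps apart, sketched in Python. The function
name find_noncharacters is hypothetical, and the sketch relies on
Python's strict "utf-8" codec for the validation step; that codec
rejects ill-formed sequences but converts noncharacters like any
other code point.

```python
def find_noncharacters(data):
    """Decode UTF-8 bytes, then report (index, code point) pairs
    for any noncharacters found in the decoded text."""
    # Step 1: UTF-8 validation and conversion. Ill-formed input raises
    # UnicodeDecodeError; noncharacters pass through unchanged.
    text = data.decode("utf-8")

    # Step 2: a separate pass over the code points, independent of
    # the encoding form the text arrived in.
    found = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
            found.append((index, cp))
    return found
```

With this split, the sequence \xEF\xBF\xBE from the original question
converts successfully to U+FFFE, and it is then up to the caller to
decide what to do with the reported noncharacter.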

--Ken


