O'Leary, Sean (NJ) wrote:
> Are these 5 & 6 bytes encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?
They are valid according to the ISO 10646 view of UTF-8, but not
according to the Unicode view. Unicode strictly limits the
(32-bit) codepoint values to the range 0 to 0x10FFFF. ISO 10646
extends this range to 0x7FFFFFFF. However, none of these characters
will ever be used, except conceivably as private-zone characters.
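To make the two limits concrete, here is a minimal sketch of how many bytes a codepoint needs under the ISO 10646 form of UTF-8. The names utf8_len_iso and valid_in_unicode are hypothetical helpers made up for this illustration, and the sketch ignores surrogates and other special cases:

```python
# Illustrative sketch only: byte length of a codepoint under the
# ISO 10646 definition of UTF-8, which allows up to 6 bytes.
UNICODE_MAX = 0x10FFFF    # upper limit in the Unicode view
ISO_MAX = 0x7FFFFFFF      # upper limit in the ISO 10646 view


def utf8_len_iso(cp: int) -> int:
    """Return the number of bytes needed to encode cp (1 to 6)."""
    if not 0 <= cp <= ISO_MAX:
        raise ValueError(f"codepoint out of ISO 10646 range: {cp:#x}")
    # Each limit is the first codepoint that no longer fits in `length` bytes.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000)
    for length, limit in enumerate(limits, start=1):
        if cp < limit:
            return length
    return 6


def valid_in_unicode(cp: int) -> bool:
    """True if cp lies in the range Unicode allows (0 to 0x10FFFF)."""
    return 0 <= cp <= UNICODE_MAX
```

Since utf8_len_iso(0x10FFFF) is 4, everything Unicode allows fits in at most 4 bytes; the 5-byte and 6-byte forms only arise for codepoints that valid_in_unicode rejects.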
For the record, the private zones look like this:
0000E000-0000F8FF (Unicode and ISO, 6400 codepoints)
000F0000-0010FFFF (Unicode and ISO, 131,072 codepoints)
00E00000-00FFFFFF (ISO but not Unicode, 2,097,152 codepoints)
60000000-7FFFFFFF (ISO but not Unicode, 536,870,912 codepoints)
So unless you expect to see more than 137,472 private-zone characters,
you can ignore 5-byte and 6-byte UTF-8.
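The arithmetic behind these figures can be checked with a short sketch. It assumes each zone is counted as a plain inclusive range (end - start + 1), and the names PRIVATE_ZONES and zone_size are made up for this illustration:

```python
# Illustrative sketch only: sizes of the private zones listed above,
# counted as inclusive ranges.
PRIVATE_ZONES = [
    # (start, end, available in Unicode as well as ISO 10646)
    (0x0000E000, 0x0000F8FF, True),
    (0x000F0000, 0x0010FFFF, True),
    (0x00E00000, 0x00FFFFFF, False),
    (0x60000000, 0x7FFFFFFF, False),
]


def zone_size(start: int, end: int) -> int:
    """Number of codepoints in the inclusive range start..end."""
    return end - start + 1


# Private-zone codepoints reachable without 5-byte or 6-byte UTF-8.
unicode_total = sum(zone_size(s, e) for s, e, in_unicode in PRIVATE_ZONES
                    if in_unicode)

for start, end, in_unicode in PRIVATE_ZONES:
    label = "Unicode and ISO" if in_unicode else "ISO but not Unicode"
    print(f"{start:08X}-{end:08X}  {zone_size(start, end):>11,}  {label}")
print(f"Total within Unicode range: {unicode_total:,}")
```

The first two zones add up to 6,400 + 131,072 = 137,472, which is where the threshold above comes from.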
-- 
John Cowan   http://www.ccil.org/~cowan   cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
        -- Coleridge / Politzer
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT