Re: 5 & 6 byte UTF-8 encodings?

From: John Cowan (cowan@locke.ccil.org)
Date: Wed Aug 18 1999 - 10:49:00 EDT


O'Leary, Sean (NJ) wrote:

> Are these 5 & 6 byte encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?

They are valid according to the ISO 10646 view of UTF-8, but not
according to the Unicode view. Unicode strictly limits the
(32-bit) codepoint values to the range 0 to 0x10FFFF. ISO 10646
extends this range to 0x7FFFFFFF. However, none of these characters
will ever be used, except conceivably as private-zone characters.
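
To make the two views concrete, here is a minimal sketch in Python of how
the byte length of a codepoint falls out under the original six-length
UTF-8 scheme, next to a check for the Unicode limit. The function names
utf8_length and in_unicode_range are made up for this illustration; they
are not part of any standard or library.

```python
def utf8_length(cp: int) -> int:
    """Byte length of cp under the original 1-to-6-byte UTF-8 scheme
    (the ISO 10646 view). Hypothetical helper, for illustration only."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("codepoint outside the ISO 10646 range")
    # Largest codepoint representable in 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if cp <= limit:
            return length
    raise AssertionError("unreachable: range was checked above")


def in_unicode_range(cp: int) -> bool:
    """True if cp lies in the range Unicode allows, 0 to 0x10FFFF."""
    return 0 <= cp <= 0x10FFFF


for cp in (0x41, 0x10FFFF, 0x110000, 0xE00000, 0x60000000):
    print(f"{cp:08X}: {utf8_length(cp)} bytes, "
          f"within Unicode range: {in_unicode_range(cp)}")
```

Everything Unicode allows fits in at most 4 bytes; the 5-byte and 6-byte
forms only come into play above 0x1FFFFF.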

For the record, the private zones look like this:

        0000E000-0000F8FF (Unicode and ISO, 6400 codepoints)
        000F0000-0010FFFF (Unicode and ISO, 131,072 codepoints)
        00E00000-00FFFFFF (ISO but not Unicode, 2,097,152 codepoints)
        60000000-7FFFFFFF (ISO but not Unicode, 536,870,912 codepoints)

So unless you expect to see more than 137,472 private-zone characters,
you can ignore 5-byte and 6-byte UTF-8.
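
The arithmetic behind these figures can be checked with a short Python
sketch. It assumes each zone is an inclusive range, so its size is
last - first + 1. The names PRIVATE_ZONES and zone_size are invented for
this example.

```python
# Private zones as listed above: (first, last, recognized by Unicode?)
PRIVATE_ZONES = [
    (0x0000E000, 0x0000F8FF, True),
    (0x000F0000, 0x0010FFFF, True),
    (0x00E00000, 0x00FFFFFF, False),
    (0x60000000, 0x7FFFFFFF, False),
]


def zone_size(first: int, last: int) -> int:
    """Number of codepoints in the inclusive range first..last."""
    return last - first + 1


# Total private-zone codepoints available within the Unicode range.
unicode_total = sum(
    zone_size(first, last)
    for first, last, in_unicode in PRIVATE_ZONES
    if in_unicode
)

for first, last, in_unicode in PRIVATE_ZONES:
    view = "Unicode and ISO" if in_unicode else "ISO but not Unicode"
    print(f"{first:08X}-{last:08X}: {zone_size(first, last):,} ({view})")
print(f"Private-zone codepoints within Unicode: {unicode_total:,}")
```

The two zones that both standards share add up to 6400 + 131,072 =
137,472, which is where the threshold above comes from.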

-- 
	John Cowan	http://www.ccil.org/~cowan	cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


