Re: UTF-8

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Jun 12 2001 - 12:09:29 EDT


Bill Kurmey wrote:
> Will the Unicode version of UTF-8 be registered with IANA and, if so, what
> will be its "charset" designation?

I believe this question is based on a misunderstanding:

The "6-byte sequences" mentioned in this discussion were meant as "pairs of 3-byte sequences, each encoding one surrogate".
Using such a pair to encode a supplementary code point has always been illegal in Unicode and ISO 10646.

Contrast this with a single "6-byte sequence to encode a code point U-04000000..U-7fffffff". This is, strictly speaking:
- not allowed in Unicode, because Unicode disallows
  the use of code points >U-0010ffff
- not used in 10646, because an amendment currently under ballot
  specifies that code points >U-0010ffff will never be assigned
  (and 10646 has previously only reserved some such space for private use)

These two kinds of "6-byte sequences" are entirely different and should not be confused.
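
To make the distinction concrete, here is a minimal Python sketch. The function names (encode_rfc2279, surrogate_pair, pair_of_3_byte_sequences) are made up for illustration; the bit layout follows the 1..6 octet table in RFC 2279. It is not taken from any particular implementation.

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode one code point with the original 1..6 octet UTF-8 scheme."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point out of range for RFC 2279 UTF-8")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, lead-byte marker, number of trail bytes)
    forms = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, trail in forms:
        if cp < limit:
            out = [lead | (cp >> (6 * trail))]
            for i in range(trail - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable")


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogates."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def pair_of_3_byte_sequences(cp: int) -> bytes:
    """The illegal form: each surrogate encoded as its own 3-byte sequence."""
    hi, lo = surrogate_pair(cp)
    return encode_rfc2279(hi) + encode_rfc2279(lo)


# Kind 1: six bytes standing for ONE supplementary code point (U+10000).
kind1 = pair_of_3_byte_sequences(0x10000)

# The legal encoding of the same code point is a single 4-byte sequence.
legal = encode_rfc2279(0x10000)

# Kind 2: one true 6-byte sequence for a code point above U-03FFFFFF.
kind2 = encode_rfc2279(0x04000000)

print(kind1.hex(), legal.hex(), kind2.hex())
```

Both kind1 and kind2 are six bytes long, but kind1 starts with lead byte ED twice over, while kind2 starts with a single FC lead byte followed by five trail bytes.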

So, for all intents and purposes, the Unicode UTF-8 definition and the RFC/ISO UTF-8 definitions are technically somewhat different, but because they actually encode the same code point range, they result in the same set of valid byte sequences.
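
A quick way to see this, as a sketch only: the range boundaries below are those of the RFC 2279 table, and rfc2279_length is a hypothetical helper name. Once code points stop at U-0010ffff, the 5- and 6-octet forms are simply never reached. The comparison against Python's built-in encoder is just a sanity check on a few sample code points, not a proof.

```python
# Exclusive upper bound of each code point range in the RFC 2279 table,
# paired with the number of octets used for that range.
RANGES = [
    (0x80, 1),
    (0x800, 2),
    (0x10000, 3),
    (0x200000, 4),
    (0x4000000, 5),
    (0x80000000, 6),
]

UNICODE_MAX = 0x10FFFF


def rfc2279_length(cp: int) -> int:
    """Number of octets RFC 2279 uses for the given code point."""
    if cp < 0:
        raise ValueError("negative code point")
    for limit, octets in RANGES:
        if cp < limit:
            return octets
    raise ValueError("code point above U-7FFFFFFF")


# Sample code points at the edges of each range, surrogates excluded.
samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, UNICODE_MAX]

# The longest form needed for anything up to U-0010ffff.
longest = max(rfc2279_length(cp) for cp in samples)

# Lengths agree with Python's own UTF-8 encoder on these samples.
agrees = all(len(chr(cp).encode("utf-8")) == rfc2279_length(cp) for cp in samples)

print(longest, agrees)
```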

The only current difference open to discussion may be the encoding of single, unpaired surrogates.
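
The two possible treatments of an unpaired surrogate can be illustrated with Python's standard codec. Note the assumption: this codec postdates this discussion and follows a later, stricter reading, so it is shown only as an example of "reject" versus "encode as a 3-byte sequence", not as a statement about what either definition required at the time. The helper name try_strict is hypothetical.

```python
lone = "\ud800"  # a string holding a single, unpaired high surrogate


def try_strict(s: str):
    """Return the strict UTF-8 encoding, or None if the codec refuses."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Strict treatment: an unpaired surrogate is not encodable at all.
strict = try_strict(lone)

# Lenient treatment: the surrogate is encoded like any other
# code point in the 3-byte range, via the "surrogatepass" error handler.
lenient = lone.encode("utf-8", "surrogatepass")

print(strict, lenient.hex())
```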

> Currently, if an email client receives a message with "Content Type:"
> containing "charset=UTF-8" and accepts up to 6 octets for each scalar
> value, it would be considered "Unicode compliant." If it generated
> messages with "Content Type:" containing "charset=UTF-8" and complied with
> the IETF specification (RFC 2279) for generating UTF-8 8-bit values, it
> would not be considered "Unicode compliant."

I don't see which part of RFC 2279 would make such a client not Unicode compliant for any valid text.

markus


