Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

From: Mark Davis (mark@macchiato.com)
Date: Sat Jun 16 2001 - 13:13:41 EDT


You are correct about the published definitions. As I recall, though, we
were referring to FSS-UTF as UTF-8 in the UTC meetings before it was changed
to account for UTF-16.
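
As a rough illustration of what "changed to account for UTF-16" amounts to,
here is a minimal Python sketch of an encoder that follows the Unicode 2.0
wording Ken quotes below: non-surrogate 16-bit code values take 1 to 3
bytes, and a surrogate pair takes 4. The function name is hypothetical, and
rejecting unpaired surrogates is my assumption (the amendment only says
their mappings are "undefined"); it is not taken from any of the
specifications.

```python
def utf8_from_utf16_units(units):
    """Encode a list of 16-bit code values as UTF-8, pairing surrogates."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")  # assumed policy
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")  # assumed policy
        else:
            cp = u
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            # Only reachable through a surrogate pair: 4 bytes.
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)
```

For example, utf8_from_utf16_units([0xD800, 0xDC00]) yields the 4-byte
sequence F0 90 80 80, whereas the earlier definition would simply have
turned each of the two values into its own 3-byte sequence.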

In any event, I don't know whether Oracle was involved in those discussions
or not, or whether they introduced their tag "UTF8" before or after the
definition was changed.

Mark

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <mark@macchiato.com>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Tuesday, June 12, 2001 12:44
Subject: FSS-UTF, UTF-2, UTF-8, and UTF-16

> Mark said:
>
> > UTF-8 was defined before UTF-16. At the time it was first defined, there
> > were no surrogates, so there was no special handling of the D800..DFFF
> > code points.
>
> Technically, the first statement is not true.
>
> UTF-2 and FSS-UTF *were* defined well before UTF-16. FSS-UTF was
> defined on the range 0..0x7FFFFFFF, i.e. full 1- to 6-byte sequences.
> FSS-UTF was published, for information only, as Appendix F, in
> The Unicode Standard, Version 1.1, pp. 27-28, in 1993 (and prior
> to that in an X/Open spec, after its invention (by Rob Pike?) at
> Bell Labs). At that time,
> Unicode was clearly a 16-bit standard only, so the only relevant
> part of FSS-UTF that applied to Unicode was the 1- to 3-byte
> sequences that transformed the range 0..0xFFFF.
>
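
To make the 1- to 6-byte range concrete, the following is a minimal Python
sketch of the FSS-UTF length rule and encoding. The 0..0x7FFFFFFF range and
the 1- to 3-byte coverage of 0..0xFFFF come from the paragraph above; the
remaining per-length boundaries are the usual ones for this scheme, and the
names fss_utf_length and fss_utf_encode are hypothetical. Note that it
gives D800..DFFF no special treatment, as described for FSS-UTF.

```python
# Largest value (inclusive) that fits in a sequence of 1, 2, ... 6 bytes.
FSS_UTF_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

def fss_utf_length(value):
    """Return how many bytes FSS-UTF needs for an integer value."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value outside 0..0x7FFFFFFF")
    for n, limit in enumerate(FSS_UTF_LIMITS, start=1):
        if value <= limit:
            return n

def fss_utf_encode(value):
    """Encode one value as an FSS-UTF byte sequence (no surrogate check)."""
    n = fss_utf_length(value)
    if n == 1:
        return bytes([value])
    # Lead byte: n high bits set, then a zero bit, then the top payload bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    trail = []
    for _ in range(n - 1):
        # Each trail byte is 10xxxxxx and carries 6 payload bits.
        trail.append(0x80 | (value & 0x3F))
        value >>= 6
    return bytes([lead_marker | value] + trail[::-1])
```

With this sketch, fss_utf_length(0xFFFF) is 3 and fss_utf_length(0x10000)
is 4, which is why only the 1- to 3-byte forms mattered for a 16-bit
Unicode.
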
> [The earliest reference I find to UTF-2 is some archived Unicode mail
> from Glenn Adams to Harald Alvestrand, dated Nov. 9, 1992:
>
> "Indeed, because the transformation method is in an
> informative annex, and, because in Seoul, provisions were made to allow
> for other transformation methods (by renaming this transform method to
> UTF-1), all of these objections disappear, i.e., if you don't like UTF-1,
> define UTF-2, .... In fact, AT&T Bell Labs has already defined a new
> transformation method which they are currently calling FSS-UTF (File
> System Safe UTF)." ]
>
> Mark's conclusion is true about FSS-UTF, which had no special handling
> of D800..DFFF code points, for either 10646 (which it was nominally
> aimed at) or Unicode, since at that time U+D800..U+DFFF were simply
> normal unassigned code points.
>
> However, UTF-8 (the nominal successor of FSS-UTF) and UTF-16 were
> formally approved *simultaneously* as Amendments 2 and 1, respectively,
> to 10646. The notices of DAM approval are SC2 N2664 (dated January 1996)
> for DAM1 (UTF-16) and SC2 N2665 (dated January 1996) for DAM2 (UTF-8).
> The corresponding first publication for UTF-8 in the Unicode Standard
> is Appendix A.2 UTF-8, page A-7, in The Unicode Standard, Version 2.0,
> also published in 1996.
>
> UTF-8 as defined in Amendment 2 to 10646-1:1993 was algorithmically
> identical to FSS-UTF, but that amendment is the first point at which
> we get the wording "Values of x in the range 0000 D800 .. 0000 DFFF
> are reserved for the UTF-16 form and do not occur in UCS-4. ... The
> mappings of these code positions in UTF-8 are undefined."
> The Unicode Standard, Version 2.0 description of UTF-8 includes
> the text, "Each code value (non-surrogates) is represented in
> UTF-8 by 1, 2, or 3 bytes, depending on the code value. Pairs of
> surrogates take 4 bytes."
>
> So at the 1996 point of simultaneous publication of UTF-8 and UTF-16
> in both 10646 and the Unicode Standard, D800..DFFF were no longer
> normal unassigned code points, but were "RC-Elements" or "surrogates",
> and *did* get special treatment in UTF-8.
>
> RFC 2279, "UTF-8", is later, dated January 1998.
>
> --Ken
>
> > > Which original definition?
> > >
> > > Misha
>


