FSS-UTF, UTF-2, UTF-8, and UTF-16

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 12 2001 - 15:44:39 EDT


Mark said:

> UTF-8 was defined before UTF-16. At the time it was first defined, there
> were no surrogates, so there was no special handling of the D800..DFFF code
> points.

Technically, the first statement is not true.

UTF-2 and FSS-UTF *were* defined well before UTF-16. FSS-UTF was
defined on the range 0..0x7FFFFFFF, i.e. full 1- to 6-byte sequences.
FSS-UTF was published, for information only, as Appendix F, in
The Unicode Standard, Version 1.1, pp. 27-28, in 1993 (and prior
to that in an X/Open spec, after its invention (by Rob Pike?) at
Bell Labs). At that time, Unicode was clearly a 16-bit standard only,
so the only part of FSS-UTF relevant to Unicode was the set of 1- to
3-byte sequences that transformed the range 0..0xFFFF.
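
To make the 1- to 6-byte ranges concrete, here is a minimal Python
sketch of an FSS-UTF-style encoder. It assumes the familiar bit layout
(lead byte carrying the length marker, trailing bytes of the form
10xxxxxx); the function name fss_utf_encode is illustrative only and
does not come from any specification. Note that it does nothing
special for D800..DFFF.

```python
def fss_utf_encode(value: int) -> bytes:
    """Encode one integer in 0..0x7FFFFFFF as a 1- to 6-byte sequence.

    Illustrative sketch only: no special handling of D800..DFFF.
    """
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside 0..0x7FFFFFFF")
    if value < 0x80:
        # 1-byte form: 0xxxxxxx
        return bytes([value])

    # (upper limit of the range, lead-byte marker) for the 2- to 6-byte forms
    forms = [
        (0x7FF, 0xC0),       # 110xxxxx + 1 trailing byte
        (0xFFFF, 0xE0),      # 1110xxxx + 2 trailing bytes
        (0x1FFFFF, 0xF0),    # 11110xxx + 3 trailing bytes
        (0x3FFFFFF, 0xF8),   # 111110xx + 4 trailing bytes
        (0x7FFFFFFF, 0xFC),  # 1111110x + 5 trailing bytes
    ]
    for n_trail, (limit, lead) in enumerate(forms, start=1):
        if value <= limit:
            trail = []
            for _ in range(n_trail):
                # Peel off the low 6 bits into a 10xxxxxx trailing byte.
                trail.append(0x80 | (value & 0x3F))
                value >>= 6
            # Remaining high bits go into the lead byte.
            return bytes([lead | value] + trail[::-1])
    raise AssertionError("unreachable")


print(fss_utf_encode(0xFFFF).hex())       # top of the 3-byte range
print(fss_utf_encode(0x7FFFFFFF).hex())   # top of the 6-byte range
print(fss_utf_encode(0xD800).hex())       # treated like any other value
```

Everything in 0..0xFFFF comes out in at most three bytes, which is the
only part that mattered for a 16-bit Unicode.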

[The earliest reference I find to UTF-2 is some archived Unicode mail
from Glenn Adams to Harald Alvestrand, dated Nov. 9, 1992:

"Indeed, because the transformation method is in an
informative annex, and, because in Seoul, provisions were made to allow
for other transformation methods (by renaming this transform method to UTF-1),
all of these objections disappear, i.e., if you don't like UTF-1, define
UTF-2, .... In fact, AT&T Bell Labs has already defined a new transformation
method which they are currently calling FSS-UTF (File System Safe UTF)." ]

Mark's conclusion is true about FSS-UTF, which had no special handling
of D800..DFFF code points, for either 10646 (which it was nominally
aimed at) or Unicode, since at that time U+D800..U+DFFF were simply
normal unassigned code points.

However, UTF-8 (the nominal successor of FSS-UTF) and UTF-16 were
formally approved *simultaneously* as Amendments 2 and 1, respectively,
to 10646. The notices of DAM approval are SC2 N2664 (dated January 1996)
for DAM1 (UTF-16) and SC2 N2665 (dated January 1996) for DAM2 (UTF-8).
The corresponding first publication for UTF-8 in the Unicode Standard
is Appendix A.2 UTF-8, page A-7, in The Unicode Standard, Version 2.0,
also published in 1996.

UTF-8 as defined in Amendment 2 to 10646-1:1993 was algorithmically
identical to FSS-UTF, but that amendment is the first point at which
we get the wording "Values of x in the range 0000 D800 .. 0000 DFFF
are reserved for the UTF-16 form and do not occur in UCS-4. ... The
mappings of these code positions in UTF-8 are undefined."
The Unicode Standard, Version 2.0 description of UTF-8 includes
the text, "Each code value (non-surrogates) is represented in
UTF-8 by 1, 2, or 3 bytes, depending on the code value. Pairs of
surrogates take 4 bytes."

So at the 1996 point of simultaneous publication of UTF-8 and UTF-16
in both 10646 and the Unicode Standard, D800..DFFF were no longer
normal unassigned code points, but were "RC-Elements" or "surrogates",
and *did* get special treatment in UTF-8.
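
As a rough illustration of "pairs of surrogates take 4 bytes", the
Python sketch below combines a high and a low surrogate into a single
value using the usual UTF-16 pairing arithmetic and then encodes that
value as UTF-8. The helper name combine_surrogates is hypothetical.
The last step relies on the behavior of Python 3's built-in "utf-8"
codec, which follows later and stricter definitions than the 1996
texts: it rejects a lone surrogate outright rather than leaving the
mapping "undefined".

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point value."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("high surrogate must be in D800..DBFF")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("low surrogate must be in DC00..DFFF")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The first surrogate pair maps to the first value beyond 0xFFFF.
value = combine_surrogates(0xD800, 0xDC00)
encoded = chr(value).encode("utf-8")
print(hex(value), encoded.hex(), len(encoded))

# A lone surrogate is not encodable with Python's strict UTF-8 codec.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
print(lone_surrogate_rejected)
```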

RFC 2279, "UTF-8", is later, dated January 1998.

--Ken

> > Which original definition?
> >
> > Misha


