From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:49:51 CST
Hans Aberg wrote:
> >> You probably mean that the overloaded UTF-BSS (or whatever the correct name
> >> is)
O.k., can we officially retire all the discussion of the nonexistent
name "UTF-BSS", which was an artifact of Philippe Verdy not correctly
recalling the name of "FSS-UTF" when he originally wrote a response
on this thread??
> >
> > I wonder if there's a "correct name" for it. It seems that the most correct
> > name for this transform would be the reference to the old RFC describing it,
> > even if the title of the informative RFC gives "UTF-8" incorrectly; and even
> > if there's a symbolic name to refer to it, but only as a local symbol pointing
> > to the bibliographic reference at end of the text.
>
> I think there is a gap in the standards to not give it a name.
Lookalike extensions of the bit-shifting principles used in UTF-8,
which stretch the scheme into a way of converting 32-bit numbers
in general into byte streams that masquerade as UTF-8, and which acquire
"BS" monikers like UTF-8BS, or CPBTF-8, or whatever, are *NOT*
welcome additions. They are pernicious, because they would inflict
on information processing applications byte streams that walk and
quack like UTF-8 ducks but are not, in fact, ducks.
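
To make the duck problem concrete, here is a minimal sketch in Python. The function `lookalike_encode` is hypothetical (the name and the 31-bit limit are assumptions made for illustration, not anything defined by the standard); it applies the UTF-8 lead/trail byte pattern to any integer without the standard's range restrictions. Python's built-in strict "utf-8" codec stands in for a conformant decoder.

```python
def lookalike_encode(n: int) -> bytes:
    """Hypothetical bit-shifting encoder for any integer below 2**31.

    This is NOT UTF-8: it reuses the lead/trail byte pattern but ignores
    the limits the standard places on well-formed sequences.
    """
    if n < 0 or n >= 1 << 31:
        raise ValueError("out of range for this sketch")
    if n < 0x80:
        return bytes([n])
    # (exclusive upper limit, sequence length, lead-byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if n < limit:
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (n & 0x3F))  # trail byte: 10xxxxxx
        n >>= 6
    out.append(lead | n)  # remaining high bits go into the lead byte
    return bytes(reversed(out))


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Inside the Unicode range the output coincides with real UTF-8 ...
print(lookalike_encode(0x20AC) == "\u20ac".encode("utf-8"))
# ... but one step past 10FFFF it only looks like UTF-8.
print(lookalike_encode(0x110000).hex(), is_well_formed_utf8(lookalike_encode(0x110000)))
```

A consumer that trusts the byte pattern alone cannot tell the two apart, which is exactly why the strict decoder has to reject the second sequence.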
> It makes
> discussions such as this one difficult. Generally, standards just define what is
> legal, and do not provide names for what is outside it.
Read again. The Unicode Standard defines both unassigned code points
(valid code points that have not been designated a function, either
as an encoded character or some other function such as surrogate
code point) *and* it defines *ill-formed* code units in the character
encoding schemes, UTF-8, UTF-16, and UTF-32.
0xFF is an ill-formed code unit in UTF-8. Clearly defined, and clearly
given a name by the standard.
TUS 4.0, p. 76:
"Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed."
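
Both statements are easy to check against a strict decoder. The sketch below uses Python's built-in "utf-8" and "utf-32-be" codecs as stand-ins for conformant decoders; the helper name `accepts` is made up for this example, and the choice of big-endian byte order for the UTF-32 code unit is an assumption made only so that a single 32-bit value can be packed into bytes.

```python
import struct


def accepts(data: bytes, encoding: str) -> bool:
    """True if Python's strict decoder for `encoding` accepts `data`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The byte FF never occurs in well-formed UTF-8.
print(accepts(b"\xff", "utf-8"))                             # False

# A single UTF-32 code unit at, and just above, 0010FFFF.
print(accepts(struct.pack(">I", 0x0010FFFF), "utf-32-be"))   # True
print(accepts(struct.pack(">I", 0x00110000), "utf-32-be"))   # False
```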
> A name like
> CPBTF-8 ("code point to binary transformation format") seems more
> appropriate, since it is not a transformation dealing with characters at all,
> but only dealing with how to transform code points into bytes.
This is an invalid distinction.
Definition D29 in TUS, 4.0, p. 74:
"D29 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence."
It is *not* "a transformation dealing with characters", but a mapping
from Unicode scalar values (shorthand for, and synonymous
with, 0000..D7FF, E000..10FFFF) to code unit sequences (bytes in the
case of UTF-8, 16-bit units [wydes] in the case of UTF-16, and
32-bit words in the case of UTF-32).
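
As a rough illustration of that mapping, the following Python sketch tests whether an integer is a Unicode scalar value and counts the code units each encoding form assigns to it. The function names are made up for this example, and Python's built-in codecs are assumed to behave as conformant encoders; the big-endian variants are used only to avoid a byte order mark in the length count.

```python
def is_scalar_value(cp: int) -> bool:
    """Unicode scalar values: 0000..D7FF and E000..10FFFF."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def code_unit_counts(cp: int) -> dict:
    """Number of code units each encoding form uses for one scalar value."""
    if not is_scalar_value(cp):
        raise ValueError(f"{cp:#x} is not a Unicode scalar value")
    ch = chr(cp)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


print(code_unit_counts(0x0041))   # {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
print(code_unit_counts(0x1F600))  # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
print(is_scalar_value(0xD800))    # False: surrogate code point
```

Nothing in the sketch refers to characters or their assignment status; the input is only a number in the scalar value range.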
--Ken