Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:49:51 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Rick McGowan: "Public Review Issue update"
Maybe in reply to: Hans Aberg: "32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"

Hans Aberg wrote:

> >> You probably mean that the overloaded UTF-BSS (or whatever the correct name
> >> is)

O.k., can we officially retire all the discussion of the nonexistent
name "UTF-BSS", which was an artifact of Philippe Verdy not correctly
recalling the name of "FSS-UTF" when he originally wrote a response
on this thread??

> >
> > I wonder if there's a "correct name" for it. It seems that the most correct
> > name for this transform would be a reference to the old RFC describing it,
> > even if the title of the informative RFC gives "UTF-8" incorrectly; and even
> > if there's a symbolic name to refer to it, but only as a local symbol pointing
> > to the bibliographic reference at the end of the text.
>
> I think there is a gap in the standards to not give it a name.

Lookalike extensions of the bit-shifting principles used in UTF-8,
which stretch the scheme into a way of converting 32-bit numbers
in general into byte streams that masquerade as UTF-8, and which acquire
"BS" monikers like UTF-8BS, or CPBTF-8, or whatever, are *NOT*
welcome additions. They are pernicious, because they would inflict
on information processing applications byte streams that walk and
quack like UTF-8 ducks but are not, in fact, ducks.
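
To make the "not, in fact, ducks" point concrete, here is a minimal Python
sketch. The function encode_extended is hypothetical: it applies the UTF-8
bit-shifting pattern to any value up to 0x7FFFFFFF, using up to six bytes.
It is not defined by the Unicode Standard and is shown only to illustrate
what a conformant UTF-8 decoder does with its output. The helper
is_well_formed_utf8 simply relies on Python's own strict decoder.

```python
def encode_extended(value: int) -> bytes:
    """Hypothetical lookalike encoder: UTF-8 style bit shifting, up to 6 bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for this sketch")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, n in layouts:
        if value < limit:
            out = [lead | (value >> (6 * n))]
            for i in range(n - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable")


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Inside the Unicode range the output coincides with real UTF-8 ...
print(is_well_formed_utf8(encode_extended(0x20AC)))
# ... but beyond 10FFFF it only looks like UTF-8, and the decoder rejects it.
print(is_well_formed_utf8(encode_extended(0x110000)))
print(is_well_formed_utf8(encode_extended(0x7FFFFFFF)))
```

For values up to 10FFFF (surrogates aside) the two encodings agree byte for
byte, which is exactly why the lookalike output is easy to mistake for UTF-8.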

> It makes
> discussions like this one difficult. Generally, standards just define what is
> legal, and do not provide names for what is outside it.

Read again. The Unicode Standard defines both unassigned code points
(valid code points that have not been designated a function, either
as an encoded character or some other function such as surrogate
code point) *and* it defines *ill-formed* code units in the character
encoding schemes, UTF-8, UTF-16, and UTF-32.

0xFF is an ill-formed code unit in UTF-8. Clearly defined, and clearly
given a name by the standard.
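
As an illustration (a sketch, not text from the standard), the following
Python enumerates every Unicode scalar value, collects the byte values that
occur in its UTF-8 encoding, and reports the bytes that never occur. The
function name bytes_used_by_utf8 is made up for this example, and the check
relies on Python's built-in UTF-8 codec.

```python
def bytes_used_by_utf8() -> set:
    """Collect every byte value that appears in the UTF-8 form of some scalar value."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not Unicode scalar values
        used.update(chr(cp).encode("utf-8"))
    return used


never_used = sorted(set(range(256)) - bytes_used_by_utf8())
print([f"{b:02X}" for b in never_used])

# The strict decoder treats 0xFF accordingly.
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The unused set comes out as C0, C1, and F5 through FF, so 0xFF cannot appear
anywhere in a well-formed UTF-8 byte sequence.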

TUS 4.0, p. 76:

"Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed."

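A minimal Python sketch of that rule follows. The function names are
invented for illustration. The surrogate check is an addition beyond the
sentence quoted above (the standard also treats D800..DFFF as ill-formed in
UTF-32), and the second helper only reports what Python's own "utf-32-be"
codec does with a single big-endian code unit.

```python
MAX_SCALAR = 0x10FFFF


def is_well_formed_utf32_unit(unit: int) -> bool:
    """Check one 32-bit code unit against the UTF-32 well-formedness rules."""
    if not 0 <= unit <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit code unit")
    if unit > MAX_SCALAR:
        return False  # the rule quoted from TUS 4.0, p. 76
    # Assumption labeled above: surrogate code points are also ill-formed.
    return not (0xD800 <= unit <= 0xDFFF)


def python_decodes_utf32(unit: int) -> bool:
    """True if Python's UTF-32 decoder accepts the single big-endian code unit."""
    try:
        unit.to_bytes(4, "big").decode("utf-32-be")
    except UnicodeDecodeError:
        return False
    return True


for unit in (0x10FFFF, 0x110000, 0xFFFFFFFF):
    print(f"{unit:08X}", is_well_formed_utf32_unit(unit), python_decodes_utf32(unit))
```
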
> A name like
> CPBTF-8 ("code point to binary transformation format") seems more
> appropriate, since it is not a transformation dealing with characters at all,
> but only dealing with how to transform code points into bytes.

This is an invalid distinction.

Definition D29 in TUS, 4.0, p. 74:

"D29 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence."

It is *not* "a transformation dealing with characters", but a mapping
from Unicode scalar values (shorthand for, and synonymous with, the
code points 0000..D7FF and E000..10FFFF) to code unit sequences (bytes
in the case of UTF-8, 16-bit units [wydes] in the case of UTF-16, and
32-bit words in the case of UTF-32).
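
The mapping can be sketched in a few lines of Python. The function
code_units is a made-up name for this example, and it leans on Python's
built-in codecs rather than implementing the bit manipulation by hand. It
takes one scalar value and returns its code unit sequence in the requested
encoding form.

```python
import struct


def code_units(scalar: int, form: str) -> list:
    """Return the code unit sequence for one Unicode scalar value."""
    if not (0 <= scalar <= 0xD7FF or 0xE000 <= scalar <= 0x10FFFF):
        raise ValueError("not a Unicode scalar value")
    ch = chr(scalar)
    if form == "UTF-8":
        return list(ch.encode("utf-8"))  # 8-bit code units
    if form == "UTF-16":
        data = ch.encode("utf-16-be")
        return list(struct.unpack(">%dH" % (len(data) // 2), data))  # 16-bit units
    if form == "UTF-32":
        return list(struct.unpack(">I", ch.encode("utf-32-be")))  # one 32-bit unit
    raise ValueError("unknown encoding form")


# U+1D11E MUSICAL SYMBOL G CLEF in each of the three encoding forms.
for form in ("UTF-8", "UTF-16", "UTF-32"):
    print(form, [f"{u:X}" for u in code_units(0x1D11E, form)])
```

Nothing in the function refers to characters or their properties. The input
is a number in the scalar value range and the output is a sequence of code
units, which is all that D29 asks of an encoding form.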

--Ken


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 13:51:38 CST