Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 14:38:26 CST


    On 2005/01/19 20:49, Kenneth Whistler at kenw@sybase.com wrote:

    >>>> You probably mean that the overloaded UTF-BSS (or whatever the
    >>>> correct name is)
    >
    > O.k., can we officially retire all the discussion of the nonexistent
    > name "UTF-BSS", which was an artifact of Philippe Verdy not correctly
    > recalling the name of "FSS-UTF" when he originally wrote a response
    > on this thread??

    I thought it was already retired.

    >>> I wonder if there's a "correct name" for it. It seems that the most
    >>> correct name for this transform would be a reference to the old RFC
    >>> describing it, even if the title of that informative RFC gives "UTF-8"
    >>> incorrectly; and even if there's a symbolic name to refer to it, it is
    >>> only a local symbol pointing to the bibliographic reference at the end
    >>> of the text.
    >>
    >> I think it is a gap in the standards that they do not give it a name.
    >
    > Lookalike extensions of the bit-shifting principles used in UTF-8
    > to extend the scheme to being a way of converting 32-bit numbers
    > in general into byte streams that masquerade as UTF-8, and acquire
    > "BS" monikers like UTF-8BS, or CPBTF-8, or whatever, are *NOT*
    > welcome additions. They are pernicious, because they would inflict
    > on information processing applications byte streams that walk and
    > quack like UTF-8 ducks but are not, in fact, ducks.

    I think you need to get an anchor in the world of real programs. Check
    <http://www.cl.cam.ac.uk/~mgk25/unicode.html>. There are a number of UNIX
    tools that just process bytes, and so will not be UTF-8 conformant in the
    sense that the Unicode people have dreamt up, even though they are
    perfectly capable of processing UTF-8 data. One will in general not check
    that a file is UTF-8, for the same reason that one does not check that it
    is ASCII. Only some tools will do that.

    A format like CPBTF-8 would have nothing to do with UTF-8 as a character
    encoding, but UTF-8 would relate to it in the sense of being a
    specialization of it. That can't be hard for people to understand.
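
    For concreteness, here is a minimal sketch (my own, in C; not from the
    standard or from this thread) of the old FSS-UTF/RFC 2279 style
    bit-shifting table, which converts any value below 2^31 into one to six
    bytes. UTF-8 proper is this table cut off at 0x10FFFF and four bytes; the
    longer forms are exactly the byte streams that masquerade as UTF-8.

        #include <stddef.h>
        #include <stdint.h>

        /* Sketch of the RFC 2279-era scheme: encode any value < 2^31
           into 1..6 bytes.  Sequences of 5 or 6 bytes (and 4-byte
           sequences above 0x10FFFF) are ill-formed in UTF-8 proper. */
        static size_t encode_rfc2279(uint32_t v, unsigned char out[6])
        {
            if (v < 0x80) {                  /* 0xxxxxxx */
                out[0] = (unsigned char)v;
                return 1;
            } else if (v < 0x800) {          /* 110xxxxx 10xxxxxx */
                out[0] = 0xC0 | (v >> 6);
                out[1] = 0x80 | (v & 0x3F);
                return 2;
            } else if (v < 0x10000) {        /* 1110xxxx + 2 trail bytes */
                out[0] = 0xE0 | (v >> 12);
                out[1] = 0x80 | ((v >> 6) & 0x3F);
                out[2] = 0x80 | (v & 0x3F);
                return 3;
            } else if (v < 0x200000) {       /* 11110xxx + 3 trail bytes */
                out[0] = 0xF0 | (v >> 18);
                out[1] = 0x80 | ((v >> 12) & 0x3F);
                out[2] = 0x80 | ((v >> 6) & 0x3F);
                out[3] = 0x80 | (v & 0x3F);
                return 4;
            } else if (v < 0x4000000) {      /* 111110xx + 4 trail bytes */
                out[0] = 0xF8 | (v >> 24);
                out[1] = 0x80 | ((v >> 18) & 0x3F);
                out[2] = 0x80 | ((v >> 12) & 0x3F);
                out[3] = 0x80 | ((v >> 6) & 0x3F);
                out[4] = 0x80 | (v & 0x3F);
                return 5;
            } else if (v < 0x80000000) {     /* 1111110x + 5 trail bytes */
                out[0] = 0xFC | (v >> 30);
                out[1] = 0x80 | ((v >> 24) & 0x3F);
                out[2] = 0x80 | ((v >> 18) & 0x3F);
                out[3] = 0x80 | ((v >> 12) & 0x3F);
                out[4] = 0x80 | ((v >> 6) & 0x3F);
                out[5] = 0x80 | (v & 0x3F);
                return 6;
            }
            return 0;                        /* >= 2^31: not encodable */
        }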

    >> It makes
    >> discussions like this one difficult. Generally, standards just define
    >> what is legal, and do not provide names for what is outside it.
    >
    > Read again. The Unicode Standard defines both unassigned code points
    > (valid code points that have not been designated a function, either
    > as an encoded character or some other function such as surrogate
    > code point) *and* it defines *ill-formed* code units in the character
    > encoding schemes, UTF-8, UTF-16, and UTF-32.

    Yes, we all know that those are illegal according to Unicode, but they are
    not formally describable in any other sense.

    > 0xFF is an ill-formed code unit in UTF-8. Clearly defined, and clearly
    > given a name by the standard.

    And surely all the other ill-formed values can likewise be written as
    hexadecimal values. That much we have already figured out.
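
    As a sketch of my own (in C; the function name is hypothetical), the
    complete list of byte values that, like 0xFF, can never occur anywhere in
    well-formed UTF-8 under the current (<= 0x10FFFF) definition is short:

        #include <stdbool.h>

        /* 0xC0 and 0xC1 could only start overlong two-byte forms, and
           0xF5..0xFF would lead sequences beyond U+10FFFF, so none of
           these bytes can appear in well-formed UTF-8 at all. */
        static bool byte_can_occur_in_utf8(unsigned char b)
        {
            return !(b == 0xC0 || b == 0xC1 || b >= 0xF5);
        }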

    > TUS 4.0, p. 76:
    >
    > "Any UTF-32 code unit greater than 0010FFFF<sub>16</sub> is ill-formed."

    The situation is the same as with the values > 0x7F being illegal in
    ASCII. When people made ASCII, they fantasized that it was the end of it,
    and that the full 8 bits would never be used; at least Don Knuth says so.
    Now the Unicode people evidently want people to pretend that the values
    > 0x10FFFF don't exist.
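
    The quoted rule is at least mechanical to state; here is a minimal sketch
    of my own, in C, combining it with the exclusion of the surrogate range:

        #include <stdbool.h>
        #include <stdint.h>

        /* A UTF-32 code unit is well-formed exactly when it is a
           Unicode scalar value: at most 0x10FFFF and not a surrogate
           code point. */
        static bool utf32_unit_is_well_formed(uint32_t u)
        {
            if (u > 0x10FFFF)
                return false;               /* beyond the codespace */
            if (u >= 0xD800 && u <= 0xDFFF)
                return false;               /* surrogate code points */
            return true;
        }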

    >> A name like
    >> CPBTF-8 ("code point to binary transformation format") seems more
    >> appropriate, since it is not a transformation dealing with characters
    >> at all, but only with how to transform code points into bytes.
    >
    > This is an invalid distinction.
    >
    > Definition D29 in TUS, 4.0, p. 74:
    >
    > "D29 A Unicode encoding form assigns each Unicode scalar value to a
    > unique code unit sequence."
    >
    > It is *not* "a transformation dealing with characters", but a mapping
    > from Unicode scalar values (shorthand for, and synonymous with,
    > 0000..D7FF, E000..10FFFF) to code unit sequences (bytes in the
    > case of UTF-8, 16-bit units [wydes] in the case of UTF-16, and
    > 32-bit words in the case of UTF-32).

    My guess is that everything that fits into a computer will be binary
    numbers and transformations thereof; if you know of a counterexample,
    please let me know. But the point of computers seems also to be that
    humans can associate these binary numbers with various humanly
    understandable structures, and I believe the point of Unicode is that one
    associates characters with the Unicode numbers. So CPBTF-8 would be a
    transformation in which the code points are not thought of as being
    associated with Unicode characters, whereas the point of Unicode is
    precisely that they are.
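
    Read as code, D29 makes the encoding form a partial function whose domain
    is the scalar values; a minimal sketch of my own in C (the function name
    is hypothetical), which is just the earlier bit-shifting table with its
    domain restricted:

        #include <stddef.h>
        #include <stdint.h>

        /* D29 as code: map a Unicode scalar value (0000..D7FF,
           E000..10FFFF) to its UTF-8 code unit sequence; everything
           else is simply outside the domain of the mapping. */
        static size_t utf8_encode_scalar(uint32_t sv, unsigned char out[4])
        {
            if (sv > 0x10FFFF || (sv >= 0xD800 && sv <= 0xDFFF))
                return 0;                   /* not a scalar value */
            if (sv < 0x80) {
                out[0] = (unsigned char)sv;
                return 1;
            } else if (sv < 0x800) {
                out[0] = 0xC0 | (sv >> 6);
                out[1] = 0x80 | (sv & 0x3F);
                return 2;
            } else if (sv < 0x10000) {
                out[0] = 0xE0 | (sv >> 12);
                out[1] = 0x80 | ((sv >> 6) & 0x3F);
                out[2] = 0x80 | (sv & 0x3F);
                return 3;
            } else {
                out[0] = 0xF0 | (sv >> 18);
                out[1] = 0x80 | ((sv >> 12) & 0x3F);
                out[2] = 0x80 | ((sv >> 6) & 0x3F);
                out[3] = 0x80 | (sv & 0x3F);
                return 4;
            }
        }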

      Hans Aberg


