Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 14:38:26 CST


    On 2005/01/19 20:49, Kenneth Whistler at kenw@sybase.com wrote:

    >>>> You probably mean that the overloaded UTF-BSS (or whatever the
    >>>> correct name is)
    >
    > O.k., can we officially retire all the discussion of the nonexistent
    > name "UTF-BSS", which was an artifact of Philippe Verdy not correctly
    > recalling the name of "FSS-UTF" when he originally wrote a response
    > on this thread??

    I thought it was already retired.

    >>> I wonder if there's a "correct name" for it. It seems that the most
    >>> correct name for this transform would be a reference to the old RFC
    >>> describing it, even if the title of that informative RFC gives "UTF-8"
    >>> incorrectly; and even if there's a symbolic name to refer to it, it is
    >>> only a local symbol pointing to the bibliographic reference at the end
    >>> of the text.
    >>
    >> I think it is a gap in the standards that they do not give it a name.
    >
    > Lookalike extensions of the bit-shifting principles used in UTF-8
    > to extend the scheme to being a way of converting 32-bit numbers
    > in general into byte streams that masquerade as UTF-8, and acquire
    > "BS" monikers like UTF-8BS, or CPBTF-8, or whatever, are *NOT*
    > welcome additions. They are pernicious, because they would inflict
    > on information processing applications byte streams that walk and
    > quack like UTF-8 ducks but are not, in fact, ducks.

    I think you need to get an anchor in the world of real programs. Check
    <http://www.cl.cam.ac.uk/~mgk25/unicode.html>. There are a number of UNIX
    tools that just process bytes, and so will not be UTF-8 conformant in the
    sense that the Unicode people have dreamt up, even though they are
    perfectly capable of processing UTF-8 data. One will in general not check
    that a file is UTF-8, for the same reason that one does not check that it
    is ASCII. Only some tools will do that.

    A format like CPBTF-8 would have nothing to do with UTF-8 as a character
    encoding, but UTF-8 would relate to it in the sense of being a
    specialization of it. That can't be hard for people to understand.
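
    For concreteness, here is a minimal sketch (my own, in C; not from the
    standard or from this thread) of the old FSS-UTF/RFC 2279 style
    bit-shifting table, which converts any value below 2^31 into one to six
    bytes. UTF-8 proper is this table cut off at 0x10FFFF and four bytes; the
    longer forms are exactly the byte streams that masquerade as UTF-8.

        #include <stddef.h>
        #include <stdint.h>

        /* Sketch of the RFC 2279-era scheme: encode any value < 2^31
           into 1..6 bytes.  Sequences of 5 or 6 bytes (and 4-byte
           sequences above 0x10FFFF) are ill-formed in UTF-8 proper. */
        static size_t encode_rfc2279(uint32_t v, unsigned char out[6])
        {
            if (v < 0x80) {                  /* 0xxxxxxx */
                out[0] = (unsigned char)v;
                return 1;
            } else if (v < 0x800) {          /* 110xxxxx 10xxxxxx */
                out[0] = 0xC0 | (v >> 6);
                out[1] = 0x80 | (v & 0x3F);
                return 2;
            } else if (v < 0x10000) {        /* 1110xxxx + 2 trail bytes */
                out[0] = 0xE0 | (v >> 12);
                out[1] = 0x80 | ((v >> 6) & 0x3F);
                out[2] = 0x80 | (v & 0x3F);
                return 3;
            } else if (v < 0x200000) {       /* 11110xxx + 3 trail bytes */
                out[0] = 0xF0 | (v >> 18);
                out[1] = 0x80 | ((v >> 12) & 0x3F);
                out[2] = 0x80 | ((v >> 6) & 0x3F);
                out[3] = 0x80 | (v & 0x3F);
                return 4;
            } else if (v < 0x4000000) {      /* 111110xx + 4 trail bytes */
                out[0] = 0xF8 | (v >> 24);
                out[1] = 0x80 | ((v >> 18) & 0x3F);
                out[2] = 0x80 | ((v >> 12) & 0x3F);
                out[3] = 0x80 | ((v >> 6) & 0x3F);
                out[4] = 0x80 | (v & 0x3F);
                return 5;
            } else if (v < 0x80000000) {     /* 1111110x + 5 trail bytes */
                out[0] = 0xFC | (v >> 30);
                out[1] = 0x80 | ((v >> 24) & 0x3F);
                out[2] = 0x80 | ((v >> 18) & 0x3F);
                out[3] = 0x80 | ((v >> 12) & 0x3F);
                out[4] = 0x80 | ((v >> 6) & 0x3F);
                out[5] = 0x80 | (v & 0x3F);
                return 6;
            }
            return 0;                        /* >= 2^31: not encodable */
        }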

    >> It makes
    >> discussions like this one difficult. Generally, standards just define
    >> what is legal, and do not provide names for what is outside it.
    >
    > Read again. The Unicode Standard defines both unassigned code points
    > (valid code points that have not been designated a function, either
    > as an encoded character or some other function such as surrogate
    > code point) *and* it defines *ill-formed* code units in the character
    > encoding schemes, UTF-8, UTF-16, and UTF-32.

    Yes, we all know that those are illegal according to Unicode, but they are
    not formally describable in any other sense.

    > 0xFF is an ill-formed code unit in UTF-8. Clearly defined, and clearly
    > given a name by the standard.

    And surely all the other ill-formed values can likewise be written as
    hexadecimal values. That much we have already figured out.
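
    As a sketch of my own (in C; the function name is hypothetical), the
    complete list of byte values that, like 0xFF, can never occur anywhere in
    well-formed UTF-8 under the current (<= 0x10FFFF) definition is short:

        #include <stdbool.h>

        /* 0xC0 and 0xC1 could only start overlong two-byte forms, and
           0xF5..0xFF would lead sequences beyond U+10FFFF, so none of
           these bytes can appear in well-formed UTF-8 at all. */
        static bool byte_can_occur_in_utf8(unsigned char b)
        {
            return !(b == 0xC0 || b == 0xC1 || b >= 0xF5);
        }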

    > TUS 4.0, p. 76:
    >
    > "Any UTF-32 code unit greater than 0010FFFF<sub>16</sub> is ill-formed."

    The situation is the same as with the values > 0x7F being illegal in
    ASCII. When people made ASCII, they fantasized that it was the end of it,
    and that the full 8 bits would never be used; at least Don Knuth says so.
    Now the Unicode people evidently want people to pretend that the values
    > 0x10FFFF don't exist.
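
    The quoted rule is at least mechanical to state; here is a minimal sketch
    of my own, in C, combining it with the exclusion of the surrogate range:

        #include <stdbool.h>
        #include <stdint.h>

        /* A UTF-32 code unit is well-formed exactly when it is a
           Unicode scalar value: at most 0x10FFFF and not a surrogate
           code point. */
        static bool utf32_unit_is_well_formed(uint32_t u)
        {
            if (u > 0x10FFFF)
                return false;               /* beyond the codespace */
            if (u >= 0xD800 && u <= 0xDFFF)
                return false;               /* surrogate code points */
            return true;
        }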

    >> A name like
    >> CPBTF-8 ("code point to binary transformation format") seems more
    >> appropriate, since it is not a transformation dealing with characters
    >> at all, but only with how to transform code points into bytes.
    >
    > This is an invalid distinction.
    >
    > Definition D29 in TUS, 4.0, p. 74:
    >
    > "D29 A Unicode encoding form assigns each Unicode scalar value to a
    > unique code unit sequence."
    >
    > It is *not* "a transformation dealing with characters", but a mapping
    > from Unicode scalar values (shorthand for, and synonymous with,
    > 0000..D7FF, E000..10FFFF) to code unit sequences (bytes in the
    > case of UTF-8, 16-bit units [wydes] in the case of UTF-16, and
    > 32-bit words in the case of UTF-32).

    My guess is that everything that fits into a computer will be binary
    numbers and transformations thereof; if you know of a counterexample,
    please let me know. But the point of computers seems also to be that
    humans can associate these binary numbers with various humanly
    understandable structures, and I believe the point of Unicode is that one
    associates characters with the Unicode numbers. So CPBTF-8 would be a
    transformation in which the code points are not thought of as being
    associated with Unicode characters, whereas the point of Unicode is
    precisely that they are.
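
    Read as code, D29 makes the encoding form a partial function whose domain
    is the scalar values; a minimal sketch of my own in C (the function name
    is hypothetical), which is just the earlier bit-shifting table with its
    domain restricted:

        #include <stddef.h>
        #include <stdint.h>

        /* D29 as code: map a Unicode scalar value (0000..D7FF,
           E000..10FFFF) to its UTF-8 code unit sequence; everything
           else is simply outside the domain of the mapping. */
        static size_t utf8_encode_scalar(uint32_t sv, unsigned char out[4])
        {
            if (sv > 0x10FFFF || (sv >= 0xD800 && sv <= 0xDFFF))
                return 0;                   /* not a scalar value */
            if (sv < 0x80) {
                out[0] = (unsigned char)sv;
                return 1;
            } else if (sv < 0x800) {
                out[0] = 0xC0 | (sv >> 6);
                out[1] = 0x80 | (sv & 0x3F);
                return 2;
            } else if (sv < 0x10000) {
                out[0] = 0xE0 | (sv >> 12);
                out[1] = 0x80 | ((sv >> 6) & 0x3F);
                out[2] = 0x80 | (sv & 0x3F);
                return 3;
            } else {
                out[0] = 0xF0 | (sv >> 18);
                out[1] = 0x80 | ((sv >> 12) & 0x3F);
                out[2] = 0x80 | ((sv >> 6) & 0x3F);
                out[3] = 0x80 | (sv & 0x3F);
                return 4;
            }
        }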

      Hans Aberg


