RE: any unicode conversion tools?

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu May 13 2004 - 04:22:28 CDT


    Peter Constable wrote:

    > UTF-8 sequences, as originally defined, could be longer than four
    > bytes, in order to address codepoints in the vast expanse of UCS-4 at
    > U+110000..U+FFFFFFFF. Since the accepted code space has been
    > constrained to U+0000..U+10FFFF, only four bytes are needed. There are
    > non-UTF-8s -- beasts that kind of look like UTF-8 but aren't -- in
    > which sequences of varying length represent the same character and
    > sequences of more than four bytes appear, but they are not UTF-8;
    > those byte sequences are considered illegal in UTF-8.
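
    As a rough illustration of the quoted point, here is a minimal Python
    sketch that relies on Python 3's built-in strict "utf-8" codec. The
    helper name is_strict_utf8 is made up for this example; the byte
    strings are hand-built samples of an over-long form, a 5-byte
    sequence, and a value just above U+10FFFF.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hand-built sample sequences (illustrative only).
samples = {
    "U+10FFFF, 4 bytes": b"\xf4\x8f\xbf\xbf",            # highest legal code point
    "U+0000, over-long 2 bytes": b"\xc0\x80",             # non-shortest form
    "5-byte sequence": b"\xf8\x88\x80\x80\x80",           # lead byte 111110xx
    "U+110000, 4 bytes": b"\xf4\x90\x80\x80",             # just past the code space
}

for label, raw in samples.items():
    print(f"{label}: {'accepted' if is_strict_utf8(raw) else 'rejected'}")
```

    Only the first sample should be accepted; the other three are the
    kind of look-alike sequences described above.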

    1. UCS-4, which is still defined by 10646 (but was never defined by
        Unicode), is limited to U-7FFF FFFF (nitpick: for some reason it's
        "U-", not "U+"; don't ask me why). U-FFFF FFFF has always been
        out of range, probably so that one could use "signed" 32-bit
        ints (not all programming languages have unsigned integer types).

    2. That "original" definition of UTF-8 (which was never in Unicode)
        is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
        not the same as UTF-8/10646. In practice it does not matter
        very much, since no characters are (or ever will be)
        allocated above U+10FFFF, and the private use planes above
        U+10FFFF (which were specified in 10646) have been removed.

    3. Another nitpick: To reach up to (and above...) U-FFFF FFFF in a
        UTF-8-like encoding would raise the maximum number of bytes per
        character to 7. There would be no data bits in the first byte of
        a 7-byte sequence though, as it would consist of exactly seven
        1 bits followed by one 0 bit (11111110). ;-)
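
    The arithmetic behind point 3 can be checked with a short Python
    sketch. It assumes the usual UTF-8 layout (a lead byte of n 1 bits,
    a 0 bit, then data bits, followed by n-1 continuation bytes carrying
    6 data bits each) and extends it to a 7-byte form that is purely
    hypothetical and not part of any standard. The function names are
    invented for this example.

```python
def lead_byte_prefix(n: int) -> int:
    """Marker bits of the lead byte of an n-byte sequence (n >= 2)."""
    return (0xFF << (8 - n)) & 0xFF


def payload_bits(n: int) -> int:
    """Data bits carried by an n-byte sequence in a UTF-8-like scheme.

    n = 1..6 matches the original 10646-style definition; n = 7 is a
    hypothetical extension used only for this calculation.
    """
    if n == 1:
        return 7                      # 0xxxxxxx
    lead_data_bits = 7 - n            # n ones, one zero, the rest is data
    return lead_data_bits + 6 * (n - 1)


def bytes_needed(code: int) -> int:
    """Smallest n (1..7) whose payload can hold the value `code`."""
    for n in range(1, 8):
        if code < (1 << payload_bits(n)):
            return n
    raise ValueError("value does not fit in 7 bytes")


for n in range(1, 8):
    print(n, "bytes:", payload_bits(n), "data bits")

print(bytes_needed(0x10FFFF))    # top of the Unicode code space
print(bytes_needed(0x7FFFFFFF))  # top of UCS-4
print(bytes_needed(0xFFFFFFFF))  # needs the hypothetical 7-byte form
print(format(lead_byte_prefix(7), "08b"))
```

    Under these assumptions, six bytes carry 31 data bits, which covers
    U-7FFF FFFF exactly, while a seventh byte is needed for anything
    beyond that and its lead byte has no room left for data.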

                    /kent k


