From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu May 13 2004 - 04:22:28 CDT
Peter Constable wrote:
> UTF-8 sequences, as originally defined, could be longer than four
> bytes, in order to address codepoints in the vast expanse of UCS-4
> at U+110000..U+FFFFFFFF. Since the accepted code space has been
> constrained to U+0000..U+10FFFF, only four bytes are needed. There
> are non-UTF-8s -- beasts that kind of look like UTF-8 but aren't --
> in which sequences of varying length represent the same character
> and sequences of more than four bytes appear, but they are not
> UTF-8; those byte sequences are considered illegal in UTF-8.
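For concreteness, a minimal lead-byte check reflecting that four-byte
limit could look like the C sketch below (the function name is made
up, and only the first byte of a sequence is checked; continuation
bytes need their own checks):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: accept lead bytes of well-formed UTF-8 as Unicode
       now defines it (at most 4 bytes). 0xC0/0xC1 would only start
       overlong sequences, 0xF5..0xFD would start sequences for values
       above U+10FFFF (the 5- and 6-byte forms included), and
       0xFE/0xFF never occur in UTF-8 at all. */
    static bool utf8_lead_byte_ok(uint8_t b)
    {
        return b <= 0x7F                    /* 1 byte: ASCII */
            || (b >= 0xC2 && b <= 0xDF)     /* 2-byte lead   */
            || (b >= 0xE0 && b <= 0xEF)     /* 3-byte lead   */
            || (b >= 0xF0 && b <= 0xF4);    /* 4-byte lead   */
    }

    int main(void)
    {
        /* 0xF4 is fine; 0xF8 (a 5-byte lead) and 0xFE are not. */
        printf("%d %d %d\n", utf8_lead_byte_ok(0xF4),
               utf8_lead_byte_ok(0xF8), utf8_lead_byte_ok(0xFE));
        return 0;
    }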
1. UCS-4, which is still defined by 10646 (but never by Unicode),
is limited to U-7FFF FFFF (nitpick: for some reason it's "U-",
not "U+"; don't ask me why). U-FFFF FFFF has always been out of
range, probably so that one could use "signed" 32-bit ints (not
all programming languages have unsigned integer types); see the
INT_MAX sketch below.
2. That "original" definition of UTF-8 (which was never in Unicode)
is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
not the same as UTF-8/10646. In practice it does not matter
very much, since there are no (and will never be) any characters
allocated above U+10FFFF, and the private use planes above
U+10FFFF (which were specified in 10646) have been removed.
3. Another nitpick: to reach up to (and above...) U-FFFF FFFF in a
UTF-8-like encoding would push the maximum number of bytes per
character to 7. There would be no data bits in the first byte of
a 7-byte sequence, though, as it would consist of exactly seven
1s and one 0 (see the data-bits sketch below). ;-)
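To put a number on point 1 (my own sketch, nothing from the standards
themselves): U-7FFF FFFF is exactly 2^31 - 1, the largest value a
two's-complement signed 32-bit integer can hold, so every UCS-4 value
fits in a plain signed 32-bit int:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0x7FFFFFFF = 2^31 - 1; 0xFFFFFFFF = 2^32 - 1 would need an
           unsigned type (or a wider signed one). */
        printf("UCS-4 ceiling : %ld\n", 0x7FFFFFFFL);
        printf("32-bit INT_MAX: %d\n", INT_MAX); /* same number where
                                                    int is 32 bits */
        return 0;
    }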
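And for points 2 and 3, a back-of-the-envelope calculation (again just
a sketch; the helper name data_bits is invented): an n-byte UTF-8-style
sequence carries 7 - n data bits in the lead byte (for n >= 2) plus 6
per continuation byte, so 4 bytes give 21 bits (enough for U+10FFFF,
where Unicode's UTF-8 stops), 6 bytes give 31 bits (U-7FFF FFFF, where
10646's UTF-8 stops), and only a 7-byte sequence, led by the
data-bit-free byte 0xFE = 1111 1110, reaches 32 bits and beyond:

    #include <stdio.h>

    /* Invented helper: data bits in an n-byte UTF-8-style sequence
       (lead byte: n ones, a zero, then 7 - n data bits for n >= 2;
       continuation bytes: 10xxxxxx, 6 data bits each). */
    static int data_bits(int n)
    {
        return (n == 1) ? 7 : (7 - n) + 6 * (n - 1);
    }

    int main(void)
    {
        for (int n = 1; n <= 7; n++)
            printf("%d byte(s): %2d data bits, max U-%llX\n",
                   n, data_bits(n), (1ULL << data_bits(n)) - 1);
        /* 4 bytes -> 21 bits (max U-1FFFFF, covers U+10FFFF)
           6 bytes -> 31 bits (max U-7FFFFFFF, the 10646 limit)
           7 bytes -> 36 bits (lead byte 0xFE, no data bits in it) */
        return 0;
    }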
/kent k