From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sat Jan 20 2007 - 19:32:59 CST
Ruszlan Gaszanov wrote:
> Any comments?
Some of your arguments like "won't need a BOM anymore" don't make
sense to me, but the UTF-24A idea is nice. Even if I'm lost in
a sequence of UTF-24A octets I can always find the start or end
of a UTF-24A code point: 1P0 can be 100 or 110, therefore bytes
with MSB 1 are the start unless the previous byte also has MSB 1,
and then the previous byte is the start. Similarly LSB 0 could
be used to determine the end.
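
A rough sketch of that resynchronisation rule, in Python. It assumes
that "1P0" names the most significant bits of the three octets of one
code point (first octet 1, second octet the parity bit P, third octet
0), which is what the rule above implies; the payload layout is not
needed here. The name find_start is made up for illustration, and the
input is assumed to be well formed.

```python
def find_start(data: bytes, i: int) -> int:
    """Return the index of the first octet of the code point covering data[i].

    Assumption: the octet MSBs of every code point follow the pattern 1 P 0.
    """
    j = i
    # An octet with MSB 0 is the second (P=0) or third octet:
    # step back to the nearest octet with MSB 1.
    while j > 0 and not data[j] & 0x80:
        j -= 1
    if not data[j] & 0x80:
        raise ValueError("no octet with MSB 1 found before this position")
    # An octet with MSB 1 is the start, unless the previous octet
    # also has MSB 1; then the previous octet is the start.
    if j > 0 and data[j - 1] & 0x80:
        return j - 1
    return j


# Two code points: MSB pattern 1,0,0 followed by MSB pattern 1,1,0.
stream = bytes([0x81, 0x02, 0x03, 0x84, 0x85, 0x06])
starts = [find_start(stream, k) for k in range(len(stream))]
```

At most two octets before the current one are ever inspected, so
recovery from an arbitrary position costs a constant amount of work.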
One disadvantage of your scheme: unlike UTF-8, it can't be directly
expressed in CharmapML. The parity bit destroys simple patterns,
and an enumeration of 2**21 (minus surrogates) code points won't
fly. But BOCU-1 has the same issue, so that's no showstopper.
Maybe you could use a trick: instead of 1P0, use 100 and 110 for
UTF-24E (even) and UTF-24O (odd) CharmapML descriptions, plus a
comment that one half of the real UTF-24 corresponds to UTF-24E
and the other half to UTF-24O.
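
To illustrate the split, here is a small Python sketch that decides
which of the two descriptions a given three-octet sequence would fall
under. It assumes that 1P0 denotes the most significant bits of the
three octets, so that 100 maps to UTF-24E and 110 to UTF-24O; the
function name charmap_half is hypothetical, and how P is computed
from the code point is deliberately left out.

```python
def charmap_half(seq: bytes) -> str:
    """Name the CharmapML description (UTF-24E or UTF-24O) covering seq.

    Assumption: the octet MSBs of one code point are 1, P, 0.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three octets")
    if not seq[0] & 0x80 or seq[2] & 0x80:
        raise ValueError("octet MSBs do not match the pattern 1P0")
    # The MSB of the middle octet is the only bit that differs
    # between the two descriptions.
    return "UTF-24O" if seq[1] & 0x80 else "UTF-24E"


even_half = charmap_half(bytes([0x81, 0x02, 0x03]))  # MSBs 1,0,0
odd_half = charmap_half(bytes([0x84, 0x85, 0x06]))   # MSBs 1,1,0
```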
Compare <http://purl.net/xyzzy/home/test/utf-8.xml> for one of my
two CharmapML experiments.
Frank