From: Doug Ewell (doug@ewellic.org)
Date: Mon Feb 28 2011 - 11:12:55 CST
Petr Tomasek <tomasek at etf dot cuni dot cz> wrote:
> Hm, what about UTF-64? Almost everyone has 64-bit machines today...
Marco Cimarosti, a former co-offender in creating experimental
encodings, described UTF-64 in May 2001. It used 63 bits to encode a
block of either (a) nine 7-bit Basic Latin characters or (b) three
21-bit characters, one of which was presumably not Basic Latin, together
with a 64th bit to indicate the type of block.
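A rough sketch of that block structure in Python follows. Only the 63 + 1 split and the 9 x 7 and 3 x 21 groupings come from the description above. The position of the flag bit, its polarity, the order of characters within a block, and the names pack_block and unpack_block are assumptions made for illustration, and padding of short blocks is not handled.

```python
# Assumed layout (not from the 2001 description): the flag occupies the top
# bit of the 64-bit word, and characters are packed most significant first.
ASCII_FLAG = 0  # assumed: 0 marks a block of nine 7-bit characters
WIDE_FLAG = 1   # assumed: 1 marks a block of three 21-bit characters


def pack_block(code_points):
    """Pack nine 7-bit or three 21-bit code points into one 64-bit integer."""
    if len(code_points) == 9 and all(0 <= cp < 0x80 for cp in code_points):
        flag, width = ASCII_FLAG, 7
    elif len(code_points) == 3 and all(0 <= cp < 0x200000 for cp in code_points):
        flag, width = WIDE_FLAG, 21
    else:
        raise ValueError("block must hold nine 7-bit or three 21-bit values")
    word = flag
    for cp in code_points:
        # Each shift makes room for the next character below the previous ones.
        word = (word << width) | cp
    return word


def unpack_block(word):
    """Recover the list of code points from one 64-bit block."""
    flag = word >> 63
    width, count = (21, 3) if flag == WIDE_FLAG else (7, 9)
    mask = (1 << width) - 1
    return [(word >> (width * i)) & mask for i in reversed(range(count))]
```

Either way the payload is exactly 63 bits (9 x 7 = 3 x 21 = 63), so the flag always lands in bit 63 and the block fits a 64-bit register.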
Van's sarcastic algorithm brings up a few additional goals to add to my
list:
• code units align with machine boundaries (8, 16, 32 bits)
• unique encoded form for each character
• unique encoded form for each character in context, or for each text
• minimize or avoid state
Remember that one point of this list is to demonstrate that not all
goals can be met by a single encoding.
Speaking of goals, Thomas' claim that UTF-c "avoids over-long forms of
characters" turns out not to be true, since characters belonging to the
selected 64-block can still be encoded using the long form. Telling
users to use the shortest form (as UTF-8 does, by rule) is not the same
as having no non-shortest form in the syntax at all (as in UTF-16 and
UTF-32).
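The UTF-8 side of that distinction can be illustrated with a small Python sketch. The helper names here are made up for the example, and the "naive" decoder is a deliberately incomplete stand-in for a decoder that follows only the bit pattern and skips the shortest-form check.

```python
SHORTEST_SLASH = b"\x2f"      # U+002F in its one-byte form
OVERLONG_SLASH = b"\xc0\xaf"  # the same value written in the two-byte pattern


def naive_two_byte_value(data: bytes) -> int:
    """Apply the 110xxxxx 10xxxxxx bit pattern with no shortest-form check."""
    lead, trail = data
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def decodes_as_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The bit pattern itself happily yields 0x2F for the two-byte sequence; it is only an extra rule, enforced by a conforming decoder, that makes the sequence invalid. In UTF-16 and UTF-32 there is no second pattern for U+002F to begin with.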
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s