From: mpsuzuki@hiroshima-u.ac.jp
Date: Sun Feb 20 2011 - 06:41:49 CST
Dear Thomas,
On Sun, 20 Feb 2011 21:47:19 +1100
Thomas Cropley <tomcropley@gmail.com> wrote:
>I have developed a new multi-byte character encoding for Unicode. It is
>similar to UTF-8 but it is more efficient at encoding non-ASCII alphabetic
>scripts. The attached UTF-c.htm file gives more details and the C++ program
>"UTF8_c.cpp" shows how UTF-c files may be processed.
In your proposal, the maximum length of a coded character
is 4 octets, which is less than UTF-8's maximum length.
It's an interesting idea.
I guess your proposal is designed for the convenience of
people who feel that US-ASCII compatibility alone is
insufficient and that compatibility with the ISO 8859
variants is also required.
I have 2 questions:
Q1) I guess the easiest way for people who feel this way
    is to keep using the existing ISO 8859 variants,
    not to migrate to a new encoding. Is there a large
    group of people who want to use both an ISO 8859
    compatible ENCODING and the CHARACTERS outside of it
    at the same time, and who are willing to switch
    their favorite software?
    In the Japanese market, there is a large group of
    people who want to use a legacy ENCODING (like
    Microsoft Codepage 932) together with CHARACTERS
    outside of it (like JIS X 0213:2004), but they cannot
    afford to pay for new software or don't want to
    migrate to newer software. I'm interested in the
    situation in other countries.
Q2) One of the advantages of the UTF-8 encoding is error
    recovery: corrupting an octet breaks the character
    containing it, but the following characters are not
    broken. But your encoding seems to be, sorry, unsafe.
    U+0080 - U+00BF and U+0100 - U+107F are coded by
    similar 2-octet sequences, so removing 1 octet may
    change all of the following characters. I'm afraid
    this would not be welcomed by the people who switched
    from the ISO 2022 encodings to UTF-8 to reduce the
    engineering cost of stateful encodings.
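    To illustrate the self-synchronization I mean, here is
    a minimal C++ sketch of a UTF-8 splitter that resyncs
    after a dropped octet. It only shows the UTF-8 property,
    not your UTF-c format, and the byte values and function
    names are my own example:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Split a byte string into UTF-8 character sequences, skipping stray
    // continuation octets (10xxxxxx) so that damage to one octet only
    // loses the character that contained it.
    static std::vector<std::string> split_utf8(const std::string& bytes)
    {
        std::vector<std::string> chars;
        size_t i = 0;
        while (i < bytes.size()) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            if ((b & 0xC0) == 0x80) { ++i; continue; }    // stray continuation octet
            size_t len = (b < 0x80)           ? 1         // 0xxxxxxx: US-ASCII
                       : ((b & 0xE0) == 0xC0) ? 2         // 110xxxxx: 2-octet lead
                       : ((b & 0xF0) == 0xE0) ? 3         // 1110xxxx: 3-octet lead
                       :                        4;        // 11110xxx: 4-octet lead
            chars.push_back(bytes.substr(i, len));
            i += len;
        }
        return chars;
    }

    int main()
    {
        // "a", U+00E9, "b"  ->  61 C3 A9 62
        std::string ok      = { 'a', char(0xC3), char(0xA9), 'b' };
        // the lead octet C3 is removed; only U+00E9 is lost, 'a' and 'b' survive
        std::string damaged = { 'a', char(0xA9), 'b' };
        std::printf("%zu characters before damage, %zu after\n",
                    split_utf8(ok).size(), split_utf8(damaged).size());
        return 0;
    }

    In a stateful or context-dependent encoding, the same
    experiment would shift the interpretation of every octet
    after the damage; that is the risk I am worried about.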
Regards,
mpsuzuki