From: mpsuzuki@hiroshima-u.ac.jp
Date: Sun Feb 20 2011 - 06:41:49 CST
Dear Thomas,
On Sun, 20 Feb 2011 21:47:19 +1100
Thomas Cropley <tomcropley@gmail.com> wrote:
>I have developed a new multi-byte character encoding for Unicode. It is
>similar to UTF-8 but it is more efficient at encoding non-ASCII alphabetic
>scripts. The attached UTF-c.htm file gives more details and the C++ program
>"UTF8_c.cpp" shows how UTF-c files may be processed.
In your proposal, the maximum length of a coded character
is 4 octets, which is less than UTF-8's maximum length.
It's an interesting idea.
I guess your proposal is designed for the convenience of
people who feel that US-ASCII compatibility alone is
insufficient and that compatibility with the ISO 8859
variants is also required.
I have 2 questions:
Q1) I guess the easiest way for people who feel this way
    is to keep using the existing ISO 8859 variants,
    not to migrate to a new encoding. Is there a large
    group of people who want to use both an ISO 8859
    compatible ENCODING and the CHARACTERS outside of it
    at the same time, and who are willing to switch
    their favorite software?
    In the Japanese market, there is a large group of
    people who want to use a legacy ENCODING (like
    Microsoft Codepage 932) together with CHARACTERS
    outside of it (like JIS X 0213:2004), but they cannot
    afford to pay for new software or don't want to
    migrate to newer software. I'm interested in the
    situation in other countries.
Q2) One of the advantages of the UTF-8 encoding is error
    recovery: corrupting an octet breaks the character
    containing it, but the following characters are not
    broken. But your encoding seems to be, sorry, unsafe.
    U+0080 - U+00BF and U+0100 - U+107F are coded by
    similar 2-octet sequences, so removing 1 octet may
    change all of the following characters. I'm afraid
    this would not be welcomed by the people who switched
    from the ISO 2022 encodings to UTF-8 to reduce the
    engineering cost of stateful encodings.
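    To illustrate the self-synchronization I mean, here is
    a minimal C++ sketch of a UTF-8 splitter that resyncs
    after a dropped octet. It only shows the UTF-8 property,
    not your UTF-c format, and the byte values and function
    names are my own example:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Split a byte string into UTF-8 character sequences, skipping stray
    // continuation octets (10xxxxxx) so that damage to one octet only
    // loses the character that contained it.
    static std::vector<std::string> split_utf8(const std::string& bytes)
    {
        std::vector<std::string> chars;
        size_t i = 0;
        while (i < bytes.size()) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            if ((b & 0xC0) == 0x80) { ++i; continue; }    // stray continuation octet
            size_t len = (b < 0x80)           ? 1         // 0xxxxxxx: US-ASCII
                       : ((b & 0xE0) == 0xC0) ? 2         // 110xxxxx: 2-octet lead
                       : ((b & 0xF0) == 0xE0) ? 3         // 1110xxxx: 3-octet lead
                       :                        4;        // 11110xxx: 4-octet lead
            chars.push_back(bytes.substr(i, len));
            i += len;
        }
        return chars;
    }

    int main()
    {
        // "a", U+00E9, "b"  ->  61 C3 A9 62
        std::string ok      = { 'a', char(0xC3), char(0xA9), 'b' };
        // the lead octet C3 is removed; only U+00E9 is lost, 'a' and 'b' survive
        std::string damaged = { 'a', char(0xA9), 'b' };
        std::printf("%zu characters before damage, %zu after\n",
                    split_utf8(ok).size(), split_utf8(damaged).size());
        return 0;
    }

    In a stateful or context-dependent encoding, the same
    experiment would shift the interpretation of every octet
    after the damage; that is the risk I am worried about.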
Regards,
mpsuzuki