Re: [Encode] Compound Unicode Character Support in UCM

From: Dan Kogai (dankogai@dan.co.jp)
Date: Mon Apr 01 2002 - 11:48:49 EST


On Tuesday, April 2, 2002, at 01:24 , Nick Ing-Simmons wrote:
> Dan Kogai <dankogai@dan.co.jp> writes:
>>>>
>>>> I don't like the <UNNNN+UMMMM> part; it will make the parsing messier.
>>>>
>>>> The \xYY\xYY is of course what I meant ;-)
>>>
>>> Not that much. It's just a regex after all.
>
> For _perl_ it is, but if we are going to get IBM's ICU or others
> to back-port it, then it is better to keep things clean.

Point well taken.

> So let us have yacc-like:
>
>   from      : codepoint
>             | from codepoint
>             ;
>
>   codepoint : '<' 'U' hexdigits '>'
>             ;
>
>   to        : octet
>             | to octet
>             ;
>
>   octet     : '\\' 'x' hexdigits
>             ;
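
For readers following the grammar above, here is a minimal sketch of a reader for one mapping line, written in Python rather than yacc. It is not code from Encode or ICU. The names `parse_sequence` and `parse_mapping` are hypothetical, the two-hex-digit limit on an octet is an assumption (the grammar only says "hexdigits"), and anything after the second whitespace-separated field is simply ignored.

```python
import re

# Token shapes taken from the yacc-like grammar quoted above.
CODEPOINT = re.compile(r"<U([0-9A-Fa-f]+)>")
# Assumption: an octet is exactly two hex digits.
OCTET = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_sequence(text, token):
    """Parse text as one or more back-to-back tokens; return their values."""
    values = []
    pos = 0
    while pos < len(text):
        match = token.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}: {text!r}")
        values.append(int(match.group(1), 16))
        pos = match.end()
    if not values:
        raise ValueError("empty sequence")
    return values


def parse_mapping(line):
    """Split a 'from to' line into (list of codepoints, bytes)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected 'from to', got: {line!r}")
    codepoints = parse_sequence(fields[0], CODEPOINT)
    octets = bytes(parse_sequence(fields[1], OCTET))
    return codepoints, octets


if __name__ == "__main__":
    # An ordinary single character.
    print(parse_mapping(r"<U3000> \xA1\xA1"))
    # A compound character; the target bytes here are illustrative only.
    print(parse_mapping(r"<U304B><U309A> \xA4\xF7"))
```

Because both sides are just repeated tokens, a single-character entry and a compound entry go through the same code path, which is the point of keeping the grammar this small.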

Your suggestion is

\xAA\xAA\xBB\xBB \xCC\xCC

for compound characters and leave

<U3000> \xA1\xA1

for an ordinary single character. Did I get this right?
But I still feel more comfortable keeping a distinction between a Unicode
character (codepoint != UTF-8 octets) and octets. And as for the octets,
which representation do you think is correct: just UCS stacked, or UTF-8?
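
To make the two candidates concrete, here is a small Python sketch that prints both octet forms for a codepoint sequence. It reads "UCS stacked" as big-endian UCS-2 code units concatenated, which is an assumption on my part; for codepoints in the BMP, Python's "utf-16-be" codec yields exactly those bytes. The helper names are hypothetical.

```python
def as_escapes(data):
    """Render bytes in the \\xYY notation used in UCM files."""
    return "".join(f"\\x{b:02X}" for b in data)


def ucs2_stacked(codepoints):
    """Concatenate big-endian 16-bit units (assumes BMP codepoints only)."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-16-be")


def utf8(codepoints):
    """Encode the same codepoint sequence as UTF-8."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


if __name__ == "__main__":
    compound = [0x304B, 0x309A]
    print(as_escapes(ucs2_stacked(compound)))  # \x30\x4B\x30\x9A
    print(as_escapes(utf8(compound)))          # \xE3\x81\x8B\xE3\x82\x9A
```

Either way the source side becomes a run of octets that looks just like the target side, which is why the <UXXXX> notation keeps the two visibly apart.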

Dan the Encode Maintainer


