On Tuesday, April 2, 2002, at 01:24, Nick Ing-Simmons wrote:
> Dan Kogai <dankogai@dan.co.jp> writes:
>>>>
>>>> I don't like the <UNNNN+UMMMM> part it will make the parsing messier.
>>>>
>>>> The \xYY\xYY is of course what I meant ;-)
>>>
>>> Not that much. It's just a regex after all.
>
> For _perl_ it is but if we are going to get IBM's ICU or others
> to back-port it then it is better to keep things clean.
Point well taken.
> So let us have yacc-like:
>
> from : codepoint
> | from codepoint
> ;
>
> codepoint : '<' 'U' hexdigits '>'
> ;
>
> to : octet
> | to octet
> ;
>
> octet : '\\' 'x' hexdigits
> ;
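To make the grammar above concrete, here is a minimal sketch of a parser for it, in Python rather than yacc. Several details are assumptions and not part of the proposal: that "from" and "to" sit on one line separated by whitespace, that an octet is exactly two hex digits, and the names parse_sequence and parse_mapping, which are purely illustrative.

```python
import re

# Tokens from the grammar. The grammar only says "hexdigits"; limiting
# an octet to exactly two hex digits is an assumption of this sketch.
CODEPOINT = re.compile(r"<U([0-9A-Fa-f]+)>")
OCTET = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_sequence(text, token, what):
    """Parse one or more adjacent tokens (the left-recursive rules)."""
    pos = 0
    values = []
    while pos < len(text):
        match = token.match(text, pos)
        if match is None:
            raise ValueError(f"bad {what} at offset {pos} in {text!r}")
        values.append(int(match.group(1), 16))
        pos = match.end()
    if not values:
        raise ValueError(f"empty {what} sequence")
    return values


def parse_mapping(line):
    """Split a line into (list of codepoints, bytes).

    Assumes the 'from' and 'to' parts are separated by whitespace
    and contain no whitespace themselves.
    """
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 'from to', got {line!r}")
    src, dst = fields
    codepoints = parse_sequence(src, CODEPOINT, "codepoint")
    octets = bytes(parse_sequence(dst, OCTET, "octet"))
    return codepoints, octets
```

With this sketch, a single character such as `<U3000> \xA1\xA1` parses to one codepoint and two octets, and a placeholder compound entry such as `<UAAAA><UBBBB> \xCC\xCC` parses to two codepoints mapped to one octet sequence, with no extra syntax needed on the codepoint side.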
Your suggestion is
\xAA\xAA\xBB\xBB \xCC\xCC
for compound characters and leave
<U3000> \xA1\xA1
for an ordinary single character. Did I get this one correct?
But I would still feel more comfortable keeping a distinction between a
Unicode character (a codepoint, which is not the same thing as its UTF-8
octets) and octets. And as for the octets, which representation do you
think is correct: UCS values simply stacked, or UTF-8?
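For illustration, the two candidate octet representations can be compared with a short Python sketch. It reads "UCS stacked" as big-endian 16-bit UCS-2 units written one after another, which is an assumption about what is meant here, and it only covers BMP codepoints outside the surrogate range (where UTF-16BE and UCS-2 coincide). The helper names as_escapes and octet_forms are made up for this example.

```python
def as_escapes(data: bytes) -> str:
    """Render bytes in the \\xHH notation used in this thread."""
    return "".join(f"\\x{b:02X}" for b in data)


def octet_forms(codepoints):
    """Return both candidate octet spellings for a codepoint sequence.

    'ucs2_stacked' is an assumed reading of "UCS stacked": each
    codepoint as a big-endian 16-bit unit, concatenated. Valid only
    for BMP codepoints outside the surrogate range.
    """
    text = "".join(chr(cp) for cp in codepoints)
    return {
        "ucs2_stacked": as_escapes(text.encode("utf-16-be")),
        "utf8": as_escapes(text.encode("utf-8")),
    }


# U+3000 is two octets when stacked as UCS-2 but three octets in UTF-8.
print(octet_forms([0x3000]))
```

The two forms differ in length as well as content, so a table format that writes Unicode characters as octets would have to state which one it uses.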
Dan the Encode Maintainer