From: Hans Aberg (haberg@math.su.se)
Date: Wed Apr 20 2005 - 18:55:09 CST
At 20:27 -0400 2005/04/20, Tom Emerson wrote:
>UTF-8 is a solution to the problem, though the depth of the automata
>increases and you may end up having to convert your existing UTF-16/32
>buffers to UTF-8 for lexing, then back again, dealing all the while
>with returning correct offsets during error processing. PCRE, for
>example, works in UTF-8, so if you want to use it on a UTF-16 buffer
>you need to convert both ways. A RPITA.
There is no problem using UTF-16/32 directly either, as they will
merely be interpreted as byte sequences. UTF-16 is quite irregular,
and is harder to use because of that. So a translator to UTF-8/32 is
probably preferable. Then UTF-8 will probably win over UTF-32, as
its single-byte sequences are exactly ASCII, carried in the low 7 bits.
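
As a rough illustration of the points above (a sketch only; the sample
string and the helper name units_per_char are made up for this
example), here is how many code units each character takes in the
three encodings, and what a byte-oriented lexer would see:

```python
def units_per_char(text, encoding, unit_size):
    """Number of code units each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) // unit_size for ch in text]

# Hypothetical sample: 'a', e-acute, euro sign, and a musical G clef
# (U+1D11E, which lies outside the Basic Multilingual Plane).
sample = "a\u00e9\u20ac\U0001d11e"

utf8_lengths = units_per_char(sample, "utf-8", 1)       # bytes per character
utf16_lengths = units_per_char(sample, "utf-16-be", 2)  # 16-bit units per character
utf32_lengths = units_per_char(sample, "utf-32-be", 4)  # 32-bit units per character

# In UTF-8, a byte below 0x80 is always an ASCII character standing for
# itself, so filtering on that keeps only the 'a' from the sample.
ascii_only = [b for b in sample.encode("utf-8") if b < 0x80]

# Read as raw bytes, UTF-16 gives no such guarantee: the euro sign
# U+20AC becomes the bytes 0x20 0xAC, and 0x20 is also the ASCII space.
euro_utf16 = "\u20ac".encode("utf-16-be")

print(utf8_lengths, utf16_lengths, utf32_lengths)
print(ascii_only, euro_utf16)
```

UTF-32 is perfectly regular but spends four bytes on every character,
UTF-16 is mostly one unit but occasionally two, and UTF-8 varies in
length while keeping ASCII bytes unambiguous, which is what makes it
convenient for an existing byte-based lexer.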
-- Hans Aberg