Jungshik Shin wrote on 1998-11-26 11:33 UTC:
> > The changes necessary would not be too significant. The major change is
> > that in order to count the number of characters in a string, you have to
> > count the bytes with (x & 0xc0) != 0x80 instead of all bytes. The only
> > other significant problem with UTF-8 is the [] operator in regular
> > expressions, which currently assumes 1 byte = 1 character.
>
> That's what I thought, and Mark Leisher's ucdata would be sufficient
> for the job (although regular expression handling would certainly be
> beyond the range it covers). However, the author of one of several vi
> clones for the Korean EUC encoding (EUC-KR) and JOHAB (another popular
> 1-byte/2-byte encoding for Korean that encodes all modern complete *and*
> _incomplete/partial_ syllables) claimed differently. I'll try to
> figure out .....

This depends on how far your UTF-8 support goes.
Things stay very easy with UTF-8 as long as you use only a level 1
subset of ISO 10646-1 that can be implemented in a simple fixed-width
(monospaced) left-to-right font. An example of such a subset is the set of
2800 characters in the new 6x13 fixed font for X11
<http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html> (minus maybe the Hebrew
characters in there). However, significant structural modifications are
necessary if you want to support bi-width fonts, bidi, combining characters,
Indian and Arabic presentation forms, etc. We should start with simple UTF-8
without these things first; a level-1 fixed-width left-to-right UTF-8
editor can already handle a huge number of languages very adequately.
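
As a rough sketch of the character-counting change discussed in the quoted
text: UTF-8 continuation bytes have the bit pattern 10xxxxxx, so a character
count is the number of bytes with (x & 0xc0) != 0x80. The function names
utf8_strlen and utf8_next below are made up for illustration, and the code
assumes well-formed, NUL-terminated UTF-8 in the simple level-1 case (no
combining characters, no bi-width handling).

```c
#include <stddef.h>

/* Count characters in a NUL-terminated UTF-8 string by counting every
 * byte that is NOT a continuation byte (10xxxxxx).
 * Hypothetical helper; assumes the input is well-formed UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80)
            n++;
    }
    return n;
}

/* Step to the first byte of the next character: move one byte forward,
 * then skip any continuation bytes. Returns s unchanged at end of string.
 * Hypothetical helper; code that assumes 1 byte = 1 character (such as a
 * [] matcher) would step like this instead of using s + 1. */
const char *utf8_next(const char *s)
{
    if (*s != '\0')
        s++;
    while (((unsigned char)*s & 0xc0) == 0x80)
        s++;
    return s;
}
```

For instance, utf8_strlen("\xed\x95\x9c\xea\xb8\x80") sees six bytes but
counts two characters, since each of the two Hangul syllables is encoded as
one lead byte followed by two continuation bytes.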
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>