Re: Unicode support under Linux

From: Jungshik Shin (jshin@pantheon.yale.edu)
Date: Tue Jan 12 1999 - 23:16:43 EST


On Tue, 12 Jan 1999, Markus Kuhn wrote:

> The best convention I can think of is to search for the substring
> "UTF-8" in the environment variable LC_CTYPE, just like emacs is
> activating its 8-bit mode if it finds the string "8859" in LC_CTYPE.

  No, that's not a good idea. All the necessary information is supposed
to be provided by the locale chosen by the user. All that application
programs need to do is call setlocale() and use the appropriate set of
API functions, such as wc*tomb*(), mb*towc*(), and iswxxx() (instead of
isxxx()), when dealing with text.
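
  A minimal sketch of that approach, in C. The helper names
use_user_locale() and count_chars() are made up for illustration and
are not part of any library; only setlocale() and mbrtowc() are
standard calls. The sketch assumes a C library with working multibyte
support and an installed locale for whatever encoding the user chose.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: adopt the character-type locale the user
 * selected through LC_ALL, LC_CTYPE or LANG. A program would call
 * this once at startup. Returns 1 on success, 0 on failure. */
int use_user_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: count characters (not bytes) in a multibyte
 * string, as interpreted by the current LC_CTYPE locale.
 * Returns (size_t)-1 on an invalid or truncated sequence. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t left, n = 0;

    assert(s != NULL);
    left = strlen(s);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    while (left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, left, &st);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (len == 0)
            break;                  /* decoded a NUL; not expected here */
        s += len;
        left -= len;
        n++;
    }
    return n;
}
```

  Nothing in count_chars() mentions UTF-8, EUC or Shift_JIS. Which
encoding mbrtowc() decodes is decided entirely by the locale in
effect, so the same code serves every multibyte encoding the C
library has a locale for.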

> I would like to see a trivial application that has to count characters
> in text strings (e.g., "wc" or "more") to be made correctly UTF-8
> capable, as an example for C programmers to understand how to program
> correctly in a world where 1 byte == 1 character does not hold any more,
> because bytes of the form 10xxxxxx must not be counted as separate
> characters in UTF-8.

 What I wrote above has been done for years by commercial Unix systems
(AIX, Solaris 2.x/7, Digital Unix, IRIX, etc.) for the multibyte
encodings used in East Asia: with appropriate locales installed, just
setting LANG or its friends to the locale of one's choice makes all text
utilities and editors work correctly in that locale.

 As I have said several times (and unfortunately, you don't seem to get
it yet), UTF-8 is nothing other than another multibyte encoding as far
as Unix text manipulation tools are concerned. In most cases it can be
dealt with by the same method used for other multibyte encodings such as
EUC-xx and Shift_JIS, namely the appropriate API calls
(wc*mb*()/mb*wc*() and iconv()). Of course, there are areas where the
Unix locale approach fails miserably in dealing with Unicode. All of
this works once multibyte encoding support is firmly in place in the C
library (glibc, in the case of Linux) and an appropriate locale database
for UTF-8 is provided.
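
 As a rough illustration of the mb*towc*() plus iswxxx() route, here is
a sketch of a "wc -w" style word counter in C. The helper name
count_words() is hypothetical, and the sketch relies on the POSIX
behaviour of mbstowcs() returning the required length when given a null
destination. The program is assumed to have called setlocale()
beforehand so that the user's locale is in effect.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count whitespace-separated words in a
 * multibyte string, using the current LC_CTYPE locale for both
 * decoding and character classification.
 * Returns -1 on an invalid sequence or allocation failure. */
long count_words(const char *s)
{
    size_t n, i;
    wchar_t *w;
    long words = 0;
    int in_word = 0;

    assert(s != NULL);

    /* POSIX: a null destination yields the length in wide characters. */
    n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return -1;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    mbstowcs(w, s, n + 1);          /* convert, including the terminator */

    for (i = 0; i < n; i++) {
        if (iswspace((wint_t)w[i])) {   /* iswspace(), not isspace() */
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    free(w);
    return words;
}
```

 Once the text is in wchar_t form, every element is one character, so
the loop never has to know how many bytes a character took in the
original encoding.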

    Jungshik Shin


