Re: Unicode support under Linux

From: Ulrich Drepper (drepper@cygnus.com)
Date: Tue Jan 12 1999 - 16:09:08 EST


Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:

> The best convention I can think of is to search for the substring
> "UTF-8" in the environment variable LC_CTYPE, just like emacs is
> activating its 8-bit mode if it finds the string "8859" in LC_CTYPE.
>
> What do you think about that approach?

Well, the specified locale will have all the information. Of course
it can be incorporated in the locale name for stupid applications.
But in general the application should not care at all.

> How is glibc 2.1 going to detect whether the character encoding is UTF-8
> or not? Same LC_CTYPE convention?

Every locale is compiled by the localedef program using a given
character set. This information is stored in the locale files and
therefore is available to glibc.
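
For illustration only, a minimal sketch of how a program could ask the
C library which character set the selected locale uses, instead of
parsing the locale name. It assumes the POSIX nl_langinfo(CODESET)
interface; the helper names are hypothetical, and the "UTF-8" spelling
is what glibc reports, which other systems may spell differently.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: name of the character set of the current
   LC_CTYPE locale, as recorded in the compiled locale data. */
const char *locale_codeset(void)
{
    const char *cs = nl_langinfo(CODESET);

    assert(cs != NULL);   /* sanity check only */
    return cs;
}

/* Hypothetical helper: 1 if the locale's codeset is named "UTF-8"
   (the spelling glibc uses; an assumption for other C libraries). */
int locale_is_utf8(void)
{
    return strcmp(locale_codeset(), "UTF-8") == 0;
}
```

A program would call setlocale(LC_CTYPE, "") once at startup so that
the user's environment, rather than the default "C" locale, is the one
being queried.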

> Or should the application call some libc mb* function to test whether
> UTF-8 has been selected somehow via LC_CTYPE?

There shouldn't be a need to explicitly test it. Programs should as
soon as possible convert all text they are dealing with to wide
characters. It does not matter what the encoding here is. Programs
which don't have information about the text they read should assume
the user knows about this and selected the locale correctly. In this
case the mbsrtowcs et al. functions should be used. If there is a
text with a specified charset to be read (e.g., MIME-encoded mail),
functions like iconv() can be used to transform the text.
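
The input side could look roughly like the sketch below, which
converts a string in the locale's encoding to wide characters with
mbsrtowcs. The helper name locale_to_wide and its two-pass allocation
strategy are assumptions made for the example, not part of any
library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a NUL-terminated string in the
   current locale's encoding to a newly allocated wide string.
   Returns NULL on an invalid sequence or allocation failure.
   The caller frees the result. */
wchar_t *locale_to_wide(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    const char *src = s;

    /* First pass: with a NULL destination, mbsrtowcs only counts
       the wide characters needed (terminator not included). */
    memset(&state, 0, sizeof state);
    size_t n = mbsrtowcs(NULL, &src, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert, including the terminating L'\0'. */
    memset(&state, 0, sizeof state);
    src = s;
    if (mbsrtowcs(w, &src, n + 1, &state) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

The function never looks at the name of the encoding. It relies on
the locale selected with setlocale(), so the same code handles UTF-8
and 8-bit charsets alike.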

Once all the text is in the internal format the program can use it
without knowing much about it. When the text has to be written out
(to a file or a terminal, doesn't matter where) the process is simply
reversed. wcsrtombs will be used to write something to a terminal
(since it assumes the charset for the selected locale) and writing in
a specific format can again be achieved by using iconv().
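
A sketch of the output direction under the same assumptions: a
hypothetical helper, wide_to_locale, turns a wide string back into
the locale's multibyte encoding with wcsrtombs before it is written
to a terminal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide string to a newly allocated
   multibyte string in the current locale's encoding. Returns NULL
   if a character cannot be represented in that encoding or if
   allocation fails. The caller frees the result. */
char *wide_to_locale(const wchar_t *w)
{
    assert(w != NULL);

    mbstate_t state;
    const wchar_t *src = w;

    /* First pass: count the bytes needed (terminator not included). */
    memset(&state, 0, sizeof state);
    size_t n = wcsrtombs(NULL, &src, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    char *s = malloc(n + 1);
    if (s == NULL)
        return NULL;

    /* Second pass: convert, including the terminating '\0'. */
    memset(&state, 0, sizeof state);
    src = w;
    if (wcsrtombs(s, &src, n + 1, &state) == (size_t)-1) {
        free(s);
        return NULL;
    }
    return s;
}
```

A failure here means the text contains a character the selected
locale cannot express. What to do then (substitute, skip, or report)
is left to the application.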

> I would like to see a trivial application that has to count characters
> in text strings (e.g., "wc" or "more") to be made correctly UTF-8
> capable, as an example for C programmers to understand how to program
> correctly in a world where 1 byte == 1 character does not hold any more,
> because bytes of the form 10xxxxxx must not be counted as separate
> characters in UTF-8.

As I've said, convert it using the mb*wc functions. They already know
how to make the conversion and with the wide character representation
you can use functions like iswspace() to find out what you need.
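
As a rough sketch of what a "wc"-like counter could do along these
lines, the hypothetical function count_text below decodes a buffer
with mbrtowc and classifies each wide character with iswspace(). The
function name, its signature, and its error handling are assumptions
made for the example.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: counts characters and whitespace-separated
   words in buf[0..len), decoding it according to the current
   LC_CTYPE locale. Returns 0 on success, -1 on an invalid or
   truncated multibyte sequence. */
int count_text(const char *buf, size_t len, size_t *chars, size_t *words)
{
    assert(buf != NULL && chars != NULL && words != NULL);

    mbstate_t state;
    int in_word = 0;

    memset(&state, 0, sizeof state);
    *chars = 0;
    *words = 0;

    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;    /* invalid or truncated sequence */
        if (n == 0)
            n = 1;        /* decoded L'\0'; assume it took one byte */

        (*chars)++;       /* one character, however many bytes */
        if (iswspace((wint_t)wc)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            (*words)++;
        }
        buf += n;
        len -= n;
    }
    return 0;
}
```

After setlocale(LC_CTYPE, "") in a UTF-8 locale, mbrtowc consumes the
10xxxxxx continuation bytes as part of the character they belong to,
so they are not counted separately. The code itself contains nothing
specific to UTF-8.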

If I had the time to finish my xterm rewrite, you could see this
happen within a few days. The difficult thing is the glyph drawing
engine and the input method handling in the xterm.

-- 
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------


