UTF-8, ISO C Am.1, and POSIX

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Tue Aug 12 1997 - 11:37:45 EDT

Keld Jørn Simonsen wrote on 1997-08-10 20:12 UTC:
> We have in the ISO POSIX WG been through all POSIX standards to see
> what changes we should make to the standards to accommodate UCS.

I guess pretty much the only thing the POSIX standard needs for UTF-8
is a standardized way to tell the locale mechanism that the character encoding
in use is UTF-8. UTF-8 is a little more than just another character
table, so there should be a locale flag or something similar that
lets me tell libc that UTF-8 is the encoding in use.

So far, my preliminary trick has been to have libc assume UTF-8 encoding
if the name of the locale matches the pattern "*[uU][tT][fF]-?8*",
in anticipation of what typical UTF-8 based locale names will look like.
Parsing the locale name (LANG, LC_CTYPE, etc.) is probably not a nice
long-term solution, though, even if many applications do it (I think Emacs
checks for the substring 8859 in LANG and LC_CTYPE).
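
For illustration, the locale-name heuristic could be sketched in C roughly
as follows. This is only a minimal sketch, not the actual libc code: the
function name locale_name_is_utf8 is hypothetical, and the sketch assumes the
pattern means "utf8" or "utf-8", in any letter case, anywhere in the name.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any libc): returns 1 if the locale
 * name contains "utf8" or "utf-8" in any letter case, otherwise 0.
 */
int locale_name_is_utf8(const char *name)
{
    const char *p;

    if (name == NULL)
        return 0;

    for (p = name; *p != '\0'; p++) {
        const char *q = p;

        /* Each comparison fails on the terminating NUL, so q never
         * runs past the end of the string. */
        if (tolower((unsigned char)*q) != 'u')
            continue;
        q++;
        if (tolower((unsigned char)*q) != 't')
            continue;
        q++;
        if (tolower((unsigned char)*q) != 'f')
            continue;
        q++;
        if (*q == '-')          /* the hyphen is optional */
            q++;
        if (*q == '8')
            return 1;
    }
    return 0;
}
```

A caller would pass in whatever name the locale mechanism resolved, for
example the value of LC_CTYPE or LANG. Names such as "en_US.UTF-8" and
"de_DE.utf8" match, while "en_US.ISO8859-1" does not.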

What is the state of standardization with regard to specifying in a
locale that we use UTF-8? What does enUS.UTF-8 look like?

It might also be useful if POSIX clarified how all the new
ISO C Am.1 functions for wide streams and multibyte strings work in
detail when the UTF-8 encoding has been selected in the locale. The
ISO C standard does not talk about UTF-8, and the multibyte string
concept is pretty abstract, so I feel implementors will have problems
independently coming up with compatible UTF-8 implementations of all the
ISO C Am.1 functions.
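
To make the compatibility concern concrete, here is a rough sketch of the
decoding step that a UTF-8 aware multibyte-to-wide conversion would have to
perform for each character. It is not taken from any libc. The function name
utf8_decode_one and its return convention are hypothetical, and the sketch
assumes the present-day definition of UTF-8 (sequences of at most four bytes,
no overlong forms, no surrogates), which is exactly the kind of detail that
implementations would need to agree on.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a libc function. Decodes one UTF-8 sequence
 * from s[0..n-1] and stores the code point in *out.
 * Returns the number of bytes consumed (1..4), 0 if the sequence is
 * incomplete and more input is needed, or -1 if it is invalid.
 */
int utf8_decode_one(const unsigned char *s, size_t n, unsigned long *out)
{
    unsigned long cp, min;
    int len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                     /* single-byte (ASCII) */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {    /* 110xxxxx: 2 bytes */
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {    /* 1110xxxx: 3 bytes */
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {    /* 11110xxx: 4 bytes */
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return -1;   /* stray continuation byte or unsupported lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((size_t)i >= n)
            return 0;                      /* input ends mid-sequence */
        if ((s[i] & 0xC0) != 0x80)
            return -1;                     /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    *out = cp;
    return len;
}
```

The three-way result (consumed, incomplete, invalid) loosely mirrors what
mbrtowc reports through its return value, but how a real implementation
should treat overlong or truncated sequences is the sort of question a
standard would have to settle.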

I'd be very interested in any work that has already been done in this
field, so that we do not have to reinvent any wheels for Linux.


Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu
