Re: UTF-8 and POSIX

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Sat Jun 26 1999 - 12:46:16 EDT


On Wed, Jun 23, 1999 at 12:36:48PM -0700, Markus Kuhn wrote:
> Keld Jørn Simonsen wrote on 1999-06-23 17:40 UTC:
> > On Wed, Jun 23, 1999 at 07:37:15AM -0700, Markus Kuhn wrote:
> > > Is there any work going on to review the POSIX.1 and POSIX.2 standards
> > > systematically to add proper UTF-8 support? For instance,
> > > the terminal driver can be set into a "cooked" mode where a
> > > single-line editing mechanism is applied before sending a line to an
> > > application, and the implementation of the erase function there has to
> > > know how many bytes to remove when a character is erased, which makes a
> > > difference between UTF-8 and ISO 8859-1 for instance. There should be a
> > > standard way to tell the terminal that it is in UTF-8 mode and has to
> > > perform character erase actions accordingly.
> >
> > Hmm, why should UTF-8 support differ here from say EUC support?
> > The support should be there already.
>
> I see neither EUC nor UTF-8 support in any POSIX document for system
> calls such as tcsetattr() that would allow me to tell the terminal in
> c_lflag|ICANON mode how many bytes to remove when it receives an ERASE
> character. I don't care much about EUC support, because this is not an
> ISO standard, but UTF-8 is one and should be fully and consistently
> supported here IMHO.

I am not aware of specific support for this in the POSIX standards.
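
The erase problem described in the quoted text can be made concrete with a
small user-space sketch. It is only an illustration under stated
assumptions, not how any kernel line discipline is implemented: the function
names erase_last_char and cooked_line are hypothetical, and the ERASE
character is assumed to be DEL (0x7F), whereas a real terminal takes it from
c_cc[VERASE]. In UTF-8 mode the sketch keeps removing continuation bytes
(bit pattern 10xxxxxx) until the lead byte is gone; in byte mode it removes
exactly one byte, which is correct for ISO 8859-1 but leaves a dangling lead
byte when the input is UTF-8.

```python
ERASE = 0x7F  # assumption: DEL as the ERASE character (really c_cc[VERASE])


def erase_last_char(line: bytearray, utf8_mode: bool) -> int:
    """Remove the last character from `line` in place.

    Returns the number of bytes removed (0 if the line was empty).
    """
    removed = 0
    while line:
        byte = line.pop()
        removed += 1
        # Byte mode: one byte is one character, so stop immediately.
        # UTF-8 mode: continuation bytes look like 10xxxxxx; stop only
        # after a byte that is NOT a continuation byte has been removed.
        if not utf8_mode or (byte & 0xC0) != 0x80:
            break
    return removed


def cooked_line(raw: bytes, utf8_mode: bool) -> bytes:
    """Apply ERASE editing to a stream of typed bytes (hypothetical helper)."""
    line = bytearray()
    for byte in raw:
        if byte == ERASE:
            erase_last_char(line, utf8_mode)
        else:
            line.append(byte)
    return bytes(line)
```

For example, typing "Jø" in UTF-8 followed by one ERASE yields b"J" when
utf8_mode is true, but b"J\xc3" (a broken sequence) when it is false. The
same keystrokes in ISO 8859-1 yield b"J" in byte mode.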

> Vendors are setting up proprietary and non-portable solutions to work
> around such deficiencies in the POSIX standard regarding UTF-8. For
> example (quoting from an email from Tomas Vanhala
> <vanhala@ling.helsinki.fi>):
>
> I am curious about this, because at least on Solaris 7, it is also
> possible to utilize the UTF-8 locale support built into the OS.
>
> If you go to http://docs.sun.com/, choose the "Solaris 7 Software
> Developer Collection" and then the "Solaris Internationalization Guide
> For Developers", you will find that the document contains a section
> titled "Overview of en_US.UTF-8 Locale Support". The paragraph
> "TTY Environment Setup" of the subsection "System Environment"
> explains some UTF-8 specific STREAMS modules, e.g.
>
> /usr/kernel/strmod/eucu8 UTF-8 STREAMS module for tail side
> /usr/kernel/strmod/u8euc UTF-8 STREAMS module for head side
>
> Further down on the page, it is stated that:
>
> The dtterm(1) and any terminal that supports input and output of the
> UTF-8 codeset should have the following STREAMS configuration:
>
> head <-> ttcompat <-> u8euc <-> ldterm <-> eucu8 <-> pseudo-TTY
>
> This can be set up with the strchg(1) user-level program, if the
> appropriate kernel modules have been loaded.
>
> Is this really specified by POSIX?

Not to my knowledge.

> The Linux version of stty and the tty driver in the kernel are currently
> being extended to accommodate UTF-8. Unfortunately, POSIX.1:1996
> does not give us any guidance on how to do this in a portable way. (See
> <ftp://ftp.ilog.fr/pub/Users/haible/utf8/> for the patches.)
>
> > We have in WG20 enhanced the locale syntax to be able to cater for
> > ISO 10646 in the forthcoming ISO/IEC 14652 TR.
>
> Very interesting! URL???

http://www.dkuug.dk/jtc1/sc22/wg20/ and then see under 14652.

> > UTF-8 does not need to be implemented as a charmap, it could be
> > implemented as something special.
>
> If there is now really a new syntax defined to activate this "something
> special" in the locale definition files, then I am very happy to hear
> that and I am looking forward to seeing the details.

There is no such new syntax for defining things like UTF-8.

> > > Does anyone know the current status of UTF-8 and POSIX?
> >
> > I wrote a paper on 10646 support for WG15, which is now
> > included in the current draft of TR 14766. Its basic idea was to use UTF-8
> > as a standard in all POSIX standards.
>
> I know of
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/iso-tr-14766.txt
>
> which I had to dig out, with some Emacs artistry, of a proprietary word
> processing file format found on
>
> http://anubis.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp

That paper was also available in .txt format from the www.dkuug.dk site, URL:
http://anubis.dkuug.dk/jtc1/sc22/wg15/iso14766/15

> Hm, but this contains not much that wasn't already obvious from the old
> USENIX Pike/Thompson Plan9/FSS-UTF paper in
>
> ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz
>
> Is there an updated version of your paper available that also covers new
> less obvious stuff such as non-charmap processing in locale
> specifications and tcsetattr() kernel terminal driver configuration for
> UTF-8?

No, the paper you have referred to is the latest issue.

Keld


