From: Hans Aberg (haberg@math.su.se)
Date: Sun Nov 27 2005 - 11:45:23 CST
On 27 Nov 2005, at 16:03, Marcin 'Qrczak' Kowalczyk wrote:
> A common problem of programming languages which use Unicode for
> all their strings (either in the form of code points or UTF-16) is
> interfacing with Unix APIs based on byte strings, and representing
> filenames, environment variables, program invocation arguments etc.
> in the program.
>
> From the point of view of the OS they are arbitrary byte strings,
> usually excluding only NUL. From the point of view of the user they
> are generally meant to be interpreted as text. Their encoding is
> implicit; the locale setting provides a reasonable default. But even
> if the encoding is intended to be UTF-8, the OS doesn't enforce that it
> is valid UTF-8. It's rare when filenames are not valid in the selected
> encoding, and most filenames are ASCII, so only very rare cases are
> truly problematic.
>
> How to convert these byte strings to Unicode?
This problem has recently been discussed on the POSIX/UNIX
standardization list (Austin Group List, http://www.opengroup.org/austin/).
It is probably best resolved there, because one needs to find an
efficient solution for a UTF-8 enabled UNIX OS, and in doing so one
has to take into account things such as how to implement efficient
filesystems.

One possible approach might be to ensure that any byte string can be
represented at the filesystem level, with suitable UTF-8 encodings
for use in text strings (with the property that they can be lifted
back to the original byte strings), which may vary from context to
context. This approach would be motivated by the fact that almost all
filesystems already work this way, and that it would be inefficient
to burden them with character interpretation schemes. But some
filesystems, though they seem to be rare, use a different approach.

And when fiddling around with this, one needs to assess its effect on
the UNIX OS as a whole, probably by making some implementations first.
In the meantime, I figure you can invent the encoding scheme that
best fits your needs.
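
As a rough illustration of an encoding that can be lifted back to the
original byte string, here is a minimal sketch in Python. It relies on
Python's built-in "surrogateescape" error handler, which maps each
undecodable byte to a lone surrogate code point and reverses that on
encoding. This is only one possible scheme, not something settled on the
Austin Group list, and the function names bytes_to_text and
text_to_bytes are made up for the example.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode a byte string as UTF-8, keeping invalid bytes recoverable.

    Bytes that are not valid UTF-8 become lone surrogates in the range
    U+DC80..U+DCFF instead of raising an error.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: restore the original byte string."""
    return text.encode("utf-8", errors="surrogateescape")


# A filename-like byte string containing two bytes that are not valid UTF-8.
raw_name = b"report-\xff\xfe.txt"

text_name = bytes_to_text(raw_name)

# The round trip gives back exactly the bytes we started with.
assert text_to_bytes(text_name) == raw_name
```

The resulting string is not valid Unicode text for interchange (it holds
lone surrogates), so such strings are best kept inside the program and
converted back before being handed to the OS.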
Hans Aberg