From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Nov 27 2005 - 09:03:06 CST
Hello.
A common problem of programming languages which use Unicode for
all their strings (either in the form of code points or UTF-16) is
interfacing with Unix APIs based on byte strings, and representing
filenames, environment variables, program invocation arguments etc.
in the program.
From the point of view of the OS they are arbitrary byte strings,
usually excluding only NUL. From the point of view of the user they
are generally meant to be interpreted as text. Their encoding is
implicit; the locale setting provides a reasonable default. But even
if the encoding is intended to be UTF-8, the OS doesn't enforce that
the bytes are valid UTF-8. Filenames are rarely invalid in the selected
encoding, and most filenames are ASCII, so only very rare cases are
truly problematic.
How should these byte strings be converted to Unicode? Here are
various solutions:
1. Don't convert: keep them as byte strings. Example: Python.
Problems: the coexistence of Unicode strings and byte strings means
that many places in the program must deal with this duality. The
program can't easily embed a filename in text output to the user,
and can't use functions which manipulate Unicode strings on
filenames. The API is a poor match for Windows, where UTF-16
filenames map more easily to Unicode than to byte strings. Since
most programmers expect to be able to use native strings as
filenames, especially given that most filenames are ASCII, the API
grows duplicates which differ only in the Unicodeness of their
strings, e.g. Python has os.getcwd() and os.getcwdu(). Constantly
working with two string types is ugly.
2. Represent filenames as opaque objects of system-dependent nature.
Example: Common Lisp.
This requires a very large API for all imaginable filename
manipulations, some of which are unportable by nature anyway.
It doesn't help with including filenames in textual output or mixing
textual input with filenames: some conversion between filenames and
strings is still needed, so while passing a filename from one OS
function to another is smooth, looking inside filenames and
composing filenames from parts is not, even if they are ASCII.
3. Represent ordinary strings in UTF-8, expose this representation to
the programmer, but don't enforce validity all the time. Filenames
use the same representation as other strings. Example: GNOME.
UTF-8 which doesn't have to be valid makes all Unicode algorithms
specified in terms of code points harder to implement: not only must
they deal with a variable-length encoding, but they must also
decide how to behave on invalid input. Whether a string is meant
to be UTF-8 is ambiguous; too often it will be something else.
String transformation functions (e.g. case mapping) are quite
unreliable.
4. Assume that filenames are encoded in ISO-8859-1. Example: Perl
(if its byte strings are interpreted as ISO-8859-1; Perl is
ambiguous here, there are various possible interpretations
depending on which packages are used).
This causes non-ASCII filenames to be misrendered when shown to
the user, and causes problems when filenames are input interactively.
5. Convert the strings to Unicode and throw exceptions on invalid byte
sequences.
The program can't process some files and might fail in unexpected
places.
6. Convert the strings to Unicode and silently replace invalid byte
sequences with U+FFFD. Example: Java (Sun).
Filenames might silently get corrupted.
7. Invent and use a UTF-8-compatible scheme for escaping arbitrary
byte sequences to store them in Unicode strings. Mono has recently
started doing this for some low-level functions:
http://lists.ximian.com/pipermail/mono-devel-list/2005-October/015422.html
(the escape character has since been changed from U+FFFF to U+0000).
It uses a non-standard encoding which coincides with UTF-8 for most
data, which may cause confusion. Passing filenames to parts of the
program written in a different language which uses Unicode but
doesn't use this convention will break. Trying to show the filename
on a medium which doesn't convert it back to a byte string (e.g. in
a GUI) will work poorly or not at all. It's not clear how to decide
when to use this encoding and when to use true UTF-8.
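To make the trade-offs of 4, 5 and 6 concrete, here is a small sketch.
It is only an illustration: it uses the bytes/str decoding API of
present-day Python 3 rather than anything the languages named above
actually do internally, and the sample filename is made up.

```python
# A filename as the OS might return it: "cafe" with e-acute encoded in
# ISO-8859-1. The lone byte 0xE9 makes it invalid as UTF-8.
raw_name = b"caf\xe9.txt"

# The same name as a UTF-8 locale would normally store it.
utf8_name = "caf\u00e9.txt".encode("utf-8")

# Option 4: assume ISO-8859-1. Decoding never fails and round-trips,
# but a UTF-8 name is shown to the user as mojibake.
as_latin1 = utf8_name.decode("iso-8859-1")
latin1_round_trip = as_latin1.encode("iso-8859-1")

# Option 5: strict conversion. The invalid name raises, so the program
# cannot process this file at all.
try:
    raw_name.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 6: replace invalid bytes with U+FFFD. Decoding succeeds, but
# converting back gives a different byte string, i.e. a different file.
replaced = raw_name.decode("utf-8", errors="replace")
replaced_round_trip = replaced.encode("utf-8")
```

With option 6 the name that comes back no longer refers to the file it
was read from, which is the silent corruption mentioned above.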
Anything else?
I'm especially interested in what you think about 7, because this is
what I'm considering adopting for my language Kogut. I used to do 5,
which meant that programs written in a straightforward way could not
process some filenames in UTF-8 locales.
Details of how I have implemented 7 for now:
- This encoding is chosen as the default encoding if the locale says
UTF-8 and a special environment variable is set (KO_UTF8_ESCAPED_BYTES=1).
The default encoding is used for things like filenames, and
for converting stream contents if the encoding is not specified
explicitly. If the program requests UTF-8 explicitly, it will
always get the true UTF-8.
Note: Mono uses this for a subset of functions, and it's the
only encoding; it doesn't depend on the locale. I use it for most
functions exchanging strings with C, but the encoding defaults
to the locale encoding.
- When converting from Unicode to byte strings, only U+0000 followed
by another U+0000 or by a character between U+0080 and U+00FF is
treated as an escape. U+0000 followed by any other character is an
error.
- This encoding has the properties that any byte string converted
to Unicode will convert back to the same byte string, that any valid
UTF-8 byte string not containing 0x00 will be converted to the
same Unicode string as with UTF-8, and that any Unicode string not
containing U+0000 will be converted to the same byte string as with
UTF-8.
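A rough Python sketch of such an encoding may make the rules easier to
check. It is not the Kogut or Mono code. The function names are
hypothetical, and the decoding direction is my reading of the rules
above: a byte that is not part of a valid UTF-8 sequence becomes U+0000
followed by the character with the same number, and a 0x00 byte becomes
two U+0000 characters. It relies on Python 3's strict UTF-8 decoder to
decide what counts as valid.

```python
def decode_escaped(data: bytes) -> str:
    """Bytes -> str; bytes that are not valid UTF-8 are escaped with U+0000."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x00:
            out.append("\0\0")          # assumed: NUL byte -> U+0000 U+0000
            i += 1
        elif b < 0x80:
            out.append(chr(b))          # plain ASCII
            i += 1
        else:
            # Try a 2, 3 or 4 byte sequence starting at position i.
            for length in (2, 3, 4):
                try:
                    ch = data[i:i + length].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                out.append(ch)
                i += length
                break
            else:
                # Assumed: invalid byte -> U+0000 + U+0080..U+00FF.
                out.append("\0" + chr(b))
                i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Str -> bytes; U+0000 escapes are turned back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\0":
            out += ch.encode("utf-8")   # ordinary character: true UTF-8
            i += 1
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == "\0":
            out.append(0x00)
        elif "\x80" <= nxt <= "\xff":
            out.append(ord(nxt))
        else:
            raise ValueError("U+0000 not followed by U+0000 or U+0080..U+00FF")
        i += 2
    return bytes(out)
```

For example, decode_escaped(b"caf\xe9") gives "caf" followed by U+0000
and U+00E9, and encode_escaped turns that back into the original four
bytes, while a valid UTF-8 name passes through both functions exactly
as it would with plain UTF-8.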
-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/