Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Nov 27 2005 - 09:03:06 CST

Hello.

A common problem for programming languages which use Unicode for
all their strings (either in the form of code points or UTF-16) is
interfacing with Unix APIs based on byte strings, and representing
filenames, environment variables, program invocation arguments etc.
in the program.

From the point of view of the OS they are arbitrary byte strings,
usually excluding only NUL. From the point of view of the user they
are generally meant to be interpreted as text. Their encoding is
implicit; the locale setting provides a reasonable default. But even
if the encoding is intended to be UTF-8, the OS doesn't enforce that
the names are valid UTF-8. Filenames are rarely invalid in the selected
encoding, and most filenames are ASCII, so only very rare cases are
truly problematic.

How to convert these byte strings to Unicode? Here are various
solutions:

1. Don't convert: keep them as byte strings. Example: Python.

   Problems: coexisting Unicode strings and byte strings means that
   many places of the program must deal with this duality. The program
   can't easily embed a filename in a text output to the user, can't
   use functions which manipulate Unicode strings on filenames. The
   API is a poor match for Windows where UTF-16 more easily maps to
   Unicode than to byte strings. Since most programmers expect to be
   able to use native strings as filenames, especially given that
   most filenames are ASCII, the API grows duplicates which differ
   only in whether they use Unicode strings, e.g. Python has
   os.getcwd() and os.getcwdu(). Constantly working with two string
   types is ugly.

2. Represent filenames as opaque objects of system-dependent nature.
   Example: Common Lisp.

   It requires a very large API for all imaginable filename
   manipulations, some of which are unportable by nature anyway.
   Doesn't help with including filenames in textual output or mixing
   textual input with filenames: it still needs some conversion
   between filenames and strings, so while passing a filename between
   one OS function and another is smooth, looking inside filenames
   and composing filenames from parts is not, even if they are ASCII.

3. Represent ordinary strings in UTF-8, expose this representation to
   the programmer, but don't enforce validity all the time. Filenames
   use the same representation as other strings. Example: GNOME.

   UTF-8 which doesn't have to be valid makes implementing all Unicode
   algorithms specified in terms of code points harder: not only must
   they deal with a variable-length encoding, but they must also
   decide how to behave for invalid input. Whether a string is meant
   to be UTF-8 is ambiguous; too often it will be something else.
   String transformation functions (e.g. case mapping) are quite
   unreliable.

4. Assume that filenames are encoded in ISO-8859-1. Example: Perl
   (if its byte strings are interpreted as ISO-8859-1; Perl is
   ambiguous here, there are various possible interpretations
   depending on which packages are used).

   This causes non-ASCII filenames to be misrendered when shown to
   the user, and causes problems when filenames are input interactively.

5. Convert the strings to Unicode and throw exceptions on invalid byte
   sequences.

   The program can't process some files and might fail in unexpected
   places.

6. Convert the strings to Unicode and silently replace invalid byte
   sequences with U+FFFD. Example: Java (Sun).

   Filenames might silently get corrupted.

7. Invent and use a UTF-8-compatible scheme for escaping arbitrary
   byte sequences to store them in Unicode strings. This is what Mono
   has recently started doing for some low-level functions:
   http://lists.ximian.com/pipermail/mono-devel-list/2005-October/015422.html
   (the escape character has since been changed from U+FFFF to U+0000).

   It uses a non-standard encoding which coincides with UTF-8 for most
   data, which may cause confusion. Passing filenames to parts of the
   program written in a different language which uses Unicode but
   doesn't use this convention will break. Trying to show the filename
   on a medium which doesn't convert it back to a byte string (e.g. in
   a GUI) will work poorly or not at all. It's not clear how to decide
   when to use this encoding and when to use true UTF-8.
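
The failure modes of options 4, 5 and 6 can be sketched in Python,
using its built-in decoders as a stand-in for what a language runtime
might do when it receives a filename from the OS. The function names
are made up for illustration, and the sample filenames are assumptions,
not taken from any real system.

```python
# Hypothetical helpers illustrating options 4, 5 and 6 on raw filename bytes.

def decode_as_latin1(name: bytes) -> str:
    """Option 4: pretend every filename is ISO-8859-1 (never fails)."""
    return name.decode("iso-8859-1")


def decode_strict(name: bytes) -> str:
    """Option 5: decode as UTF-8 and raise on invalid byte sequences."""
    return name.decode("utf-8")


def decode_lossy(name: bytes) -> str:
    """Option 6: decode as UTF-8, replacing invalid bytes with U+FFFD."""
    return name.decode("utf-8", errors="replace")


if __name__ == "__main__":
    utf8_name = "é.txt".encode("utf-8")   # a valid UTF-8 filename
    bad_a = b"a\xff"                      # two distinct names, both invalid UTF-8
    bad_b = b"a\xfe"

    # Option 4: a valid UTF-8 name is shown to the user as mojibake.
    print(repr(decode_as_latin1(utf8_name)))

    # Option 5: the program fails on a name the OS happily accepts.
    try:
        decode_strict(bad_a)
    except UnicodeDecodeError as exc:
        print("cannot process:", exc)

    # Option 6: distinct names collapse to the same string, and the
    # original bytes cannot be recovered from it.
    print(decode_lossy(bad_a) == decode_lossy(bad_b))
```

In this sketch option 4 round-trips the bytes but shows the wrong text,
option 5 shows the right text or nothing at all, and option 6 always
shows something but can no longer name the file it came from.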

Anything else?

I'm especially interested in what you think about 7, because this is
what I'm considering adopting for my language Kogut. I used to do 5,
which meant that programs written in a straightforward way could not
process some filenames in UTF-8 locales.

Details of how I actually did 7 for now:

- This encoding is chosen as the default encoding if the locale says
  UTF-8 and a special environment variable is set
  (KO_UTF8_ESCAPED_BYTES=1).

  The default encoding is used for things like filenames, and
  for converting stream contents if the encoding is not specified
  explicitly. If the program requests UTF-8 explicitly, it will
  always get the true UTF-8.

  Note: Mono uses this for a subset of functions, and there it is the
  only encoding; it doesn't depend on the locale. I use it for most
  functions exchanging strings with C, but the encoding defaults
  to the locale encoding.

- When converting from Unicode to byte strings, only U+0000 followed
  by another U+0000 or by a character between U+0080 and U+00FF gets
  converted. U+0000 followed by any other character is an error.

- This encoding has the properties that any byte string converted
  to Unicode will convert back to the same byte string, that any valid
  UTF-8 byte string not containing 0x00 will be converted to the
  same Unicode string as with UTF-8, and that any Unicode string not
  containing U+0000 will be converted to the same byte string as with
  UTF-8.
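
The conversion rules and the three properties above can be sketched in
Python as follows. This is only an illustration under assumptions, not
the Kogut or Mono implementation: the bytes-to-Unicode direction is
inferred from the reverse rule (a 0x00 byte becomes U+0000 U+0000, and
each byte that is not part of a valid UTF-8 sequence becomes U+0000
followed by the character with the same value), and the function names
are made up.

```python
# Sketch of a U+0000-based escaping scheme layered on UTF-8.
# Assumption: NUL and invalid bytes are escaped one byte at a time.

def _utf8_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, or 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte never used as a lead in UTF-8


def decode_escaped(data: bytes) -> str:
    """Byte string -> Unicode; never fails."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0:
            out.append("\x00\x00")          # escaped NUL
            i += 1
            continue
        n = _utf8_length(byte)
        if n:
            try:
                # Strict decode rejects truncated, overlong and
                # surrogate sequences.
                out.append(data[i:i + n].decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass
        out.append("\x00" + chr(byte))      # escaped invalid byte (0x80..0xFF)
        i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Unicode -> byte string; U+0000 must start a valid escape pair."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")
            i += 1
            continue
        nxt = ord(text[i + 1]) if i + 1 < len(text) else -1
        if nxt == 0 or 0x80 <= nxt <= 0xFF:
            out.append(nxt)                 # the escaped byte itself
            i += 2
        else:
            raise ValueError("U+0000 not followed by U+0000 or U+0080..U+00FF")
    return bytes(out)
```

With these definitions, encode_escaped(decode_escaped(b)) gives back b
for any byte string b, while a string such as "\x00A" is rejected by
encode_escaped. Escaping one byte at a time is a design choice of this
sketch; it keeps the round trip trivial at the cost of longer strings
for heavily invalid input.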

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
