Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Nov 27 2005 - 09:03:06 CST


    Hello.

    A common problem of programming languages which use Unicode for
    all their strings (either in the form of code points or UTF-16) is
    interfacing with Unix APIs based on byte strings, and representing
    filenames, environment variables, program invocation arguments etc.
    in the program.

    From the point of view of the OS they are arbitrary byte strings,
    usually excluding only NUL. From the point of view of the user they
    are generally meant to be interpreted as text. Their encoding is
    implicit; the locale setting provides a reasonable default. But even
    if the encoding is intended to be UTF-8, the OS doesn't enforce that
    it is valid UTF-8. Filenames are rarely invalid in the selected
    encoding, and most filenames are ASCII, so only very rare cases are
    truly problematic.

    How to convert these byte strings to Unicode? Here are various
    solutions:

    1. Don't convert: keep them as byte strings. Example: Python.

       Problems: coexisting Unicode strings and byte strings mean that
       many places of the program must deal with this duality. The program
       can't easily embed a filename in text output to the user, and can't
       use functions which manipulate Unicode strings on filenames. The
       API is a poor match for Windows, where UTF-16 maps more easily to
       Unicode than to byte strings. Since most programmers expect to be
       able to use native strings as filenames, especially given that
       most filenames are ASCII, the API grows duplicates which differ
       in unicodedness of strings, e.g. Python has os.getcwd() and
       os.getcwdu(). Constantly working with two string types is ugly.

    2. Represent filenames as opaque objects of system-dependent nature.
       Example: Common Lisp.

       It requires a very large API for all imaginable filename
       manipulations, some of which are unportable by nature anyway.
       It doesn't help with including filenames in textual output or mixing
       textual input with filenames: it still needs some conversion
       between filenames and strings, so while passing a filename between
       one OS function and another is smooth, looking inside filenames
       and composing filenames from parts is not, even if they are ASCII.

    3. Represent ordinary strings in UTF-8, expose this representation to
       the programmer, but don't enforce validity all the time. Filenames
       use the same representation as other strings. Example: GNOME.

       UTF-8 which doesn't have to be valid makes implementing all Unicode
       algorithms specified in terms of code points harder: not only must
       they deal with a variable-length encoding, but they must also
       decide how to behave for invalid input. Whether a string is meant
       to be UTF-8 is ambiguous; too often it will be something else.
       String transformation functions (e.g. case mapping) are quite
       unreliable.

    4. Assume that filenames are encoded in ISO-8859-1. Example: Perl
       (if its byte strings are interpreted as ISO-8859-1; Perl is
       ambiguous here, there are various possible interpretations
       depending on which packages are used).

       This causes non-ASCII filenames to be misrendered when shown to
       the user, and causes problems when filenames are input interactively.

    5. Convert the strings to Unicode and throw exceptions on invalid byte
       sequences.

       The program can't process some files and might fail in unexpected
       places.

    6. Convert the strings to Unicode and silently replace invalid byte
       sequences with U+FFFD. Example: Java (Sun).

       Filenames might silently get corrupted.

    7. Invent and use a UTF-8-compatible scheme for escaping arbitrary
       byte sequences to store them in Unicode strings. This is what Mono
       has recently started doing for some low-level functions:
       http://lists.ximian.com/pipermail/mono-devel-list/2005-October/015422.html
       (the escape character has since been changed from U+FFFF to U+0000).

       It uses a non-standard encoding which coincides with UTF-8 for most
       data, which may cause confusion. Passing filenames to parts of the
       program written in a different language which uses Unicode but
       doesn't use this convention will break. Trying to show the filename
       on a medium which doesn't convert it back to a byte string (e.g. in
       a GUI) will work poorly or not at all. It's not clear how to decide
       when to use this encoding and when to use true UTF-8.
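
    To make the trade-offs of 4, 5 and 6 concrete, here is a minimal
    sketch in Python using only the standard str/bytes codecs. The
    filename bytes are made up for illustration; they stand for a name
    written in ISO-8859-1 and later read in a UTF-8 locale.

```python
# A hypothetical filename: "caf<e-acute>.txt" in ISO-8859-1, not valid UTF-8.
raw = b"caf\xe9.txt"

# Option 4: ISO-8859-1 never fails and round-trips every byte string...
latin1 = raw.decode("iso-8859-1")
assert latin1.encode("iso-8859-1") == raw

# ...but a name that really is UTF-8 gets misrendered for the user.
utf8_name = "caf\u00e9.txt".encode("utf-8")      # b"caf\xc3\xa9.txt"
misrendered = utf8_name.decode("iso-8859-1")     # two characters instead of one

# Option 5: strict decoding raises, so the program cannot handle the file.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 6: replacing with U+FFFD always succeeds but loses information:
# the original bytes cannot be recovered, and distinct names collide.
replaced = raw.decode("utf-8", errors="replace")
other = b"caf\xe8.txt".decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8") != raw
collision = replaced == other
```

    In this sketch only option 4 gives back the original bytes, and only
    at the price of showing the wrong characters for genuine UTF-8 names.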

    Anything else?

    I'm especially interested in what you think about 7, because this is
    what I'm considering adopting for my language Kogut. I used to do 5,
    which meant that programs written in a straightforward way could not
    process some filenames in UTF-8 locales.

    Details of how I currently implement 7:

    - This encoding is chosen as the default encoding if the locale says
      UTF-8 and a special environment variable is set (KO_UTF8_ESCAPED_BYTES=1).

      The default encoding is used for things like filenames, and
      for converting stream contents if the encoding is not specified
      explicitly. If the program requests UTF-8 explicitly, it will
      always get the true UTF-8.

      Note: Mono uses this for a subset of functions, where it's the
      only encoding and doesn't depend on the locale. I use it for most
      functions exchanging strings with C, but the encoding defaults
      to the locale encoding.

    - When converting from Unicode to byte strings, only U+0000 followed
      by another U+0000 or by a character between U+0080 and U+00FF gets
      converted. U+0000 followed by any other character is an error.

    - This encoding has the properties that any byte string converted
      to Unicode will convert back to the same byte string, that any valid
      UTF-8 byte string not containing 0x00 will be converted to the
      same Unicode string as with UTF-8, and that any Unicode string not
      containing U+0000 will be converted to the same byte string as with
      UTF-8.
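
    The rules and properties above can be sketched in Python as follows.
    This is not the Kogut or Mono implementation. The function names
    decode_escaped and encode_escaped are hypothetical, and the decoding
    direction is an assumption inferred from the stated properties: byte
    0x00 becomes U+0000 U+0000, and each byte which is not part of a valid
    UTF-8 sequence becomes U+0000 followed by the character with the same
    number (U+0080 to U+00FF).

```python
def decode_escaped(data: bytes) -> str:
    """Bytes -> Unicode. Assumed rule: valid UTF-8 decodes as usual;
    0x00 and stray non-UTF-8 bytes are escaped with a leading U+0000."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x00:
            out.append("\x00\x00")        # escape the escape character
            i += 1
        elif b < 0x80:
            out.append(chr(b))            # plain ASCII
            i += 1
        else:
            # Try the 2-, 3- and 4-byte UTF-8 forms starting at this byte.
            for length in (2, 3, 4):
                try:
                    ch = data[i:i + length].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                out.append(ch)
                i += length
                break
            else:
                out.append("\x00" + chr(b))   # invalid byte: escape it
                i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Unicode -> bytes, following the rule stated in the post."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("U+0000 at end of string")
        nxt = ord(text[i + 1])
        if nxt == 0x00 or 0x80 <= nxt <= 0xFF:
            out.append(nxt)               # the escaped byte itself
        else:
            raise ValueError("U+0000 followed by an unexpected character")
        i += 2
    return bytes(out)


# The three properties, checked on a few made-up inputs.
samples = [b"caf\xe9.txt", b"\x00", b"\xff\xfe", b"\xe2\x82", b"\xc0\x80",
           "za\u017c\u00f3\u0142\u0107".encode("utf-8")]
assert all(encode_escaped(decode_escaped(s)) == s for s in samples)
assert decode_escaped(b"\xc5\xbc") == b"\xc5\xbc".decode("utf-8")
assert encode_escaped("\u017c") == "\u017c".encode("utf-8")
```

    Because every U+0000 produced by decoding is followed by either
    U+0000 or an escaped byte, the two directions cannot be confused with
    each other, which is what makes the round trip exact in this sketch.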

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

