Re: Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Nov 29 2005 - 18:47:14 CST


    "Chris Jacobs" <chris.jacobs@xs4all.nl> writes:

    >> So how do you propose to map filenames to strings on Unix?
    >>
    >> I'm asking from the point of view of a runtime of a language which
    >> represents strings as sequences of code points. It has no power to
    >> change how Unix works, nor how people name their files.
    >
    > How about quoted-printable?

    Let's see how it compares to U+0000-escaping:

    + uses an already established syntax (although it has not been used
      in this context)

    + names with invalid combinations of bytes are more human-readable
      than in other formats

    - names with valid but non-ASCII characters are human-unreadable

    - reading filenames from a text file or writing filenames to a text
      file will not "just work", because nobody else uses this convention;
      QP doesn't seem suitable as an encoding of contents of files,
      as applying it to regular prose mangles non-ASCII characters

    - ASCII names containing "=" are not encoded in the obvious way,
      so it's not a pure extension of ASCII filenames

    - if all characters are permitted to be escaped, encoding "/" or ".."
      can break security; this could be fixed by disallowing escaping
      ASCII characters besides "=", but then it's no longer pure QP and
      the point about using already established rules doesn't apply.
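    A small sketch of what applying quoted-printable to raw filename
    bytes would look like (using Python's binascii QP codec purely for
    illustration; the filenames are hypothetical and this is not tied
    to any real filesystem API):

    ```python
    import binascii

    # QP applied to the raw bytes of two hypothetical filenames.
    # "=" is QP's escape character, so an ASCII name containing "="
    # does not encode to itself:
    print(binascii.b2a_qp(b"report=final.txt"))  # b'report=3Dfinal.txt'

    # A stray non-UTF-8 byte stays visible as a readable "=E9" escape:
    print(binascii.b2a_qp(b"caf\xe9.txt"))       # b'caf=E9.txt'
    ```

    This illustrates both the readability advantage for invalid bytes
    and the "=" problem: plain ASCII names are not always preserved.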

    I think GNOME libraries provide the possibility of using URLs
    internally (I don't know the details of how this behaves). This is
    quite similar to QP in that it doesn't create the illusion that the
    strings used in the program and the strings used by the OS share a
    representation and can be passed between file contents and OS calls
    as opaque data. Worse, it uses different rules for path manipulation
    than the OS and most other programs.
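    The URL approach amounts to percent-encoding the filename bytes.
    A sketch of the idea using Python's urllib (my choice of library,
    not GNOME's actual code; the path is hypothetical):

    ```python
    from urllib.parse import quote, unquote_to_bytes

    raw = b"caf\xe9/notes"            # path containing a non-UTF-8 byte
    url_form = quote(raw, safe="/")   # percent-encode everything but "/"
    print(url_form)                   # caf%E9/notes

    # The mapping is reversible, but the program must now manipulate
    # paths under URL rules (e.g. "%2F" versus "/"), not the OS's rules.
    assert unquote_to_bytes(url_form) == raw
    ```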

    I still like the hack of U+0000-escaping. I have modified my
    implementation so that only sequences which would be invalid UTF-8
    fragments are permitted to be escaped. This establishes a bijection
    between all filenames and a subset of strings. It is a superset of
    the bijection between filenames which are valid UTF-8 and the
    strings decoded from them by true UTF-8. So this convention handles
    a strict superset of the filenames that pure UTF-8 handles, and for
    filenames handled by both, the two behave identically. Why is it bad
    (other than that it dares to use something other than true UTF-8)?
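    The message doesn't spell out the escape format, so here is one
    hypothetical realization of such a scheme (the function names and
    the exact escape shape, U+0000 followed by the raw byte's value as
    a code point, are my assumptions, not the author's implementation):

    ```python
    def decode_filename(raw: bytes) -> str:
        """Decode filename bytes as UTF-8; each byte that cannot start
        a valid sequence becomes U+0000 followed by the byte's value."""
        out, i = [], 0
        while i < len(raw):
            for n in (1, 2, 3, 4):          # UTF-8 sequences are 1-4 bytes
                try:
                    out.append(raw[i:i + n].decode("utf-8"))
                except UnicodeDecodeError:
                    continue
                i += n
                break
            else:
                out.append("\x00" + chr(raw[i]))  # escape the invalid byte
                i += 1
        return "".join(out)

    def encode_filename(s: str) -> bytes:
        """Inverse mapping: U+0000 escapes turn back into raw bytes.
        A strict implementation would also reject escapes of bytes
        that form valid UTF-8, to preserve the bijection."""
        out, i = bytearray(), 0
        while i < len(s):
            if s[i] == "\x00" and i + 1 < len(s):
                out.append(ord(s[i + 1]))
                i += 2
            else:
                out.extend(s[i].encode("utf-8"))
                i += 1
        return bytes(out)
    ```

    Because escapes are produced only for bytes that no valid UTF-8
    sequence covers, filenames that are valid UTF-8 decode exactly as
    plain UTF-8 would, and the round trip through both functions is
    lossless.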

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    


    This archive was generated by hypermail 2.1.5 : Tue Nov 29 2005 - 18:48:42 CST