Re: Representing Unix filenames in Unicode

From: Hans Aberg (haberg@math.su.se)
Date: Sun Nov 27 2005 - 23:58:08 CST

    On 28 Nov 2005, at 03:44, Doug Ewell wrote:

    > Whatever you guys decide, please let's not have any proposals to
    > "improve" UTF-8, or invent a mutant form of UTF-8, by giving it a
    > way to map these arbitrary byte sequences bijectively while
    > simultaneously retaining the existing properties of UTF-8. We had
    > that discussion a while back. The first one to suggest "fixing"
    > UTF-8 automatically loses.

    My guess is that it is simplest to store UTF-8 names as they are, as
    byte strings, at the low level, possibly with some information about
    whether the name is ASCII or UTF-8 (or possibly some other encoding),
    which is important in UNIX. The problem then arises of what to do
    when low-level filenames appear that cannot be given a UTF-8
    interpretation. Making the low-level file handling deal with that
    seems to be a bad idea: it does not need to, and interpretation will
    just complicate things and slow them down. So the idea I presented
    is simply to encode such names into consistent UTF-8 in a way that
    lets the original byte string be recovered. A UNIX context may,
    though, need more than one invertible byte-string-to-UTF-8 encoding,
    say if one is considering filenames, filepaths, or filepath
    sequences. The question is truly tricky, though. One must think
    through what will happen with all the standard UNIX programs that
    interpret byte strings and character strings. So I would prefer to
    leave it to the UNIX experts to work it out.
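
    To make the idea of an invertible byte-string encoding concrete, here
    is a minimal sketch in Python. It is only one possible scheme, not a
    proposal: the escape character "%" and the function names
    bytes_to_text and text_to_bytes are assumptions chosen for
    illustration. UTF-8 itself is left untouched; bytes that cannot be
    read as UTF-8 are written as "%XX", and a literal "%" is written as
    "%25" so that the mapping can be undone. Python's own
    "surrogateescape" error handler is used only as an intermediate step
    to find the undecodable bytes.

```python
ESC = "%"  # hypothetical escape character, chosen only for illustration


def bytes_to_text(raw: bytes) -> str:
    """Map any byte string to a str that encodes as ordinary, valid UTF-8."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into U+DCNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: write it as ESC plus two hex digits.
            out.append(f"{ESC}{cp - 0xDC00:02X}")
        elif ch == ESC:
            # Escape the escape character itself so decoding is unambiguous.
            out.append(f"{ESC}{ord(ESC):02X}")
        else:
            out.append(ch)
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Invert bytes_to_text. Assumes text was produced by bytes_to_text."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESC:
            hex_digits = text[i + 1:i + 3]
            if len(hex_digits) != 2:
                raise ValueError("truncated escape sequence")
            out.append(int(hex_digits, 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for name in (b"caf\xc3\xa9", b"caf\xe9", b"100%"):
        text = bytes_to_text(name)
        print(name, "->", text, "->", text_to_bytes(text))
```

    Under these assumptions, a well-formed name such as the bytes for
    "café" passes through unchanged, a Latin-1 style name ending in byte
    0xE9 becomes "caf%E9", and both convert back to the original bytes.
    The price is that a name containing "%" no longer looks the same
    after encoding, which is exactly the kind of side effect on existing
    programs that would have to be thought through.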

       Hans Aberg



    This archive was generated by hypermail 2.1.5 : Mon Nov 28 2005 - 01:22:54 CST