Re: Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Nov 29 2005 - 18:47:14 CST


    "Chris Jacobs" <chris.jacobs@xs4all.nl> writes:

    >> So how do you propose to map filenames to strings on Unix?
    >>
    >> I'm asking from the point of view of a runtime of a language which
    >> represents strings as sequences of code points. It has no power to
    >> change how Unix works, nor how people name their files.
    >
    > How about quoted-printable?

    Let's see how it compares to U+0000-escaping:

    + uses an already established syntax (although it has not been used
      in this context)

    + names with invalid combinations of bytes are more human-readable
      than in other formats

    - names with valid but non-ASCII characters are human-unreadable

    - reading filenames from a text file or writing filenames to a text
      file will not "just work", because nobody else uses this convention;
      QP doesn't seem suitable as an encoding of contents of files,
      as applying it to regular prose mangles non-ASCII characters

    - ASCII names containing "=" are not encoded in the obvious way,
      so it's not a pure extension of ASCII filenames

    - if all characters are permitted to be escaped, encoding "/" or ".."
      can break security; this could be fixed by disallowing escaping
      ASCII characters besides "=", but then it's no longer pure QP and
      the point about using already established rules doesn't apply.
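    A small sketch of what applying quoted-printable to raw filename
    bytes would look like (using Python's binascii QP codec purely for
    illustration; the filenames are hypothetical and this is not tied
    to any real filesystem API):

    ```python
    import binascii

    # QP applied to the raw bytes of two hypothetical filenames.
    # "=" is QP's escape character, so an ASCII name containing "="
    # does not encode to itself:
    print(binascii.b2a_qp(b"report=final.txt"))  # b'report=3Dfinal.txt'

    # A stray non-UTF-8 byte stays visible as a readable "=E9" escape:
    print(binascii.b2a_qp(b"caf\xe9.txt"))       # b'caf=E9.txt'
    ```

    This illustrates both the readability advantage for invalid bytes
    and the "=" problem: plain ASCII names are not always preserved.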

    I think GNOME libraries provide the possibility of using URLs
    internally (I don't know the details of how this behaves). This is
    quite similar to QP in that it doesn't create the illusion that the
    strings used in the program and the strings used by the OS share a
    representation and can be passed between file contents and OS calls
    as opaque data. Worse, it uses different rules for path manipulation
    than the OS and most other programs.
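    The URL approach amounts to percent-encoding the filename bytes.
    A sketch of the idea using Python's urllib (my choice of library,
    not GNOME's actual code; the path is hypothetical):

    ```python
    from urllib.parse import quote, unquote_to_bytes

    raw = b"caf\xe9/notes"            # path containing a non-UTF-8 byte
    url_form = quote(raw, safe="/")   # percent-encode everything but "/"
    print(url_form)                   # caf%E9/notes

    # The mapping is reversible, but the program must now manipulate
    # paths under URL rules (e.g. "%2F" versus "/"), not the OS's rules.
    assert unquote_to_bytes(url_form) == raw
    ```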

    I still like the hack of U+0000-escaping. I have modified my
    implementation so that only sequences which would be invalid UTF-8
    fragments are permitted to be escaped. This establishes a bijection
    between all filenames and a subset of strings. It is a superset of
    the bijection between filenames which are valid UTF-8 and the
    strings decoded from them by true UTF-8. So this convention handles
    a strict superset of the filenames that pure UTF-8 handles, and for
    filenames handled by both, the two behave identically. Why is it bad
    (other than that it dares to use something other than true UTF-8)?
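    The message doesn't spell out the escape format, so here is one
    hypothetical realization of such a scheme (the function names and
    the exact escape shape, U+0000 followed by the raw byte's value as
    a code point, are my assumptions, not the author's implementation):

    ```python
    def decode_filename(raw: bytes) -> str:
        """Decode filename bytes as UTF-8; each byte that cannot start
        a valid sequence becomes U+0000 followed by the byte's value."""
        out, i = [], 0
        while i < len(raw):
            for n in (1, 2, 3, 4):          # UTF-8 sequences are 1-4 bytes
                try:
                    out.append(raw[i:i + n].decode("utf-8"))
                except UnicodeDecodeError:
                    continue
                i += n
                break
            else:
                out.append("\x00" + chr(raw[i]))  # escape the invalid byte
                i += 1
        return "".join(out)

    def encode_filename(s: str) -> bytes:
        """Inverse mapping: U+0000 escapes turn back into raw bytes.
        A strict implementation would also reject escapes of bytes
        that form valid UTF-8, to preserve the bijection."""
        out, i = bytearray(), 0
        while i < len(s):
            if s[i] == "\x00" and i + 1 < len(s):
                out.append(ord(s[i + 1]))
                i += 2
            else:
                out.extend(s[i].encode("utf-8"))
                i += 1
        return bytes(out)
    ```

    Because escapes are produced only for bytes that no valid UTF-8
    sequence covers, filenames that are valid UTF-8 decode exactly as
    plain UTF-8 would, and the round trip through both functions is
    lossless.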

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    


    This archive was generated by hypermail 2.1.5 : Tue Nov 29 2005 - 18:48:42 CST