From: Hans Aberg (haberg@math.su.se)
Date: Sun Nov 27 2005 - 23:58:08 CST
On 28 Nov 2005, at 03:44, Doug Ewell wrote:
> Whatever you guys decide, please let's not have any proposals to
> "improve" UTF-8, or invent a mutant form of UTF-8, by giving it a
> way to map these arbitrary byte sequences bijectively while
> simultaneously retaining the existing properties of UTF-8. We had
> that discussion a while back. The first one to suggest "fixing"
> UTF-8 automatically loses.
My guess is that it is simplest to store UTF-8 names as is, as byte
strings, at the low level, possibly with some information on whether
the name is ASCII or UTF-8 (or possibly some other encoding), which is
important in UNIX. Then the problem arises of what to do when low-level
filenames appear that cannot be given a UTF-8 interpretation. Making the
low-level file handling deal with that seems to be a bad idea: it
does not need to, and interpretation will just complicate and slow
things down. So the idea I presented is simply to encode such names as
consistent UTF-8 in a way that lets the original byte string be
recovered. A UNIX context may, however, need more than one
invertible byte-string UTF-8 encoding, say if one is considering
filenames, filepaths or filepath sequences. The question is truly
tricky, though. One must think through what will happen with all the
standard UNIX programs that interpret byte strings and character
strings. So I would prefer to leave it to the UNIX experts to work
it out.
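
To make the idea concrete, here is a minimal sketch of one possible
invertible mapping, layered on top of ordinary UTF-8 rather than changing
it. It is only an illustration under stated assumptions, not a worked-out
proposal: the function names and the choice of the private-use range
U+E080..U+E0FF as the escape range are hypothetical. Bytes that cannot be
decoded as UTF-8 are mapped to that range, and any genuine character from
that range in the input is escaped byte by byte, so the original byte
string can always be recovered.

```python
# Hypothetical escape range: byte 0xNN (0x80..0xFF) <-> code point U+E0NN.
ESCAPE_BASE = 0xE000
ESCAPE_LOW, ESCAPE_HIGH = 0xE080, 0xE0FF


def bytes_to_text(raw: bytes) -> str:
    """Map an arbitrary byte string to a string that encodes as valid UTF-8."""
    # "surrogateescape" (PEP 383) turns each undecodable byte 0xNN into the
    # lone surrogate U+DCNN; everything else is decoded as normal UTF-8.
    decoded = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in decoded:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: move it into the escape range.
            out.append(chr(ESCAPE_BASE + (cp - 0xDC00)))
        elif ESCAPE_LOW <= cp <= ESCAPE_HIGH:
            # Genuine character that collides with the escape range:
            # escape each of its UTF-8 bytes individually.
            out.extend(chr(ESCAPE_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Recover the original byte string from the output of bytes_to_text."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_LOW <= cp <= ESCAPE_HIGH:
            out.append(cp - ESCAPE_BASE)  # escaped raw byte
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


if __name__ == "__main__":
    name = b"caf\xe9.txt"               # Latin-1 bytes, not valid UTF-8
    shown = bytes_to_text(name)
    shown.encode("utf-8")               # succeeds: the result is valid UTF-8
    assert text_to_bytes(shown) == name
```

The mapping is one-to-one from byte strings to strings, but not every
string is the image of a byte string, and private-use characters may
already mean something else to other software. A path or a list of paths
would presumably need separators handled on top of this, which is part of
why more than one such encoding may be needed.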
Hans Aberg