From: Neil Harris (firstname.lastname@example.org)
Date: Mon Nov 28 2005 - 13:49:02 CST
Hans Aberg wrote:
> On 28 Nov 2005, at 03:44, Doug Ewell wrote:
>> Whatever you guys decide, please let's not have any proposals to
>> "improve" UTF-8, or invent a mutant form of UTF-8, by giving it a way
>> to map these arbitrary byte sequences bijectively while
>> simultaneously retaining the existing properties of UTF-8. We had
>> that discussion a while back. The first one to suggest "fixing"
>> UTF-8 automatically loses.
> My guess is that it is simplest to store UTF-8 names as is as
> byte-strings on the low level, possibly with some information whether
> it is ASCII or UTF-8 (or possibly some encoding), which is important
> in UNIX. Then the problem arises of what to do when low-level filenames
> appear which cannot be given a UTF-8 interpretation. Making the low-level
> file handling bother with that seems to be a bad idea: it does
> not need that, and interpretations will just complicate and slow
> things down. So then the idea I presented is to simply encode this to
> consistent UTF-8 in a way that the original byte string can be converted
> back. A UNIX context may though need more than one invertible
> byte-string UTF-8 encoding, say if one is considering filenames,
> filepaths or filepath sequences. The question is truly tricky though.
> One must think through what will happen with all standard UNIX
> programs that interpret byte strings and character strings. So I
> would prefer to leave it to those UNIX experts to work it out.
> Hans Aberg
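For illustration only, here is a rough sketch of what an invertible
byte-string-to-text mapping of the kind described above might look like.
It leans on Python's "surrogateescape" error handler, which is just one
possible scheme and not necessarily what was being proposed; the helper
names to_text and to_bytes are made up for this example. Note that the
escaped text contains lone surrogates, so it is not itself well-formed
Unicode, which is part of why the question is tricky.

```python
def to_text(raw: bytes) -> str:
    # Valid UTF-8 decodes normally; each undecodable byte 0x80-0xFF
    # is mapped to a lone surrogate in the range U+DC80-U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # Inverse mapping: escaped surrogates turn back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical filename stored as ISO 8859-1 bytes (not valid UTF-8).
legacy_name = b"caf\xe9.txt"
round_tripped = to_bytes(to_text(legacy_name))
assert round_tripped == legacy_name
```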
The set of ASCII strings is a proper subset of the set of UTF-8 strings,
so no information would need to be stored about which of those encodings
was being used.
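A quick way to convince yourself of the subset property, as a minimal
sketch (the sample names are invented for the example): any byte string
that decodes as ASCII decodes to the same text as UTF-8, while the
reverse does not hold.

```python
def is_ascii(raw: bytes) -> bool:
    # ASCII uses only the byte values 0x00-0x7F.
    return all(b < 0x80 for b in raw)


ascii_name = b"README.txt"
utf8_name = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9"

# Every ASCII string is a UTF-8 string with the same meaning...
same_text = ascii_name.decode("ascii") == ascii_name.decode("utf-8")

# ...but not every UTF-8 string is ASCII, so the subset is proper.
utf8_is_ascii = is_ascii(utf8_name)
```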
Now, ISO 8859-1, that's a different matter -- I suppose you could still
use the property that _almost all_ non-pure-ASCII ISO 8859-1 natural
language strings are not also valid UTF-8 strings for backwards
compatibility, and ditto for most other fixed 8-bit encodings, but I
certainly wouldn't be willing to trust my filesystem to this sort of hack.
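To make the "almost all" concrete, here is a small sketch (the sample
strings are my own, chosen for illustration): a typical ISO 8859-1 word
with an accented letter fails UTF-8 validation, but a contrived one
happens to pass and would be silently misread.

```python
def is_valid_utf8(raw: bytes) -> bool:
    # Strict decoding raises on any malformed sequence.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "cafe" with e-acute in ISO 8859-1: the byte 0xE9 starts a UTF-8
# three-byte sequence that is never completed, so validation fails.
typical = "caf\u00e9".encode("latin-1")       # b"caf\xe9"

# A-tilde followed by the copyright sign in ISO 8859-1 is the byte
# pair C3 A9, which is also the valid UTF-8 encoding of e-acute.
ambiguous = "\u00c3\u00a9".encode("latin-1")  # b"\xc3\xa9"

typical_ok = is_valid_utf8(typical)
ambiguous_ok = is_valid_utf8(ambiguous)
misread_as = ambiguous.decode("utf-8")
```

The second case is the one that makes the heuristic unsafe: the check
passes, yet the text you get back is not what was stored.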