From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 21 2004 - 01:43:34 CST
Mike Ayers wrote:
> Things that are impossible that I've noticed so far:
> - A metainformation system without holes in it.
UNIX filesystems (OK, old ones) are an example of an information system that
does not have metainformation about the encoding.
As for the holes, there are some gray areas in my solution, but they can be
worked out.
> - Addressing files with intermixed locales reliably.
> In a UTF-8 and ISO 8859-1 mixed environment, for instance,
> there is no way to know whether <c3> <a9> indicates "é" or
> "é". The Unix locale architecture does not permit mixed
> locales. What you propose is a locale of "ISO 8859-1 or
> UTF-8, your guess is as good as mine".
On UNIX, addressing files has nothing to do with locales. Each file can be
addressed reliably, in any locale (*). It is only the interpretation that is
not reliable. And UNIX locale architecture definitely DOES permit mixed
locales. Hence the issue. And the "ISO 8859-1 or UTF-8, your guess is as
good as mine" is not something I am trying to introduce. It is already
there. What I am trying to do is allow that confusion to endure a while
longer. That is not bad in itself; I think it can actually make the transition
quicker, not slower.
(*) MBCS can have some issues, similar to those of UTF-8. But: A - a lot of
it does work, B - what doesn't is a pain, C - those users typically only mix
an MBCS and ASCII (so, no mix at all). Europe, on the other hand, already
mixes several Latin encodings. When those get mixed with UTF-8, problems
will be more frequent than they are with MBCS.
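To make the ambiguity concrete, here is a small sketch (Python, chosen purely for illustration): the same two bytes from the example above decode to different text depending on the assumed encoding, yet those raw bytes address the file reliably under any locale.

    import os

    name_bytes = b"\xc3\xa9"          # the <c3> <a9> sequence from the example above

    # Interpretation depends entirely on the assumed encoding:
    print(name_bytes.decode("utf-8"))       # 'é'   (one character)
    print(name_bytes.decode("iso-8859-1"))  # 'Ã©'  (two characters)

    # Addressing does not: on UNIX the kernel only sees bytes, so the file
    # can be created and reopened with the same byte string under any locale.
    with open(name_bytes, "wb") as f:
        f.write(b"hello\n")
    assert name_bytes in os.listdir(b".")
    with open(name_bytes, "rb") as f:
        assert f.read() == b"hello\n"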
> - A scheme that translates all possible Unix
> filenames to unique and consistent Windows filenames. Case
> issues alone kill this.
Well, Windows actually does have the ability to handle filenames
case-sensitively. But yes, it is not widely used.
A reliable translation of UNIX filenames to Windows filenames is just one of
the possible goals (or uses) of my approach. If a 100% reliable solution cannot
be found, that does not mean we shouldn't be looking for the next best
approach.
My specific requirements were to store UNIX filenames in a Windows database
and allow them to be displayed properly on Windows. Case issues, '*' in
filenames and the like pose no problem for that part of the requirements.
I've seen filenames consisting solely of a newline, and I can deal with them.
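For the database-storage part, one illustration (Python again, and explicitly not the scheme under discussion, just a minimal sketch of the point that any byte string can be stored and displayed losslessly): escape every byte that is not safe printable ASCII, plus the characters Windows forbids, so even a filename consisting solely of a newline round-trips.

    # A minimal sketch (not the approach discussed here) of a lossless, reversible
    # mapping from arbitrary UNIX filename bytes to a Windows/database-safe string.
    WINDOWS_FORBIDDEN = set(b'<>:"/\\|?*%')   # '%' reserved as the escape character

    def escape_name(name: bytes) -> str:
        out = []
        for b in name:
            if 0x20 < b < 0x7F and b not in WINDOWS_FORBIDDEN:
                out.append(chr(b))            # safe printable ASCII passes through
            else:
                out.append("%%%02X" % b)      # everything else, incl. newline, is escaped
        return "".join(out)

    def unescape_name(text: str) -> bytes:
        out, i = bytearray(), 0
        while i < len(text):
            if text[i] == "%":
                out.append(int(text[i + 1:i + 3], 16))
                i += 3
            else:
                out.append(ord(text[i]))
                i += 1
        return bytes(out)

    assert unescape_name(escape_name(b"\n")) == b"\n"                    # newline-only filename
    assert unescape_name(escape_name(b"caf\xe9.txt")) == b"caf\xe9.txt"  # Latin-1 bytes

Case-only collisions and Windows reserved device names are deliberately left out of this sketch; that is exactly where the "next best approach" trade-offs begin.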
But let's do talk about translating UNIX filenames to Windows filenames.
Users who need the interoperability have learned not to use tricky
filenames, and not to use filenames that differ only in case (which is
a bad idea in itself; it doesn't process well in our brains). So they
adapted and have no problems now. But they have been using legacy encodings,
often even more than one, especially when they have lots of files and use a
language where only a few letters are non-ASCII; they were always able to
figure out which file is which. It only affected display, never
access. Well, a switch to UTF-8 will bring up lots of issues for them.
You think they will welcome the day and say "finally, I can solve this
mess". I think they will say "oh darn, it all worked before, is this really
necessary?".
Getting rid of legacy encodings is a goal. But not for many users. For most
of them, filenames are just a tool; their business comes first. Some can't
afford to dedicate a day to converting all the filenames.
Lars