Re: Unicode and end users

From: John Cowan (cowan@mercury.ccil.org)
Date: Mon Feb 18 2002 - 14:39:28 EST


Lars Kristan scripsit:

> I need to store UNIX filenames in a UTF-16 database residing on Windows. If
> I use ANSI->Unicode, there is no problem. However, if I have a filesystem
> with filenames mainly in UTF-8? Nobody can guarantee that all of them will
> be in UTF-8. Some may still be in ANSI (well ISO). Actually, at some point
> in time, there will be UNIX servers with 50% of filenames in UTF-8 and 50%
> in ANSI (or something else for that matter).
>
> Hence my example of "ls > ls.out". My requirement is that there can be no
> data loss.

Frankly, your problem is insoluble, because you have set up self-contradictory
requirements. Suppose you are dealing with a filesystem where some names
are to be interpreted as Latin-1 and others as Latin-2. The kernel will
give you absolutely no help about which charset to use for which names,
nor are there any Unix utilities which would be able to cope. Filesystems
simply aren't meant to manage multiple charsets in names.
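
To make the ambiguity concrete, here is a minimal Python sketch. The byte
string is an invented example, not a name from any real filesystem. The same
bytes decode without error under both charsets, so nothing in the data itself
says which reading was intended.

```python
# Hypothetical file name as the kernel would hand it back: plain bytes.
raw_name = b"\xb1\xe6.txt"

# ISO 8859-1 (Latin-1): 0xB1 -> U+00B1, 0xE6 -> U+00E6
as_latin1 = raw_name.decode("iso8859-1")

# ISO 8859-2 (Latin-2): 0xB1 -> U+0105, 0xE6 -> U+0107
as_latin2 = raw_name.decode("iso8859-2")

# Both decodes succeed, and they disagree about what the name "is".
print(ascii(as_latin1))
print(ascii(as_latin2))
```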

Suppose some names were ASCII and some EBCDIC: what would you be able
to do then? (EBCDIC file names couldn't include 2F, but since that
is U+0007 = BELL, it isn't much of a problem.)

The only way to ensure "no data loss" is to store file names as
uninterpreted byte sequences, and forget about characters altogether.
Which is what the kernel actually does: only 00 and 2F mean anything to it.
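
A minimal Python sketch of that approach follows. The function names and the
hex-encoded key are my own illustration, not anything from the original
question; a binary column would serve just as well as hex. The check covers
only the two special bytes named above and ignores length limits and the
"." and ".." entries.

```python
def kernel_accepts(component: bytes) -> bool:
    """True if a single path component avoids the two bytes the kernel
    interprets: 00 (terminator) and 2F (separator)."""
    return (
        len(component) > 0
        and b"\x00" not in component
        and b"\x2f" not in component
    )


def to_db_key(name: bytes) -> str:
    """Hypothetical storage form: hex digits, safe in any text column."""
    return name.hex()


def from_db_key(key: str) -> bytes:
    """Inverse of to_db_key; returns the exact original bytes."""
    return bytes.fromhex(key)


# Invented names: one Latin-1, one UTF-8, one valid in neither as text.
names = [b"caf\xe9", b"caf\xc3\xa9", b"\xff\xfe"]

for name in names:
    assert kernel_accepts(name)
    # Round trip is byte-exact, whatever charset the name was "meant" to be.
    assert from_db_key(to_db_key(name)) == name
```

No charset is ever consulted here, which is the whole point: interpretation
can be deferred to display time, where a wrong guess costs nothing permanent.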

-- 
John Cowan           http://www.ccil.org/~cowan              cowan@ccil.org
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
        --_The Hobbit_


