RE: Nicest UTF

From: Lars Kristan ([email protected])
Date: Mon Dec 13 2004 - 06:06:38 CST

Next message: Lars Kristan: "RE: Nicest UTF"

Previous message: Arcane Jill: "Re: When to validate?"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: John Cowan: "Re: Nicest UTF"
Reply: John Cowan: "Re: Nicest UTF"

Marcin 'Qrczak' Kowalczyk wrote:
> > My my, you are assuming all files are in the same encoding.
>
> Yes. Otherwise nothing shows filenames correctly to the user.
UNIX is a multi-user system. One user can use one locale and might never see
files from another user who uses a different locale. And users can even
have filenames from the wrong locale in their own home directory, copied from
somewhere. Perhaps only a letter here and there does not display correctly,
but this doesn't mean the user can't use the file.

>
> > And what about all the references to the files in scripts?
> > In configuration files?
>
> Such files rarely use non-ASCII characters. Non-ASCII characters are
> primarily used in names of documents created explicitly by the user.
Rarely. So only rare systems will not boot after the conversion. And only
rare programs will no longer work. Is that acceptable?

Plus, it might not be as rare as you think. It might be far more common in a
country where not many people understand English and, on top of that, do not
use Latin letters.

Also, a script (a UNIX batch file) may have an ASCII name, but what if it
processes some user documents for some purpose and has a set of filenames
hardcoded in it? What about MRU lists? What about documents that link to
other documents?

Mass renaming is a dangerous thing. It should be done gradually and with
utmost care. And during this period, everything should keep working. If not,
users won't even start the process.

>
> > Soft links?
>
> They can be fixed automatically.
Ummmm, yes, not a good example. Except in the case where one decides to let
the user select an option to use U+FFFD instead of failing the conversion.
Then you need to be extra careful: rename any files that convert to the same
name, and keep track of everything so you can use the right names for the
soft links. But yes, it can be done. If, on the other hand, you adopt the
'broken' conversion concept, you can convert all filenames in a single
pass and don't need to build lists of softlinks, since you can convert them
directly.
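
To illustrate the U+FFFD collision problem described above, here is a minimal
Python sketch. The filenames and the helper names (to_utf8_lossy,
find_collisions) are hypothetical, chosen only for this example; the
conversion shown is Python's generic "replace" error handling, not any
particular renaming tool.

```python
# Two distinct legacy (Latin-1) filenames, held as raw bytes.
# Both names are made up for illustration.
names = [b"caf\xe9.txt", b"caf\xe8.txt"]


def to_utf8_lossy(name: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for every invalid byte."""
    return name.decode("utf-8", errors="replace")


def find_collisions(raw_names):
    """Group raw names by their lossy conversion; keep groups that collide."""
    seen = {}
    for raw in raw_names:
        seen.setdefault(to_utf8_lossy(raw), []).append(raw)
    return {new: olds for new, olds in seen.items() if len(olds) > 1}


collisions = find_collisions(names)
```

Both names convert to "caf\ufffd.txt", so a converter that takes this route
has to pick distinct new names and remember the mapping before it can repair
any soft links that point at the old ones.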

>
> > If you want to break things, this is definitely the way to do it.
>
> Using non-ASCII filenames is risky to begin with. Existing tools don't
> have a good answer to what should happen with these files when the
> default encoding used by the user changes, or when a user using a
> different encoding tries to access them.
Not really. On UNIX, it is all very well defined. A filename is a sequence
of bytes which is only interpreted when it is displayed. You can place a
filename in a script or a configuration file and the file will be identified
and opened regardless of your locale setting.
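
A small sketch of this point in Python, assuming a POSIX filesystem that
stores names as uninterpreted bytes (as typical Linux filesystems do). The
function name and the sample filename are hypothetical.

```python
import os
import tempfile


def roundtrip_byte_filename(raw_name: bytes, payload: bytes) -> bytes:
    """Create a file named by raw bytes, then reopen it by the same bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, raw_name)
        with open(path, "wb") as f:
            f.write(payload)
        # With a bytes argument, os.listdir returns the names undecoded.
        if raw_name not in os.listdir(tmp_bytes):
            raise RuntimeError("filesystem altered the byte name")
        with open(path, "rb") as f:
            return f.read()


# b"caf\xe9.txt" is a Latin-1 name and is not valid UTF-8.
data = roundtrip_byte_filename(b"caf\xe9.txt", b"hello")
```

No locale setting is consulted anywhere in this round trip; the bytes only
get interpreted if something tries to display the name.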

People like you and me avoid non-ASCII filenames. But not all users do.

> Mozilla doesn't show such filenames in a directory listing. You
> may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
> labeled as UTF-8 would be wrong too. There is no good solution to
> the problem of filenames encoded in different encodings.
There is no good solution. True. And I am trying to find one. And yes, I
would consider that a bug. They should probably use some escaping technique.
And, funny thing, you would probably accept the escaping technique. But if
you think about it, it is again representing invalid data with valid Unicode
characters. And if un-escaping needs to be done, it introduces all the
problems that you are pointing out for my 'broken' conversion. So, think of
my 128 codepoints as an escaping technique. One with no overhead. One with
little possibility of confusion. One that can be standardized and whoever
comes across it will know exactly what it is. Which is definitely not true
if we let each application devise its own escaping and there is no way they
can interoperate.
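
For a concrete feel of escaping invalid bytes into reserved code points, here
is a sketch using Python's "surrogateescape" error handler. This is an
existing scheme in a similar spirit, not the proposal discussed in this
thread: it maps each undecodable byte 0x80..0xFF (128 values) to a lone
surrogate U+DC80..U+DCFF. The helper names are hypothetical.

```python
def escape_decode(raw: bytes) -> str:
    """Decode UTF-8; each invalid byte 0xNN becomes the code point U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def escape_encode(text: str) -> bytes:
    """Inverse of escape_decode: U+DCNN turns back into the raw byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


legacy = b"caf\xe9.txt"           # Latin-1 bytes, invalid as UTF-8
escaped = escape_decode(legacy)   # "caf\udce9.txt"
restored = escape_encode(escaped)

valid = "café.txt".encode("utf-8")
unchanged = escape_decode(valid)  # valid UTF-8 passes through untouched
```

The conversion is one pass and reversible, so distinct byte names stay
distinct. The caveats raised in this thread still apply to the un-escaping
step: the escaped string is not well-formed Unicode text for interchange.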

> > As soon as you realize you cannot convert filenames to UTF-8, you
> > will see that all you can do is start adding new ones in UTF-8.
> > Or forget about Unicode.
>
> I'm not using a UTF-8 locale yet, because too many programs don't
> support it.
Like Mozilla. I am showing you the way programs can be made to work with
UTF-8 faster and easier. And really by fixing them, not by rewriting them.
At least some programs, or some portions of programs. Then developers can
concentrate on the things that do require extra attention, like strupr and
isspace (or their equivalents).

> I'm using ISO-8859-2.
In fact you're lucky. Many ISO-8859-1 filenames display correctly in
ISO-8859-2. Not all users are so lucky.
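
A quick way to see which bytes are "lucky" is to compare the two code pages
byte by byte. This is only a sketch using Python's built-in codecs; the
function name is hypothetical.

```python
def shared_high_bytes(enc_a: str = "iso-8859-1", enc_b: str = "iso-8859-2"):
    """Return the byte values 0xA0..0xFF that decode identically in both."""
    shared = []
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        if raw.decode(enc_a) == raw.decode(enc_b):
            shared.append(value)
    return shared


shared = shared_high_bytes()
e_acute_ok = 0xE9 in shared   # 0xE9 is e-acute in both code pages
e_grave_ok = 0xE8 in shared   # 0xE8 is e-grave in 8859-1, c-caron in 8859-2
```

A filename displays the same under both locales only if all of its non-ASCII
bytes fall in the shared set.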

> But almost all filenames are ASCII.
Basically, you are avoiding the problem altogether. A wise decision. But it
also means you don't know as much about this problem as I do.

Lars


This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 06:14:00 CST