RE: Nicest UTF

From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 13 2004 - 06:06:38 CST

    Marcin 'Qrczak' Kowalczyk wrote:
    > > My my, you are assuming all files are in the same encoding.
    >
    > Yes. Otherwise nothing shows filenames correctly to the user.
    UNIX is a multi-user system. One user can use one locale and might never see
    files from another user who uses a different locale. And users can even
    have filenames encoded for a different locale in their own home directory,
    copied from somewhere. Perhaps only a letter here and there does not display
    correctly, but this doesn't mean the user can't use the file.

    >
    > > And what about all the references to the files in scripts?
    > > In configuration files?
    >
    > Such files rarely use non-ASCII characters. Non-ASCII characters are
    > primarily used in names of documents created explicitly by the user.
    Rarely. So only rare systems will not boot after the conversion. And only
    rare programs will no longer work. Is that acceptable?

    Plus, it might not be as rare as you think. It might be far more common in a
    country where few people understand English and where, on top of that, the
    script in use is not Latin.

    Also, a script (a UNIX batch file) may have an ASCII name, but what if it
    processes some user documents and has a set of filenames hardcoded in it?
    What about MRU lists? What about documents that link to other documents?

    Mass renaming is a dangerous thing. It should be done gradually and with
    utmost care. And during this period, everything should keep working. If not,
    users won't even start the process.

    >
    > > Soft links?
    >
    > They can be fixed automatically.
    Ummmm, yes, not a good example. Except in the case where one decides to let
    the user select an option to use U+FFFD instead of failing the conversion.
    Then you need to be extra careful: rename any files whose names convert to
    the same name, and keep track of everything so you can use the right names
    for the soft links. But yes, it can be done. If, on the other hand, you
    adopt the 'broken' conversion concept, you can convert all filenames in a
    single pass and don't need to build lists of soft links, since you can
    convert them directly.
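
    The difference between the two conversions can be sketched in Python. This
    is only an illustration, not the conversion proposed in this thread: it
    assumes Python's "replace" error handler as the U+FFFD option and Python's
    "surrogateescape" error handler as a stand-in for a reversible 'broken'
    conversion. The sample names and helper functions are hypothetical.

```python
# Two legacy (ISO-8859-1 style) names that are not valid UTF-8.
legacy_names = [b"caf\xe9.txt", b"caf\xe8.txt"]


def decode_lossy(raw: bytes) -> str:
    """Convert with U+FFFD substitution; invalid bytes are lost."""
    return raw.decode("utf-8", errors="replace")


def decode_reversible(raw: bytes) -> str:
    """Convert while keeping each invalid byte as a distinct escape code point."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_reversible(name: str) -> bytes:
    """Undo decode_reversible and recover the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


lossy = {decode_lossy(n) for n in legacy_names}
reversible = {decode_reversible(n) for n in legacy_names}

# With U+FFFD both names collapse into one, so a collision has to be handled
# and soft links need a lookup table. The reversible form keeps them apart.
print(len(lossy), len(reversible))

# A soft link target can be converted on its own, with no rename list.
print(all(encode_reversible(decode_reversible(n)) == n for n in legacy_names))
```

    Because the reversible form is a pure function of the original bytes, a
    file and a soft link pointing at it end up with matching names no matter
    which one is converted first.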

    >
    > > If you want to break things, this is definitely the way to do it.
    >
    > Using non-ASCII filenames is risky to begin with. Existing tools don't
    > have a good answer to what should happen with these files when the
    > default encoding used by the user changes, or when a user using a
    > different encoding tries to access them.
    Not really. On UNIX, it is all very well defined. A filename is a sequence
    of bytes which is only interpreted when it is displayed. You can place a
    filename in a script or a configuration file and the file will be identified
    and opened regardless of your locale setting.
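
    A small Python sketch of that behaviour follows. It assumes a POSIX file
    system that accepts arbitrary bytes in names (typical on Linux; some other
    systems reject names that are not valid UTF-8). The helper name and the
    sample filename are made up for the example.

```python
import os
import tempfile


def roundtrip_raw_name(raw_name: bytes, payload: bytes) -> bytes:
    """Create a file under a raw byte name and read it back by the same bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        # Build the path as bytes so no locale-dependent decoding happens.
        directory = os.fsencode(tmp)
        path = os.path.join(directory, raw_name)

        with open(path, "wb") as handle:
            handle.write(payload)

        # Listing a bytes path returns bytes names, exactly as stored.
        if raw_name not in os.listdir(directory):
            raise RuntimeError("file system altered the byte name")

        with open(path, "rb") as handle:
            return handle.read()


# ISO-8859-1 bytes for "résumé.txt"; not valid UTF-8, but still a usable name.
data = roundtrip_raw_name(b"r\xe9sum\xe9.txt", b"hello")
print(data)
```

    Nothing in the sketch consults the locale: the bytes written into a script
    or configuration file are the bytes handed to the kernel.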

    People like you and me avoid non-ASCII filenames. But not all users do.

    > Mozilla doesn't show such filenames in a directory listing. You
    > may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
    > labeled as UTF-8 would be wrong too. There is no good solution to
    > the problem of filenames encoded in different encodings.
    There is no good solution. True. And I am trying to find one. And yes, I
    would consider that a bug. They should probably use some escaping technique.
    And, funny thing, you would probably accept the escaping technique. But if
    you think about it, it is again representing invalid data with valid Unicode
    characters. And if un-escaping needs to be done, it introduces all the
    problems that you are pointing out for my 'broken' conversion. So, think of
    my 128 codepoints as an escaping technique. One with no overhead. One with
    little possibility of confusion. One that can be standardized, so that
    whoever comes across it will know exactly what it is. Which is definitely
    not true if we let each application devise its own escaping, with no way
    for them to interoperate.
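
    For readers who want to see what a 128-codepoint escape looks like in
    practice, here is a sketch using Python's "surrogateescape" error handler.
    It is a later, Python-specific mechanism and not the codepoints proposed
    here; it maps each undecodable byte 0x80..0xFF to a lone surrogate in
    U+DC80..U+DCFF. The function names are hypothetical.

```python
def escape_table() -> dict:
    """Map each byte 0x80..0xFF to the code point that escapes it."""
    # A lone high byte is never valid UTF-8, so each one gets escaped.
    return {
        b: bytes([b]).decode("utf-8", errors="surrogateescape")
        for b in range(0x80, 0x100)
    }


def is_escape(ch: str) -> bool:
    """True if ch is one of the 128 escape code points used by this scheme."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


table = escape_table()

# Valid UTF-8 passes through untouched; only the stray byte 0xE6 is escaped.
mixed = b"\xc5\xbeaba-\xe6.txt"
text = mixed.decode("utf-8", errors="surrogateescape")

print(len(set(table.values())))          # number of distinct escape code points
print([hex(ord(c)) for c in text if is_escape(c)])
print(text.encode("utf-8", errors="surrogateescape") == mixed)
```

    One escape per invalid byte means the string is no longer than the
    original byte sequence, and any program that knows the range can
    recognize escaped data without guessing at an application-specific syntax.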

    > > As soon as you realize you cannot convert filenames to UTF-8, you
    > > will see that all you can do is start adding new ones in UTF-8.
    > > Or forget about Unicode.
    >
    > I'm not using a UTF-8 locale yet, because too many programs don't
    > support it.
    Like Mozilla. I am showing you the way programs can be made to work with
    UTF-8 faster and easier. And really by fixing them, not by rewriting them.
    At least some programs, or some portions of programs. Then developers can
    concentrate on the things that do require extra attention, like strupr,
    isspace (or their equivalents).

    > I'm using ISO-8859-2.
    In fact you're lucky. Many ISO-8859-1 filenames display correctly in
    ISO-8859-2. Not all users are so lucky.
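
    The overlap between the two encodings can be checked directly. The sketch
    below assumes Python's built-in "iso-8859-1" and "iso-8859-2" codecs; the
    helper name is invented for the example.

```python
def same_in_both(byte_value: int) -> bool:
    """True if the byte decodes to the same character in both encodings."""
    raw = bytes([byte_value])
    return raw.decode("iso-8859-1") == raw.decode("iso-8859-2")


# Only the upper half (0xA0..0xFF) holds printable non-ASCII characters.
shared = [b for b in range(0xA0, 0x100) if same_in_both(b)]
differing = [b for b in range(0xA0, 0x100) if not same_in_both(b)]

print(len(shared), len(differing))

# 0xE9 is e-acute in both tables, so a name using it displays correctly
# under either locale; 0xF5 is a different letter in each.
print(same_in_both(0xE9), same_in_both(0xF5))
```

    A name that uses only the shared positions looks right under either
    locale, which is why the mismatch can go unnoticed for a long time.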

    > But almost all filenames are ASCII.
    Basically, you are avoiding the problem altogether. A wise decision. But it
    also means you don't know as much about this problem as I do.

    Lars


