Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 15 2004 - 08:06:11 CST


    "Arcane Jill" <arcanejill@ramonsky.com> writes:

    > Unix makes it possible for /you/ to change /your/ locale - but by
    > your reasoning, this is an error, unless all other users do so
    > simultaneously.

    Not necessarily: you can change the locale as long as it uses the same
    default encoding.

    By "error" I mean "a bad idea". The system does not prevent you from
    changing the locale to one that uses a different encoding. But then
    you are on your own and various things can break: terminal output
    will be mangled, you can't enter from the keyboard characters that
    belong to a different encoding, text files will be illegible, and
    Unicode programs which process texts may reject your data or even
    filenames. If you still need to change encodings, it's safer to use
    ASCII-only filenames.
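
    To illustrate the breakage described above, here is a minimal Python
    sketch (not part of the original message). The filename and the
    helper names try_decode and is_ascii_only are invented for the
    example; it assumes a name created under an ISO-8859-2 locale and
    later read by a program that expects UTF-8.

```python
# Hypothetical filename, created while the locale encoding was ISO-8859-2.
name = "żółw.txt"
raw = name.encode("iso-8859-2")  # the bytes actually stored in the directory


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def is_ascii_only(data: bytes) -> bool:
    """True if every byte is below 0x80, so ISO-8859-x and UTF-8 agree on it."""
    return all(b < 0x80 for b in data)


print(try_decode(raw, "iso-8859-2"))  # decodes back to the original name
print(try_decode(raw, "utf-8"))       # None: the bytes are not valid UTF-8
print(try_decode(raw, "iso-8859-1"))  # decodes, but to different characters
print(is_ascii_only(b"zolw.txt"))     # an ASCII-only name avoids the problem
```

    A strict UTF-8 program rejects the name outright, while a program
    assuming another 8-bit encoding silently shows the wrong letters.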

    This situation is temporary. Well, it may last 10 more years or so,
    but it will probably gradually improve:

    First, more protocols and file formats are becoming aware of character
    encodings and either label them explicitly or use a known encoding
    (generally some Unicode encoding scheme). This is especially true of
    protocols for data interchange over the Internet: WWW, email, Usenet,
    modern instant messaging protocols like Jabber. Some old protocols
    remain encoding-ignorant, e.g. IRC and finger. GNOME 1 used the locale
    encoding, GNOME 2 uses UTF-8. Copying & pasting text in the X Window
    System now has a separate API which uses UTF-8. While the IRC protocol
    doesn't specify the encoding, the irssi client can now recode texts
    itself to conform to the customs of particular channels.
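
    The per-channel recoding idea can be sketched as follows. This is
    only an illustration in Python, not irssi's implementation or
    configuration: the channel names, the CHANNEL_ENCODINGS table and
    the functions to_wire and from_wire are all hypothetical.

```python
# Hypothetical table: which encoding is customary on which channel.
CHANNEL_ENCODINGS = {"#polska": "iso-8859-2", "#unicode": "utf-8"}
DEFAULT_ENCODING = "utf-8"  # assumption for channels not listed above


def to_wire(text: str, channel: str) -> bytes:
    """Encode outgoing text in the encoding customary for the channel."""
    encoding = CHANNEL_ENCODINGS.get(channel, DEFAULT_ENCODING)
    # Characters the channel encoding cannot express become '?'.
    return text.encode(encoding, errors="replace")


def from_wire(data: bytes, channel: str) -> str:
    """Decode incoming bytes using the encoding customary for the channel."""
    encoding = CHANNEL_ENCODINGS.get(channel, DEFAULT_ENCODING)
    # Undecodable bytes become U+FFFD instead of raising an error.
    return data.decode(encoding, errors="replace")


print(to_wire("żółw", "#polska"))
print(to_wire("żółw", "#unicode"))
print(from_wire(to_wire("żółw", "#polska"), "#polska"))
```

    The client works with characters internally and picks the byte
    representation only at the protocol boundary.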

    Second, UTF-8 is becoming more usable as the default encoding
    specified by the locale. I don't use it now because too many things
    still break, but it's improving: there are things which didn't work
    just a few years ago and work now. Terminal emulators in X widely
    support UTF-8 mode now. The curses library now has a working wide
    character API. Emacs and vi work in UTF-8 (Emacs still has problems).
    Readline now works in UTF-8. Localized messages (gettext) are now
    recoded automatically.

    Other programs still don't work. Bash works, while zsh and ksh don't.
    Most full-screen text programs use the narrow character curses API and
    don't work in UTF-8. The brokenness of interactive interpreters of
    various languages varies.

    BTW, in the wide character curses API, the only way curses can work
    in a UTF-8 terminal, characters are expressed as sequences of wchar_t
    (base char + some combining chars, possibly double width). Which means
    that you must somehow translate filenames to this representation
    in order to display them - same as with a Unicode-based GUI. It's
    meaningless to render arbitrary bytes on the terminal, and you can't
    force curses to emit the original byte sequences which form filenames
    (which would be a bad idea for control characters anyway). By
    legitimizing non-UTF-8 filenames in a UTF-8 system you increase
    the problems such applications have to overcome: not only do they
    have to show control characters somehow, but also invalid UTF-8.
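
    As a rough illustration of the work such an application has to do,
    here is a Python sketch (not from the original message, and not the
    curses API itself). The function names display_form and cells are
    invented. It assumes filenames arrive as raw bytes, shows invalid
    UTF-8 and control characters as backslash escapes, and then groups
    the text into base character plus combining characters with an
    approximate column width.

```python
import unicodedata


def display_form(raw: bytes) -> str:
    """Decode a filename for display only; the result is not reversible."""
    # Bytes that are not valid UTF-8 are shown as \xNN escapes.
    text = raw.decode("utf-8", errors="backslashreplace")
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc":  # control character
            out.append("\\x%02x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def cells(text: str):
    """Group text into (cluster, width) pairs, one pair per screen cell.

    Approximation: a character with a nonzero canonical combining class is
    attached to the preceding base character; wide and fullwidth East Asian
    characters are counted as two columns.
    """
    result = []
    for ch in text:
        if unicodedata.combining(ch) and result:
            cluster, width = result[-1]
            result[-1] = (cluster + ch, width)
        else:
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            result.append((ch, 2 if wide else 1))
    return result


print(display_form(b"caf\xe9.txt"))  # invalid UTF-8 byte made visible
print(display_form(b"ok\x07.txt"))   # control character made visible
print(cells("e\u0301\u65e5"))        # combining accent, then a wide character
```

    Both steps are needed before anything can be handed to a
    character-based display layer, which is the extra burden the
    paragraph above refers to.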

    > But it goes beyond that. Copy a file onto a floppy disc and then
    > physically take that floppy disc to a different Unix machine and log
    > on as "guest" and insert the disc ... Will the filename look the same?

    Depends on the filesystem and the way it is mounted.

    For example, if it's FAT with long filenames (which I think is the
    usual format for floppies even on Unix), filenames can be recoded by
    the kernel: you specify the encoding to present filenames in and the
    encoding of short names. I don't know what happens with filenames
    which are not expressible in the selected encoding.

    In this way filenames may automatically convert between systems which
    use different default encodings, preserving the character semantics
    rather than the byte representation. Of course file contents will not
    be converted.
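
    The difference between preserving characters and preserving bytes
    can be shown with a small Python sketch. It is not a model of the
    kernel's code: the function name present and the example filename
    are invented, and it assumes the long name is stored on disk as
    UTF-16 and recoded to each system's encoding on access.

```python
def present(stored_utf16: bytes, system_encoding: str) -> bytes:
    """Recode a stored long filename into the bytes a given system would see."""
    return stored_utf16.decode("utf-16-le").encode(system_encoding)


# Hypothetical name as stored on the floppy.
stored = "żółw.txt".encode("utf-16-le")

on_latin2 = present(stored, "iso-8859-2")  # system A: ISO-8859-2 locale
on_utf8 = present(stored, "utf-8")         # system B: UTF-8 locale

# The byte sequences differ between the two systems...
print(on_latin2 != on_utf8)
# ...but each system reads the same characters in its own encoding.
print(on_latin2.decode("iso-8859-2") == on_utf8.decode("utf-8"))

# A name that the selected encoding cannot express. Here Python raises an
# exception; this says nothing about what a real kernel does in that case.
try:
    present("日本.txt".encode("utf-16-le"), "iso-8859-2")
except UnicodeEncodeError as exc:
    print("not expressible:", exc.reason)
```

    File contents are untouched by this kind of recoding; only the
    names pass through it.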

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

