Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 15 2004 - 10:27:54 CST

    Lars Kristan <lars.kristan@hermes.si> writes:

    > OK, strcpy does not need to interpret UTF-8. But strchr probably should.

    No. Its argument is a byte, even though it's passed as type int.
    By "byte" here I mean "C char value, which is an octet in virtually
    all modern C implementations; the C standard doesn't guarantee this
    but POSIX does".

    Many C functions are not suitable for processing UTF-8, or are
    suitable only as long as we consider all non-ASCII characters opaque
    bags of bytes. For example, isalpha takes a byte, toupper transforms
    a byte to a byte, and strncpy copies up to n bytes even if that cuts
    off in the middle of a UTF-8 character.
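
    To illustrate the strncpy problem, here is a minimal sketch of a copy
    that backs up to a character boundary before cutting. The names
    utf8_prefix_len and utf8_copy are hypothetical, and the sketch assumes
    the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which s
   can be cut without splitting a UTF-8 sequence. Assumes s is valid UTF-8. */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;
    size_t cut = max_bytes;
    /* Continuation bytes look like 10xxxxxx. If the byte at the cut is one,
       the cut would split a character, so step back to its first byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}

/* Hypothetical helper: copies as much of src as fits in dst (including the
   terminating NUL) without ending mid-character. Returns bytes copied. */
size_t utf8_copy(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && dst_size > 0);
    size_t n = utf8_prefix_len(src, dst_size - 1);
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

    With a 3-byte buffer and the input "a\xC3\xA9z", utf8_copy keeps only
    "a", where a plain byte-counted copy would keep "a" plus a lone 0xC3.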

    There are wide character versions like iswalpha and towupper. But then
    data must be converted from a sequence of char to a sequence of wchar_t.
    Standard and semi-standard functions that perform this conversion for
    UTF-8 reject invalid UTF-8 (they all have a means of reporting errors).
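
    As a sketch of what that error reporting looks like, the following
    converts with the standard mbrtowc and gives up at the first invalid or
    truncated sequence. decode_mb and use_utf8_locale are hypothetical
    names, and the locale names tried are assumptions that vary by system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: tries a few common UTF-8 locale names.
   Returns 1 if one of them could be selected, 0 otherwise. */
int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: converts the multibyte string in to wide characters
   using the current locale. Returns the number of wide characters written,
   or (size_t)-1 if the input is invalid or incomplete, or out is too small. */
size_t decode_mb(const char *in, wchar_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    size_t left = strlen(in);
    size_t count = 0;
    while (left > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, in, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (r == 0)
            break;                      /* NUL; not reached, strlen excludes it */
        if (count == out_cap)
            return (size_t)-1;          /* output buffer full */
        out[count++] = wc;
        in += r;
        left -= r;
    }
    return count;
}
```

    Once a UTF-8 locale is in effect, decode_mb fails on inputs such as a
    lone "\xC3" or "\xFF", so the original bytes cannot be recovered from
    the wchar_t side.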

    The assumption that wchar_t has something to do with Unicode is not as
    common as the assumption that char is a byte. I don't know whether
    FreeBSD finally changed their wchar_t to Unicode. And where it is
    Unicode, it can be UTF-32 (Unix) or UTF-16 (Windows).
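
    A program can at least check these assumptions at build time. The
    sketch below relies on the C99 macro __STDC_ISO_10646__, which an
    implementation defines when wchar_t values are ISO 10646 code points;
    the two function names are hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: 1 if the implementation promises that wchar_t
   values are ISO 10646 (Unicode) code points, 0 if it makes no promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if a single wchar_t can hold every code point up
   to U+10FFFF (UTF-32 style), 0 if it cannot (for example 16-bit wchar_t). */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

    A 0 from the first function does not prove wchar_t is something other
    than Unicode; it only means the implementation does not promise it.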

    > But then all languages are supposed to provide functions for
    > processing opaque strings in addition to their Unicode functions.

    Yes, IMHO all general-purpose languages should support processing
    arrays of bytes, in addition to Unicode strings.

    It's not clear, however, what the API for filenames should look like,
    especially for languages that wish to be portable to Windows.

    > But sooner or later you need to incorporate the filename in some
    > UTF-8 text. An error report, for example.

    While it's not clear what a well-behaved application should do by
    default, in order to be 100% robust and preserve all information
    you must change the usual conventions anyway. Remember that any byte
    except "\0" and "/" is valid in a filename, so you must either escape
    some characters, or delimit the filename with "\0", or prefix it with
    the length, or something like this. Backup software should do this
    and not pay attention to the locale. But for end-user software like
    an image viewer, processing arbitrary filenames is less important.
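
    One possible escaping convention, as a minimal sketch: copy printable
    ASCII as is and write every other byte, and the backslash itself, as
    \xHH. The function name and the choice of \xHH are assumptions made for
    illustration; a real program might prefer to pass valid UTF-8 through
    unescaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes an escaped, NUL-terminated copy of name into
   out. Printable ASCII other than backslash is copied unchanged; every
   other byte becomes the four characters \xHH. Returns the escaped length,
   or (size_t)-1 if out is too small. */
size_t escape_filename(const char *name, char *out, size_t out_size)
{
    assert(name != NULL && out != NULL);
    size_t o = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        int plain = *p >= 0x20 && *p < 0x7F && *p != '\\';
        size_t need = plain ? 1 : 4;
        if (o + need + 1 > out_size)    /* keep room for the final NUL */
            return (size_t)-1;
        if (plain) {
            out[o++] = (char)*p;
        } else {
            snprintf(out + o, 5, "\\x%02X", (unsigned)*p);
            o += 4;
        }
    }
    if (o + 1 > out_size)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

    Because the backslash is escaped too, the mapping is reversible, and
    the result is plain ASCII that can be embedded in an error report in
    any locale.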

    > What are stdin, stdout and argv (command line parameters) when a
    > process is running in a UTF-8 locale?

    Technically they are binary (command line arguments must not contain
    zero bytes). Users expect stdin and stdout to be treated as text or
    binary depending on the program, while command line arguments are
    generally interpreted as text or filenames.
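
    Since an argument is technically just bytes, a program that wants to
    treat it as UTF-8 text could check it first. This is a minimal,
    locale-independent sketch; is_valid_utf8 is a hypothetical name, and
    what to do with an argument that fails the check is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated string s is
   well-formed UTF-8, 0 otherwise. Rejects stray or missing continuation
   bytes, overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const char *s)
{
    assert(s != NULL);
    /* Smallest code point allowed for each number of continuation bytes. */
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        unsigned long cp;
        int extra;
        if (*p < 0x80) { p++; continue; }                 /* ASCII */
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;              /* stray continuation or invalid lead */
        p++;
        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return 0;           /* also catches the terminating NUL */
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min_cp[extra]) return 0;                 /* overlong */
        if (cp > 0x10FFFF) return 0;                      /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       /* surrogate */
    }
    return 1;
}
```

    A program could call is_valid_utf8(argv[i]) before printing an argument
    as text and fall back to treating it as an opaque byte string when the
    check fails.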

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

