Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Dec 12 2004 - 06:40:59 CST


    Lars Kristan <lars.kristan@hermes.si> writes:

    >> Please make up your mind: either they are valid and programs are
    >> required to accept them, or they are invalid and programs are required
    >> to reject them.
    >
    > I don't know what they should be called. The fact is there shouldn't be any.
    > And that current software should treat them as valid. So, they are not valid
    > but cannot (and must not) be validated. As stupid as it sounds. I am sure
    > one of the standardizers will find a Unicodally correct way of putting it.

    I am sure they will not.

    There is pressure to migrate from processing strings as bytes in
    some vaguely specified encoding to processing them as code points
    of a known encoding, or even further: as combining character
    sequences, graphemes, etc.
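    To make the "even further" step concrete, here is a small Python
    sketch of my own (not part of the original message): the same text
    can occupy a different number of code points depending on
    composition, and only a normalization-aware comparison treats the
    two forms as equal.

        import unicodedata

        s = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
        t = "caf\u00e9"    # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE

        print(len(s), len(t))                        # 5 4: code point counts differ
        print(s == t)                                # False: raw code-point comparison
        print(unicodedata.normalize("NFC", s) == t)  # True: same text after NFC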

    20 years ago the distinction was moot: a byte was a character,
    except in some specialized programs for handling CJK. Today, when
    Latin names with accented characters mixed with Cyrillic names are
    not displayed correctly, or are not sorted according to the
    lexicographic conventions of some culture, the program can be
    considered broken. Unfortunately, supporting this requires changing
    the paradigm: a font with 256 characters and a byte-based rendering
    engine is not enough for display, and for sorting it is no longer
    enough to compare one byte at a time.
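    To illustrate the sorting point, a minimal Python sketch of mine
    (not from the original message; it assumes the locale "en_US.UTF-8"
    is installed on the system):

        import locale

        names = ["Eva", "Zoe", "Éva"]

        # Plain comparison sorts by code point (for UTF-8, equivalently
        # by byte), so "Éva" (starting with U+00C9) lands after "Zoe".
        print(sorted(names))                       # ['Eva', 'Zoe', 'Éva']

        # Locale-aware collation places "Éva" next to "Eva".
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
        print(sorted(names, key=locale.strxfrm))   # ['Eva', 'Éva', 'Zoe']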

    You are trying to stick with processing byte sequences, carefully
    preserving the storage format instead of preserving the meaning in
    terms of Unicode characters. This leads to less robust software,
    which is never certain about the encoding of the text it processes
    and thus cannot apply algorithms like case mapping without risking
    meaningless damage to the text.
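    To show what that damage looks like (a hedged Python sketch of my
    own, not from the thread): case-mapping UTF-8 bytes under a wrong
    single-byte assumption can destroy the encoding itself.

        # "€" is U+20AC, encoded in UTF-8 as the three bytes E2 82 AC.
        data = "€".encode("utf-8")                  # b'\xe2\x82\xac'

        # Case-map the bytes as if they were Latin-1: the lead byte 0xE2
        # ('â') uppercases to 0xC2 ('Â'), which is a two-byte UTF-8 lead.
        mangled = data.decode("latin-1").upper().encode("latin-1")

        print(mangled)                              # b'\xc2\x82\xac'
        mangled.decode("utf-8")                     # raises UnicodeDecodeError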

    > Today, two invalid UTF-8 strings compare the same in UTF-16, after a
    > valid conversion (using a single replacement char, U+FFFD) and they
    > compare different in their original form,

    Conversion should signal an error by default. Replacing ill-formed
    sequences with U+FFFD should be done only when the data is processed
    purely for showing it to the user, with no further processing, i.e.
    when it is better to show the text partially even though we know it
    is corrupted.
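    In Python terms (an illustrative sketch of mine, not from the
    thread), the default is strict and replacement is opt-in; note how
    replacement also makes distinct invalid inputs collide, which is
    exactly the comparison problem quoted above.

        a = b"caf\xc3"   # truncated UTF-8: lead byte 0xC3 lacks its continuation
        b = b"caf\xff"   # 0xFF can never occur in UTF-8 at all

        # a.decode("utf-8")   # raises UnicodeDecodeError: strict by default

        print(a.decode("utf-8", errors="replace"))   # 'caf\ufffd'
        print(b.decode("utf-8", errors="replace"))   # 'caf\ufffd': two different
                                                     # byte strings, one result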

    > Either you do everything in UTF-8, or everything in UTF-16. Not
    > always, but typically. If comparisons are not always done in the
    > same UTF, then you need to validate. And not validate while
    > converting, but validate on its own. And now many designers will
    > remember that they didn't. So, all UTF-8 programs (of that kind)
    > will need to be fixed. Well, might as well adopt my broken
    > conversion and fix all UTF-16 programs. Again, of that kind, not all
    > in general, so there are few. And even those would not be all
    > affected. It would depend on which conversion is used where. Things
    > could be worked out. Even if we would start changing all the
    > conversions. Even more so if a new conversion is added and only used
    > when specifically requested.

    I don't understand any of this.

    > I cannot afford not to access the files.

    Then you have two choices:
    - Don't use Unicode.
    - Pretend that filenames are encoded in ISO-8859-1, and represent them
      as sequences of code points U+0001..U+00FF. They will not be displayed
      correctly, but the information will be preserved.
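    For the second option, a minimal Python sketch (my illustration,
    not from the message; on POSIX, os.listdir() with a bytes argument
    returns filenames as raw bytes):

        import os

        raw = os.listdir(b".")                      # filenames as raw bytes
        names = [f.decode("latin-1") for f in raw]  # each byte -> U+0001..U+00FF

        # The mapping is lossless, so the original bytes can always be recovered:
        assert [n.encode("latin-1") for n in names] == raw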

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

