Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Dec 12 2004 - 06:40:59 CST


    Lars Kristan <lars.kristan@hermes.si> writes:

    >> Please make up your mind: either they are valid and programs are
    >> required to accept them, or they are invalid and programs are required
    >> to reject them.
    >
    > I don't know what they should be called. The fact is there shouldn't be any.
    > And that current software should treat them as valid. So, they are not valid
    > but cannot (and must not) be validated. As stupid as it sounds. I am sure
    > one of the standardizers will find a Unicodally correct way of putting it.

    I am sure they will not.

    There is pressure to migrate from processing strings as bytes in
    some vaguely specified encoding to processing them as code points
    of a known encoding, or even further: as combining character
    sequences, graphemes, etc.
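    To make the "even further" step concrete, here is a small Python
    sketch of my own (not part of the original message): the same text
    can occupy a different number of code points depending on
    composition, and only a normalization-aware comparison treats the
    two forms as equal.

        import unicodedata

        s = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
        t = "caf\u00e9"    # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE

        print(len(s), len(t))                        # 5 4: code point counts differ
        print(s == t)                                # False: raw code-point comparison
        print(unicodedata.normalize("NFC", s) == t)  # True: same text after NFC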

    20 years ago the distinction was moot: a byte was a character,
    except in some specialized programs for handling CJK. Today, when
    Latin names with accented characters mixed with Cyrillic names are
    not displayed correctly, or are not sorted according to the
    lexicographic conventions of some culture, the program can be
    considered broken. Unfortunately, supporting this requires changing
    the paradigm: a font with 256 characters and a byte-based rendering
    engine is not enough for display, and for sorting it is no longer
    enough to compare one byte at a time.
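    To illustrate the sorting point, a minimal Python sketch of mine
    (not from the original message; it assumes the locale "en_US.UTF-8"
    is installed on the system):

        import locale

        names = ["Eva", "Zoe", "Éva"]

        # Plain comparison sorts by code point (for UTF-8, equivalently
        # by byte), so "Éva" (starting with U+00C9) lands after "Zoe".
        print(sorted(names))                       # ['Eva', 'Zoe', 'Éva']

        # Locale-aware collation places "Éva" next to "Eva".
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
        print(sorted(names, key=locale.strxfrm))   # ['Eva', 'Éva', 'Zoe']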

    You are trying to stick with processing byte sequences, carefully
    preserving the storage format instead of preserving the meaning in
    terms of Unicode characters. This leads to less robust software,
    which is never certain about the encoding of the text it processes
    and thus cannot apply algorithms like case mapping without risking
    meaningless damage to the text.
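    To show what that damage looks like (a hedged Python sketch of my
    own, not from the thread): case-mapping UTF-8 bytes under a wrong
    single-byte assumption can destroy the encoding itself.

        # "€" is U+20AC, encoded in UTF-8 as the three bytes E2 82 AC.
        data = "€".encode("utf-8")                  # b'\xe2\x82\xac'

        # Case-map the bytes as if they were Latin-1: the lead byte 0xE2
        # ('â') uppercases to 0xC2 ('Â'), which is a two-byte UTF-8 lead.
        mangled = data.decode("latin-1").upper().encode("latin-1")

        print(mangled)                              # b'\xc2\x82\xac'
        mangled.decode("utf-8")                     # raises UnicodeDecodeError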

    > Today, two invalid UTF-8 strings compare the same in UTF-16, after a
    > valid conversion (using a single replacement char, U+FFFD) and they
    > compare different in their original form,

    Conversion should signal an error by default. Replacing ill-formed
    sequences with U+FFFD should be done only when the data is processed
    purely for showing it to the user, with no further processing, i.e.
    when it is better to show the text partially even though we know it
    is corrupted.
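    In Python terms (an illustrative sketch of mine, not from the
    thread), the default is strict and replacement is opt-in; note how
    replacement also makes distinct invalid inputs collide, which is
    exactly the comparison problem quoted above.

        a = b"caf\xc3"   # truncated UTF-8: lead byte 0xC3 lacks its continuation
        b = b"caf\xff"   # 0xFF can never occur in UTF-8 at all

        # a.decode("utf-8")   # raises UnicodeDecodeError: strict by default

        print(a.decode("utf-8", errors="replace"))   # 'caf\ufffd'
        print(b.decode("utf-8", errors="replace"))   # 'caf\ufffd': two different
                                                     # byte strings, one result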

    > Either you do everything in UTF-8, or everything in UTF-16. Not
    > always, but typically. If comparisons are not always done in the
    > same UTF, then you need to validate. And not validate while
    > converting, but validate on its own. And now many designers will
    > remember that they didn't. So, all UTF-8 programs (of that kind)
    > will need to be fixed. Well, might as well adopt my broken
    > conversion and fix all UTF-16 programs. Again, of that kind, not all
    > in general, so there are few. And even those would not be all
    > affected. It would depend on which conversion is used where. Things
    > could be worked out. Even if we would start changing all the
    > conversions. Even more so if a new conversion is added and only used
    > when specifically requested.

    I don't understand any of this.

    > I cannot afford not to access the files.

    Then you have two choices:
    - Don't use Unicode.
    - Pretend that filenames are encoded in ISO-8859-1, and represent them
      as sequences of code points U+0001..U+00FF. They will not be displayed
      correctly, but the information will be preserved.
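    For the second option, a minimal Python sketch (my illustration,
    not from the message; on POSIX, os.listdir() with a bytes argument
    returns filenames as raw bytes):

        import os

        raw = os.listdir(b".")                      # filenames as raw bytes
        names = [f.decode("latin-1") for f in raw]  # each byte -> U+0001..U+00FF

        # The mapping is lossless, so the original bytes can always be recovered:
        assert [n.encode("latin-1") for n in names] == raw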

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

