Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 06:50:01 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping in Unicode"

Previous message: Johannes Bergerhausen: "Re: US-ASCII (was: Re: Invalid UTF-8 sequences)"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping in Unicode"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping in Unicode"

Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <lars.kristan@hermes.si> writes:
>
> > Quite close. Except for the fact that:
> > * U+EE93 is represented in UTF-32 as 0x0000EE93
> > * U+EE93 is represented in UTF-16 as 0xEE93
> > * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)
>
> Then it would be impossible to represent sequences like
> U+EEEE U+EEBA U+EE93 in UTF-8, and conversion UTF-32 -> UTF-8
> -> UTF-32
> would not round-trip.
>
> Concatenation of UTF-8-encoded strings would not be equivalent to
> UTF-8-encoding of the concatenation of code points.
>

I am well aware of that fact, and I said so in my original mail:
--- quote ---
The other name for this is roundtripping. Currently, Unicode allows a
roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are several reasons
why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more valuable, even if it means
that the other roundtrip is no longer guaranteed:
--- end quote ---

Let me go a bit further. A UTF-16=>UTF-8=>UTF-16 roundtrip is only required
for valid codepoints other than the surrogates. But it also works for
surrogates unless you explicitly and intentionally break it. One can choose
to do so, or one can choose not to do so. I would opt for the latter (i.e.
keep it working), according to my statement (in the thread "When to
validate") that validation should be separated from other processing, where
possible. For reasons of performance and practicality, a conversion function
can contain validation, but if it does, this behavior should be switchable.
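
A minimal Python sketch of such a switch, to make the idea concrete. The
helper names `to_utf8` and `from_utf8` and the `validate` parameter are
hypothetical, introduced only for this illustration; the sketch relies on
Python's built-in "strict" and "surrogatepass" error handlers.

```python
def to_utf8(text: str, validate: bool = True) -> bytes:
    """Convert to UTF-8; `validate` is a hypothetical switch for this sketch."""
    # "strict" rejects lone surrogates; "surrogatepass" encodes them as-is.
    return text.encode("utf-8", "strict" if validate else "surrogatepass")


def from_utf8(data: bytes, validate: bool = True) -> str:
    """Inverse conversion, with the same hypothetical switch."""
    return data.decode("utf-8", "strict" if validate else "surrogatepass")


lone = "a\ud800b"  # contains an unpaired high surrogate

# Validation off: the surrogate survives the round trip.
raw = to_utf8(lone, validate=False)
assert from_utf8(raw, validate=False) == lone

# Validation on: the same input is rejected instead of converted.
try:
    to_utf8(lone)
except UnicodeEncodeError:
    print("rejected by validation")
```

With validation switched off the conversion is a pure transformation, and
the decision to reject data can be made elsewhere.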

A UTF-32=>UTF-8=>UTF-32 roundtrip is similar, except that 16-8-16 works even
with concatenation, while 32-8-32 can be broken with concatenation.

A UTF-32=>UTF-16=>UTF-32 roundtrip is not guaranteed (still talking about
surrogates), even without concatenation.
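
The effect can be reproduced in a few lines of Python. This is only a
sketch: a Python `str` stands in for the UTF-32 side (one element per code
point), and the "surrogatepass" handler is used so that the encoder does not
reject the surrogates.

```python
# Two separate surrogate code points, as they could appear in UTF-32.
pair = "\ud800\udc00"
assert len(pair) == 2

# Converted to UTF-16 they become two adjacent code units ...
utf16 = pair.encode("utf-16-le", "surrogatepass")
assert utf16 == b"\x00\xd8\x00\xdc"

# ... which a UTF-16 decoder reads as one supplementary character.
back = utf16.decode("utf-16-le")
assert back == "\U00010000"
assert back != pair  # the original two code points are not recovered
```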

All this is known and presents no problems, or at least only problems that
can be kept under control. So, by introducing another set of 128
'surrogates', we don't get a new type of problem, just another instance of a
well-known one.
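
A short Python sketch of that instance. It assumes the 128 code points are
U+EE80..U+EEFF and that each is written to UTF-8 as the single byte
(code point - 0xEE00), which is inferred from the U+EE93 <=> 0x93 example
quoted above; `encode_with_escapes` is a hypothetical name.

```python
ESC_LO, ESC_HI = 0xEE80, 0xEEFF  # assumed range of the 128 escape code points


def encode_with_escapes(text: str) -> bytes:
    """Encode to UTF-8, writing each escape code point as one raw byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - 0xEE00)  # e.g. U+EE93 -> 0x93
        else:
            out += ch.encode("utf-8")
    return bytes(out)


a = "\uEEE2"         # escape for byte 0xE2
b = "\uEE82\uEEAC"   # escapes for bytes 0x82 and 0xAC

# Concatenating the encoded pieces yields the valid UTF-8 form of U+20AC,
# so decoding returns the euro sign rather than the three escape code points.
joined = encode_with_escapes(a) + encode_with_escapes(b)
assert joined == "\u20ac".encode("utf-8")
assert joined.decode("utf-8") == "\u20ac"
```

This mirrors what happens when two unpaired surrogates end up next to each
other in UTF-16.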

On the other hand, UTF-8=>UTF-16=>UTF-8 as well as UTF-8=>UTF-32=>UTF-8 can
both be achieved, with no exceptions. This is something no other roundtrip
can offer at the moment. This fact alone should be enough to make us want to
have it.
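
A minimal Python sketch of how such a roundtrip could look, under stated
assumptions: the escape range is taken to be U+EE80..U+EEFF (inferred from
the U+EE93 <=> 0x93 example), the names `decode_lossless` and
`encode_lossless` are hypothetical, and the decoder is assumed to treat the
regular three-byte UTF-8 form of an escape code point as invalid bytes.

```python
ESC_BASE = 0xEE00
ESC_LO, ESC_HI = 0xEE80, 0xEEFF  # assumed range of the 128 escape code points


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, mapping every undecodable byte to an escape code point."""
    out = []
    i = 0
    while i < len(data):
        ch, n = None, 1
        for size in (1, 2, 3, 4):  # try the shortest valid sequence first
            try:
                ch = data[i:i + size].decode("utf-8")
            except UnicodeDecodeError:
                continue
            n = size
            break
        # Invalid byte, or a sequence that spells an escape code point:
        # escape the first byte only and resume at the next byte.
        if ch is None or ESC_LO <= ord(ch) <= ESC_HI:
            ch, n = chr(ESC_BASE + data[i]), 1
        out.append(ch)
        i += n
    return "".join(out)


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: escape code points become raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


samples = [b"plain", "caf\u00e9 \u20ac".encode("utf-8"), b"\x93",
           b"\xe2\x82", b"\xee\xba\x93", bytes(range(256))]
for raw in samples:
    assert encode_lossless(decode_lossless(raw)) == raw
```

Treating the regular encoding of an escape code point as invalid is what
keeps the byte roundtrip exact; without that rule, the bytes 0xEE 0xBA 0x93
would come back as the single byte 0x93.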

On top of that, I have repeatedly stressed that it is UTF-8 data that has the
highest probability of any of the following:
* contains portions that are not UTF-8
* is not really UTF-8, but user has UTF-8 set as default encoding
* is not really UTF-8, but was marked as such
* a transmission error not only changes data but also creates invalid
sequences

UTF-16 has a much smaller possibility of being 'corrupted'. There are no
legacy encodings that could creep in. The only things that could creep in
would be an LE/BE mixup, or 8-bit or 32-bit data. But that doesn't happen,
because those mixups don't work even with ASCII text and are quickly
detected and prevented.

So, not only is it the case that UTF-32=>UTF-8=>UTF-32 (or
UTF-16=>UTF-8=>UTF-16) doesn't roundtrip as-is; it doesn't really matter
whether it does or not, as long as it does for valid Unicode strings
containing no surrogates.

That does not mean we need not define how the conversions behave for the
surrogates (and I don't mean the values of surrogates in UTF-16). But that
is another issue.

> This is broken.
So, just as much as UTF-16 is broken.

This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 06:58:08 CST