From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 19 2005 - 16:02:16 CDT
From: "Kenneth Whistler" <kenw@sybase.com>
> Dean Snyder suggested:
>> Stateful mechanisms
>
> For bidirectional text, yes.
>
> But all extant schemes for the representation of bidirectional
> text involve stateful mechanisms. Would you care to supplant
> the last decade's work by the bidirectional committee and
> suggest a non-stateful mechanism that meets the same requirements
> for the representation of bidirectional text?
The only way I see to avoid stateful mechanisms with bidirectional scripts
would have been to use visual left-to-right order throughout the
encoding. Needless to say, this still does not work well, because soft
line breaks fall in the middle of paragraphs; otherwise the whole RTL
paragraph would have to be written in the opposite direction.
>> No support for a clean division between text and meta-text
>
> Would you care to suggest replacements for such widely
> implemented W3C standards as HTML and XML?
Maybe he is suggesting that Unicode encode non-characters for the purpose
of delimiting the textual and non-textual parts.
>> Legacy sludge
>
> This is the point on which I (and a number of other Unicode
> participants) are most likely to agree with you. The legacy
> sludge in Unicode was the cost of doing business, frankly.
> Legacy compatibility was what made the standard successful,
> because it could and can interoperate with the large number of bizarre
> experiments in character encoding which preceded it.
Thanks. I also approve of the fact that Unicode and ISO/IEC 10646 can
coexist peacefully with the many legacy encodings. Without that,
conversions would have been a nightmare, and as unpredictable as
conversions between the older legacy charsets. It means that almost all
legacy charsets can be converted very simply to Unicode (the reverse is not
necessarily true, of course), with Unicode acting as a compatible superset
of almost all of them.
(But in practice it is hard to convert ISO 2022 to Unicode without
stateful converters that know all the referenced charsets. There are also
some difficulties in converting other Teletext standard charsets that
encode combining characters BEFORE the base character, much like dead keys
on European keyboards: the converter needs some lookahead to reorder the
encoded characters.)
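
To illustrate the reordering problem, here is a minimal sketch in Python.
The byte values in PREFIX_MARKS are a toy table chosen for illustration,
not an actual Teletext code chart, and the function name is made up. The
converter buffers each prefix mark until it sees the base character, then
emits the base first, as Unicode requires.

```python
import unicodedata

# Hypothetical prefix-diacritic bytes mapped to Unicode combining marks.
# This table is an assumption for illustration, not a real code chart.
PREFIX_MARKS = {
    0xC1: "\u0300",  # combining grave accent
    0xC2: "\u0301",  # combining acute accent
    0xC8: "\u0308",  # combining diaeresis
}

def decode_prefix_diacritics(data: bytes) -> str:
    """Decode bytes in which a diacritic precedes its base character."""
    out = []
    pending = []  # marks seen but not yet attached to a base character
    for byte in data:
        if byte in PREFIX_MARKS:
            pending.append(PREFIX_MARKS[byte])
        else:
            # Assumption: every other byte maps directly to the code
            # point of the same value (ASCII / Latin-1).
            out.append(chr(byte))
            out.extend(pending)  # Unicode order: base first, then marks
            pending.clear()
    if pending:
        # Marks at end of input have no base; hang them on a no-break space.
        out.append("\u00a0")
        out.extend(pending)
    # Compose to precomposed characters where they exist.
    return unicodedata.normalize("NFC", "".join(out))

print(decode_prefix_diacritics(b"caf\xc2e"))
```

The point is only that the converter has to carry state from one byte to
the next, which a simple per-byte table lookup cannot do.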
>> >How will the "something better" solve these problems without
>> >introducing new ones?
>>
>> Subsequent encoding efforts will be better because they will have
>> learned from the mistakes of earlier encoders ;-)
I hope this will not be a revolution, but mostly corrections to the
character model, and a better definition of canonical equivalence (if such
a concept is still needed in the new standard, i.e. if there remain several
equivalent ways to encode the same abstract characters or grapheme
clusters).
>> Probably the single most important, and extremely simple, step to a
>> better encoding would be to force all encoded characters to be 4 bytes.
>
> Naive in the extreme. You do realize, of course, that the entire
> structure of the internet depends on protocols that manipulate
> 8-bit characters, with mandated direction to standardize their
> Unicode support on UTF-8?
I suppose he is speaking about UTF-16, which may effectively be deprecated
in time. I also doubt that UTF-8 will be deprecated soon, given that it has
no difficulties such as endianness problems (more or less solved by the
BOM).
I would expect that only one form of UTF-32 will remain (most probably
little-endian, given that most processors produced today are
little-endian, except the Motorola/Apple/IBM PowerPC; but I wonder whether
the PPC is not already prepared to work natively with little-endian
numbers, for example with an endian-mode control bit set by the OS. I don't
know its architecture or assembly language).
> The most serious mistake I see in the architecture resulted from
> the need to assign surrogates at D800..DFFF, instead of F800..FFFF.
> But it wasn't "hubris" that led to the prior assignment of
> a bunch of compatibility characters at FE30..FFEF -- just a lack
> of foresight about the eventual form of the surrogate mechanism.
And what about the non-characters at xFFFE and xFFFF? Would you have
assigned surrogates there? Then how would we have solved the endianness
"problem" for UTF-16 and UTF-32 if xFFFE and xFFFF were not already
non-characters, allowing the detection of the BOM?
My opinion is that UTF-16 will not survive in the long term (unlike UTF-8
and UTF-32), once all processors are at least 32-bit, including those in
small mobile devices and utility appliances. So surrogates will no longer
be needed...
We will then need a BOM only for UTF-32 (coded 00 00 FE FF or FF FE 00 00),
as long as big-endian architectures remain. But there is still no place to
put surrogates at the end of the 16-bit code unit space.
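
For concreteness, a small Python sketch of BOM sniffing. The function name
sniff_bom is made up; the BOM constants come from the standard codecs
module. Detection works because the byte-swapped form of U+FEFF is U+FFFE,
a non-character that should never start real text.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    # UTF-32 must be tested before UTF-16: the UTF-32 little-endian BOM
    # (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
        (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a UTF-16
little-endian BOM followed by U+0000. The sketch simply prefers UTF-32 in
that case.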
But I'm quite sure that something like UTF-24, with a fixed (little)
endianness (with the BOM unneeded and illegal, and possibly with ignored
trailing bytes for data alignment in internal memory only), may become
popular for serializing Unicode text on protocol streams. It would be
simpler and faster to decode, and to allocate and store with predictable
sizes, than UTF-8, which uses variable sizes.
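
As a rough sketch of what such a form could look like (no "UTF-24" is
defined by Unicode; the function names and the exact rules below are
assumptions for illustration only), each code point is written as exactly
three bytes, least significant byte first, with no BOM:

```python
def encode_utf24le(text: str) -> bytes:
    """Encode text as fixed 3-byte little-endian code points (hypothetical)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points are not allowed")
        out += cp.to_bytes(3, "little")
    return bytes(out)

def decode_utf24le(data: bytes) -> str:
    """Decode the hypothetical 3-byte little-endian form back to text."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], "little")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point {cp:#x} at byte offset {i}")
        chars.append(chr(cp))
    return "".join(chars)

encoded = encode_utf24le("A\U0001F600")
print(encoded.hex(" "), len(encoded))
```

The size is always three bytes per code point, so a receiver can allocate
its buffer from the byte count alone, which is the predictability argued
for above. The cost is three bytes even for ASCII text.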