From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 21 2004 - 07:14:40 CST
Philippe Verdy wrote:
> No, it won't happen, because Unicode and ISO/IEC-10646
> already states that
> it encodes abstract characters.
I see that as a technicality. What matters are the consequences of rules, not
the rules themselves. The consequences of breaking a rule should be analyzed
(thoroughly and carefully), and if they are acceptable (manageable) and the
usefulness is established, then the rules need to be reinterpreted. So, I think
the UTC needs to interpret the rules, not follow them literally. The rest of us
should be allowed to try to interpret the rules on our own and make
suggestions. An attempt to break a rule should not constitute a show-stopper
for a useful concept, especially not while analysis of the consequences is
still in progress.
>
> This finally mean that you want these codepoints recognized
> as characters
> sometimes,
And that is exactly how they should be treated by UTFs. And they already
are. There is no conflict there.
> but not when you perform the conversion with a
> transform-encoding-syntax. A transform-encoding-syntax must
> also not modify
> the codepoints represented by an encoding scheme (or
> charset), and UTFs have
> also the property of having a single representation of these
> codepoints
OK, so it introduces multiple representations of the same codepoints. Every
escaping technique does that, and it is not a problem. All you need to do is
define the normalization procedure and use it where it applies. In many
cases its use is not even necessary. Specifically, a Unicode system does not
need to (and should not) normalize the escape codepoints. The need for
normalization only has to be determined for an application that uses the
TES itself, and applies in only a few cases.
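A rough sketch of the escaping described above, in Python. This is not the actual proposal: the 128 escape codepoints are unassigned, so U+EE80..U+EEFF (a PUA range) stands in for them here, one codepoint per invalid byte 0x80..0xFF, and `decode_mutf8`/`encode_mutf8` are my own illustrative names.

```python
# Hypothetical escape block standing in for the proposed 128 characters.
ESCAPE_BASE = 0xEE80

def decode_mutf8(data: bytes) -> str:
    """Decode UTF-8, but map each byte of an invalid sequence to an
    escape codepoint instead of raising an error or emitting U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        for length in (1, 2, 3, 4):
            try:
                out.append(data[i:i + length].decode('utf-8'))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            # No well-formed sequence starts here: escape the raw byte.
            out.append(chr(ESCAPE_BASE + (data[i] - 0x80)))
            i += 1
    return ''.join(out)

def encode_mutf8(text: str) -> bytes:
    """Turn escape codepoints back into their original raw bytes,
    so the round trip preserves invalid sequences exactly."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
            out.append(0x80 + (cp - ESCAPE_BASE))
        else:
            out += ch.encode('utf-8')
    return bytes(out)
```

In these terms, the multiple representation is that an escape codepoint can also arrive as ordinary, valid input; the normalization question is what `encode_mutf8` should do with it then.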
CESU-8 has similar problems. If it is misinterpreted as Unicode, it
self-normalizes if it trips through UTF-16. My data self-normalizes if it
trips through non-UTF-8 (or shall we call it MUTF-8, Mostly-UTF-8, at the
risk of being called a mutant :). CESU-8 is slightly simpler, because it
self-normalizes completely and can also always be normalized back to a
CESU-8 representation. My conversion only normalizes partially (it only
normalizes completely if tripped length/3 times, in the worst case). Also,
after a full normalization you can no longer tell how many times it was
escaped in its original form. In real life, this is often acceptable and far
better than not being able to handle invalid sequences as gracefully as the
MUTF-8 conversion does.
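CESU-8's double encoding, and its self-normalization through UTF-16, can be shown concretely. A sketch: `cesu8_encode` is my own illustrative name, and Python's 'surrogatepass' error handler plays the role of a lenient decoder that misinterprets CESU-8 as UTF-8.

```python
def cesu8_encode(text: str) -> bytes:
    """CESU-8: a supplementary character is first split into a UTF-16
    surrogate pair, then each surrogate is encoded as a 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            for s in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
        else:
            out += ch.encode('utf-8')
    return bytes(out)

supp = '\U00010400'
cesu = cesu8_encode(supp)        # 6 bytes: two 3-byte surrogate sequences
utf8 = supp.encode('utf-8')      # 4 bytes: the real UTF-8 form

# A lenient decoder plus one trip through UTF-16 "self-normalizes"
# the CESU-8 form into proper UTF-8:
pairs = cesu.decode('utf-8', 'surrogatepass')
via16 = pairs.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert via16.encode('utf-8') == utf8
```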
The above is a loose description of what happens. Not all cases are covered
systematically, but they can be. You can define, for example, that escape
sequences which normalize to new escape sequences, or to invalid sequences in
UTF-8, are valid (or expected). Those that normalize to other codepoints
could be considered invalid, or ill-formed. But again, that only matters
in a few specific cases. It matters if you were handling users this way, but
doesn't if you are mapping filenames, and even less if one wants to apply this
technique to editing text files.
There are two options for using this technique:
A - You can treat it as 'use it in rare cases'. UTF-8 then remains what it
is and existing Unicode applications already treat those codepoints exactly
as they should.
B - You can start using it wherever you convert to or (rather, and) from
UTF-8. Typically you need to do it in both directions, or else you risk
over-escaping in one direction and self-normalization in the other. The latter
can even be useful in some cases, specifically where graceful handling is
desired but roundtripping is not required.
Now, case B is what I said I would not be trying to do, namely replacing the
UTF-8 conversions with a new conversion. But the consequences of that can be
determined. In the long run it actually reduces the risks of over-escaping
and self-normalization. The major 'problem' most people brought up is that it
threatens to introduce invalid sequences into UTF-8, which would mean that
all UTF-8 readers would need to start handling them. Perhaps. If they knew
how, it wouldn't be that hard anyway. But then again, what about the period
when they don't, and what if they decide never to? Well, does it really
matter whether they got the data directly from a corrupted source or from an
application that managed to preserve it and reconstruct it? So, it is not
introducing, it is preserving.
It is a question of signalling, or raising an exception. Some applications
have no way of signalling an error. Signalling "as early as possible" is, in
my opinion, an excuse in this case. Signalling should be done at the point
where the user can make decisions and is able to fix the problem. And even at
that point, you have users who do want the signalling and users who don't,
and the latter are the majority. From the perspective of a standardizer that
can be seen as unwise, but in real life, usability prevails. Did you ever see
an ls command on UNIX warn you about invalid sequences? Of course not; it
would be completely unusable. Well, the fact is many UTF-8 decoders (or
renderers) don't even use U+FFFD, they simply drop the sequence. Very bad.
But no matter how you improve it, signalling will never be an option, not in
ls, not while rendering. And U+FFFD is not a very good option either.
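Both behaviors are easy to reproduce with a stock decoder; in Python's terms, for example:

```python
data = b'gr\xfcn'   # "grün" in Latin-1; 0xFC starts no valid UTF-8 sequence

# Replacing with U+FFFD at least leaves a visible mark...
print(data.decode('utf-8', errors='replace'))   # gr�n

# ...while dropping makes the damage invisible. Either way the
# original byte is gone for good; there is no way back.
print(data.decode('utf-8', errors='ignore'))    # grn
```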
> If you want to strictly limit the case where escaping of
> valid characters
> will happen, the best option you have in Unicode is to use
> PUAs which are
> the least likely to happen in original strings (of characters and
> non-characters), in absence of an explicit agreement.
Assigning new characters is then even better.
>
> Note that a Transfer-Encoding-Syntax, to be usable, requires
> an explicit
> mutual agreement to allow the conversion in either direction.
That explicit agreement is one of the things I am trying to avoid. It can be
avoided, and that is the intent of standards.
But I am not so sure this should be called a TES after all. It has often been
suggested or implied that what I do is completely internal and enclosed, but
that is not true. I started by storing the filenames in UTF-16. But,
eventually, the filenames can be displayed on Windows, or created in a
Windows filesystem (with a few additional restrictions compared to
displaying, but only those that had already existed before).
> PUA, or it may even ignore "silently" these PUAs in the
> rendered graphic,
> signaling elsewhere to the user that not all characters could
> be rendered
I would say, "may, but *only* if it signals". And the same goes for invalid
sequences. But far too often it is not done that way. A lot of it will need
to be fixed. By using U+FFFD? There is a better way: use 128 new characters.
You can look at all this from the other end. First, allow (and provide a
means for) renderers to display an invalid UTF-8 sequence (for example in an
ls command). A useful thing. The rest comes naturally.
> There are tons of existing TES used everyday in many
> applications, and none
> of them required the allocation of distinct codepoints for
> the encoded
> strings they generate. Why do you want new characters for
> this mapping? It's
> not necessary as demonstrated by all the other existing TES...
Four reasons:
1 - Display. Having new characters (or escape codepoints with a defined
appearance) allows the text to remain visually similar. The length of the
text is preserved in many cases, words are easier to read (or deduce), and
line breaks cause fewer problems. All pretty similar to how mixed-encoding
environments have behaved all this time. No other escaping technique can
provide this. BTW, U+FFFD can, but is lossy.
2 - Other escaping techniques do not retain the usual assumption that UTF-16
is at most twice as big (in bytes) as UTF-8 or MUTF-8, which can lead to
bugs and increased memory consumption.
3 - The PUA solution works well, but has some inherent risks, and it cannot
be standardized.
4 - Anyone who encounters the same problem that I have encountered might
devise a new escaping technique, adding a few kilos to those tons.
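The second reason comes down to simple arithmetic. In the sketch below, U+EEFF again stands in for a hypothetical escape character; the proposed codepoints are not actually assigned.

```python
# One invalid byte occupies 1 byte in the MUTF-8 (byte) form.

# Escaping it as text, e.g. "%FF": three characters, six bytes of
# UTF-16 -- 6x the original, so a buffer sized at 2x the UTF-8
# length (the usual assumption) overflows.
assert len('%FF'.encode('utf-16-le')) == 6

# Escaping it as a single BMP codepoint: two bytes of UTF-16,
# exactly 2x the original byte, so the assumption survives.
assert len('\uEEFF'.encode('utf-16-le')) == 2
```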
It was impossible to afford assigning any, let alone 128, codepoints in an
SBCS. Nobody thought of it for MBCS, but those were dealing with conversions
from SBCS, which don't have invalid sequences and have very few unassigned
positions, and so were able to preserve the invalid sequences. If UTF-8 were
to replace them all, we wouldn't need this either, since UTF-8 also CAN
preserve invalid sequences. Well, it would be nice if they could be displayed
and collated, but perhaps even that would succeed, since there would be no
other UTFs and the many-to-one issue would not exist. The problem of invalid
sequences is a Unicode problem. Not addressing it will not make it go away.
Lars
This archive was generated by hypermail 2.1.5 : Tue Dec 21 2004 - 07:23:14 CST