From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 22 2004 - 09:10:54 CST
Philippe Verdy wrote:
> Please don't use the term "normalize" in this context. Normalization in
> Unicode involves transformation of the stream of *code points*, but is
> independent of their encoding form or encoding scheme.
Yes, I believe I shouldn't have used "normalization". I do not know if this
word has a general meaning and can be used when describing escaping or
TES-es. For now I will assume it can, until someone enlightens me. But yes,
since we're discussing Unicode (I still believe we are), it could be
confusing. So, let me use "pre-normalization", and please consider that in
the post being discussed this is what I meant. None of the references to
"normalization" was intended to mean Unicode Normalization.
I did want to point out the similarity though, and that pre-normalization
is typically strongly tied to Unicode Normalization. Where Unicode
Normalization is desired or required, pre-normalization also applies. There
could be cases where only one of the two needs to be used, but I believe the
two would usually go together. Specifically, when filenames are specified in
a UI, no normalization should be applied, unless the filesystem itself only
allows normalized instances of filenames. This goes for any normalization,
and any store (when selecting, not when searching).
For the sake of completeness, let me attempt to describe how a normalization
consisting of several sub-normalizations should be done, assuming my
escaping technique is used. Let me use a W3C example. W3C pre-normalization
should be done first. It can produce any codepoint, but is not affected
itself by Unicode Normalization, nor can it be affected by MUTF-8
pre-normalization (since it can produce no codepoints in the U+0000..U+007F
range). Next, MUTF-8 pre-normalization is applied, since it can produce
codepoints that will need to be Unicode Normalized. Last, Unicode
Normalization is applied.
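
The ordering above can be sketched in Python. This is only an illustration
under assumptions: html.unescape stands in for "W3C pre-normalization"
(taken here to mean resolving character references), mutf8_prenormalize is a
hypothetical placeholder because the escaping scheme is not specified in
this message, and NFC is picked arbitrarily as the Unicode Normalization
form.

```python
import html
import unicodedata


def mutf8_prenormalize(text: str) -> str:
    """Hypothetical placeholder for MUTF-8 pre-normalization.

    The escaping scheme is not defined in this message, so this
    stand-in returns its input unchanged.
    """
    return text


def normalize_all(text: str) -> str:
    # 1. W3C pre-normalization (assumed: resolve character references).
    text = html.unescape(text)
    # 2. MUTF-8 pre-normalization; it may yield codepoints that still
    #    need Unicode Normalization, so it runs before step 3.
    text = mutf8_prenormalize(text)
    # 3. Unicode Normalization proper (NFC chosen for illustration).
    return unicodedata.normalize("NFC", text)
```

With these stand-ins, an escaped combining accent is first resolved to a
codepoint and only then composed by the final step, which is why the
escaping steps have to run before Unicode Normalization.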
Now, I can assume someone would want to also use a CESU-8 pre-normalization
in that process. It needs to be done just before Unicode Normalization and
is not needed if data is in UTF-16 form at that point.
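
As a rough sketch of what CESU-8 pre-normalization amounts to at the
codepoint level: each high surrogate followed by a low surrogate is merged
into the supplementary codepoint it denotes. The function names are
illustrative only, and decode_cesu8 relies on Python's "surrogatepass"
error handler to let the 3-byte surrogate encodings through.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")


def _combine(match):
    # Standard UTF-16 arithmetic for a surrogate pair.
    high = ord(match.group(1)) - 0xD800
    low = ord(match.group(2)) - 0xDC00
    return chr(0x10000 + (high << 10) + low)


def cesu8_prenormalize(text: str) -> str:
    """Merge surrogate pairs; unpaired surrogates are left untouched."""
    return _SURROGATE_PAIR.sub(_combine, text)


def decode_cesu8(data: bytes) -> str:
    # "surrogatepass" makes the UTF-8 decoder accept encoded surrogates,
    # which a strict UTF-8 decoder would reject.
    return cesu8_prenormalize(data.decode("utf-8", "surrogatepass"))
```

In UTF-16 a surrogate pair already is the encoding of the supplementary
codepoint, so there is nothing left for this step to do there.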
Note that the above process is straightforward and only the order matters.
However, that is only true because of some assumptions. Things would get
complicated if:
A - MUTF-8 (escaping) used codepoints outside of the BMP. Then CESU-8
pre-normalization could produce a codepoint that would need MUTF-8
pre-normalization. The reverse is already true: MUTF-8 pre-normalization can
produce codepoints that need CESU-8 pre-normalization. In general, one would
need to alternate the two pre-normalizations until they both finish.
B - Some other escaping technique were used where I used the W3C as an
example, and this escaping technique used a non-ASCII codepoint (alone or
within the escape sequence). Again, pre-normalizations would need to be
alternated. If this codepoint were non-BMP, then all three would need to be
applied repeatedly.
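
Alternating the pre-normalizations "until they all finish" is a fixed-point
loop. A minimal sketch, assuming each pre-normalization is available as a
function from string to string (the name prenormalize_until_stable and the
max_rounds guard are made up for this illustration):

```python
from typing import Callable, Sequence

Step = Callable[[str], str]


def prenormalize_until_stable(
    text: str, steps: Sequence[Step], max_rounds: int = 100
) -> str:
    """Apply every step in turn, repeating until a full round changes nothing."""
    for _ in range(max_rounds):
        before = text
        for step in steps:
            text = step(text)
        if text == before:
            return text
    # Guard against steps that keep undoing each other's work.
    raise RuntimeError("pre-normalizations did not settle")
```

When the steps cannot feed each other, as in the straightforward case
described earlier, the loop changes the text in the first round at most and
the second round only confirms that nothing is left to do.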
OK, one more thing. I have tested my conversion with the controversial test
file mentioned in the "UTF-8 stress test file?" thread. I was very pleased
with the result (which I cannot say for the way Internet Explorer displays
the original file: it is not dropping invalid sequences, but some invalid
sequences 'eat' characters after them). But while I examined my output, I
noticed that I treated both unpaired and paired (CESU-8) surrogates as
invalid sequences. This puzzled me for a moment, since I never had any
intention to actively obstruct CESU-8. But I made the right decision when I
was implementing the conversion: CESU-8 input would not roundtrip
otherwise.
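
The roundtrip point can be illustrated with Python's built-in
"surrogateescape" error handler (PEP 383). Note that this is a different
escaping scheme from the one discussed in this message and is used here only
because it is readily available; it also treats encoded surrogates as
invalid bytes.

```python
cesu8_bytes = b"\xed\xa0\xbd\xed\xb8\x80"  # CESU-8 form of U+1F600

# Surrogates treated as invalid: every byte is escaped individually and
# is restored on the way back, so the input survives unchanged.
escaped = cesu8_bytes.decode("utf-8", "surrogateescape")
roundtrip_ok = escaped.encode("utf-8", "surrogateescape") == cesu8_bytes

# Surrogates accepted and paired up instead: re-encoding yields the
# 4-byte UTF-8 form, so the original 6 bytes cannot be recovered.
merged = "\U0001F600"
roundtrip_lost = merged.encode("utf-8") != cesu8_bytes
```

Here roundtrip_ok and roundtrip_lost both come out true, which is the
trade-off described above: a decoder that quietly accepts CESU-8 cannot also
guarantee a byte-exact roundtrip.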
Another example of what one could do is implement a very forgiving NON-UTF-8
decoder, which would preserve most invalid sequences, pre-normalize CESU-8
and partially pre-normalize MUTF-8 escape sequences. The implementation of
such a decoder would be close to mine, with the two checks that guarantee
the roundtrip removed, namely escaping the escapes and treating surrogates
as invalid sequences. An optional next step would be full MUTF-8
pre-normalization. If the output is not UTF-16, then CESU-8
pre-normalization would be the last step needed. The equivalent of the above
is:
1 - CESU-8 pre-normalization
2 - use unmodified MUTF-8 decoder
3 - one step (optionally full) MUTF-8 pre-normalization
4 - CESU-8 pre-normalization
But the same (especially when full pre-normalization is not required) can be
achieved more efficiently by modifying the behavior of the function itself.
Lars