RE: Is it roundtripping or transfer-encoding

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 22 2004 - 09:10:54 CST


    Philippe Verdy wrote:
    > Please don't use the term "normalize" in this context. Normalization in
    > Unicode involves transformation of the stream of *code points*, but is
    > independent of their encoding form or encoding scheme.

    Yes, I believe I shouldn't have used "normalization". I do not know whether
    this word has a general meaning that can be used when describing escaping or
    TES-es. For now I will assume it can, until someone enlightens me. But yes,
    since we're discussing Unicode (I still believe we are), it could be
    confusing. So, let me use "pre-normalization", and please read the post being
    discussed with that in mind: none of its references to "normalization" were
    intended to mean Unicode Normalization.

    I did want to point out the similarity, though, and that this
    pre-normalization is typically strongly tied to Unicode Normalization: where
    Unicode Normalization is desired or required, pre-normalization also applies.
    There could be cases where only one of the two needs to be used, but I
    believe the two would usually go together. Specifically, when filenames are
    specified in a UI, no normalization should be applied, unless the filesystem
    itself only allows normalized instances of filenames. This goes for any
    normalization, and any store (when selecting, not when searching).

    For the sake of completeness, let me attempt to describe how a normalization
    consisting of several sub-normalizations should be done, assuming my escaping
    technique is used. Let me use a W3C example. W3C pre-normalization should be
    done first: it can produce any codepoint, but is not itself affected by
    Unicode Normalization, nor can it be affected by MUTF-8 pre-normalization
    (since it can produce no codepoints in the U+0000..U+007F range). Next,
    MUTF-8 pre-normalization is applied, since it can produce codepoints that
    will need to be Unicode Normalized. Last, Unicode Normalization is applied.

    Now, I can imagine someone would also want to use a CESU-8 pre-normalization
    in that process. It needs to be done just before Unicode Normalization, and
    is not needed if the data is in UTF-16 form at that point.
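
    To spell out the ordering, here is a minimal sketch in Python. The individual
    passes are identity stubs standing in for the routines discussed above (which
    I am not reproducing here), and I use NFC purely for concreteness; only the
    order of application is the point.

        import unicodedata

        def w3c_prenormalize(text: str) -> str:
            return text   # stub: stands in for the W3C-style pre-normalization

        def mutf8_prenormalize(text: str) -> str:
            return text   # stub: stands in for the MUTF-8 (escaping) pre-normalization

        def cesu8_prenormalize(text: str) -> str:
            return text   # stub: stands in for CESU-8 pre-normalization

        def normalize_all(text: str) -> str:
            text = w3c_prenormalize(text)    # first: unaffected by the later passes
            text = mutf8_prenormalize(text)  # may produce codepoints that still need normalizing
            text = cesu8_prenormalize(text)  # skip if the data is in UTF-16 form at this point
            return unicodedata.normalize("NFC", text)  # Unicode Normalization last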

    Note that the above process is straightforward and only the order matters.
    However, that is only true because of some assumptions. Things would get
    complicated if:
    A - MUTF-8 (escaping) used codepoints outside of the BMP. Then CESU-8
    pre-normalization could produce a codepoint that would need MUTF-8
    pre-normalization. The reverse is already true. In general, one would need
    to alternate the two pre-normalizations until they both finish (see the
    sketch after this list).
    B - Some other escaping technique is used where I used the W3C as an example,
    and this escaping technique uses a non-ASCII codepoint (alone or within the
    escape sequence). Again, pre-normalizations would need to be alternated.
    If this codepoint is non-BMP, then all three would need to be applied
    repeatedly.
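
    A rough sketch of that "alternate until they both finish" idea, with the
    passes again left as stand-ins: apply the passes in a round, and repeat the
    round until a complete round changes nothing any more.

        from typing import Callable, Sequence

        def alternate_until_stable(text: str,
                                   passes: Sequence[Callable[[str], str]]) -> str:
            while True:
                before = text
                for p in passes:
                    text = p(text)
                if text == before:   # a full round changed nothing: fixed point reached
                    return text

        # e.g. alternate_until_stable(s, [mutf8_prenormalize, cesu8_prenormalize])
        # for case A, or all three passes for case B.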

    OK, one more thing. I have tested my conversion with the controversial test
    file mentioned in the "UTF-8 stress test file?" thread. I was very pleased
    with the result (which I cannot say for the way Internet Explorer displays
    the original file: it does not drop invalid sequences, but some invalid
    sequences 'eat' the characters after them). But while examining my output, I
    noticed that I treated both unpaired and paired (CESU-8) surrogates as
    invalid sequences. That puzzled me for a moment, since I never had any
    intention of actively obstructing CESU-8. But I made the right decision when
    I was implementing the conversion - CESU-8 input would not roundtrip
    otherwise.
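
    At the byte level the check amounts to something like the following sketch
    (the function name is mine, not taken from the actual implementation). Any
    3-byte sequence 0xED 0xA0..0xBF 0x80..0xBF decodes to a codepoint in
    U+D800..U+DFFF, i.e. a lone surrogate or one half of a CESU-8 pair; letting
    it through and re-encoding as UTF-8 would not reproduce the original bytes,
    so it is treated as an invalid sequence and escaped like any other.

        def is_surrogate_sequence(data: bytes, i: int) -> bool:
            # True if the 3 bytes starting at i encode a codepoint in U+D800..U+DFFF.
            return (i + 2 < len(data)
                    and data[i] == 0xED
                    and 0xA0 <= data[i + 1] <= 0xBF
                    and 0x80 <= data[i + 2] <= 0xBF)

        # The CESU-8 encoding of U+10000 is ED A0 80 ED B0 80; both halves trip the check.
        assert is_surrogate_sequence(bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]), 0)
        assert is_surrogate_sequence(bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]), 3)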

    Another example of what one could do is implement a very forgiving NON-UTF-8
    decoder, which would preserve most invalid sequences, normalize CESU-8 and
    partially pre-normalize MUTF-8 escape sequences. The implementation of such a
    decoder would be close to mine, with the two checks that guarantee the
    roundtrip removed, namely escaping the escapes and treating surrogates as
    invalid sequences. An optional next step would be full MUTF-8
    pre-normalization. If the output is not UTF-16, then CESU-8 pre-normalization
    would be the last step needed. The equivalent of the above is:
    1 - CESU-8 pre-normalization
    2 - use an unmodified MUTF-8 decoder
    3 - one-step (optionally full) MUTF-8 pre-normalization
    4 - CESU-8 pre-normalization
    But the same (especially when full pre-normalization is not required) can be
    achieved more efficiently by modifying the behavior of the function itself.
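
    Spelled out as code, that four-step equivalent might look roughly like the
    sketch below. The passes are identity stubs again, and whether a given pass
    works on the byte stream or on the decoded text is my reading of the steps,
    not something spelled out above; step 4 is only needed when the output is
    not UTF-16.

        def cesu8_prenormalize_bytes(data: bytes) -> bytes:
            return data          # stub: step 1, rewrite CESU-8 sequences in the byte stream

        def mutf8_decode(data: bytes) -> str:
            return data.decode("utf-8", "replace")   # stand-in for the unmodified decoder (step 2)

        def mutf8_prenormalize(text: str, full: bool = False) -> str:
            return text          # stub: step 3, one-step (or, with full=True, full) pre-normalization

        def cesu8_prenormalize_text(text: str) -> str:
            return text          # stub: step 4, only when the output is not UTF-16

        def forgiving_decode(data: bytes) -> str:
            text = mutf8_decode(cesu8_prenormalize_bytes(data))   # steps 1 and 2
            text = mutf8_prenormalize(text)                       # step 3
            return cesu8_prenormalize_text(text)                  # step 4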

    Lars


