Implementation of the roundtripping (was RE: Roundtripping in Unicode)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Dec 16 2004 - 10:23:09 CST


    Mark Davis wrote:
    > I see more of what you are trying to do; let me try to be more clear.
    > Suppose that the conversion is defined in the following way, between
    > Unicode strings (D29a-d, page 74) and UTFs using your proposed new
    > characters, for now with private use code points U+E080..U+E0FF.

    U+E080 is everyone's first choice (including my implementor's) for
    anything, and is therefore not very suitable. Also, AFAIK, U+E000..U+EDFF
    are used for EUDCs in some MBCS encodings. For the record, my choice was
    U+EE80..U+EEFF.

    But I'll keep the rest of the response in line with your range.

    >
    > U8-UTF32. To convert a Unicode 8-bit string to UTF-32:
    > 1. Set the pointer to the start
    > 2. If the sequence starting at the pointer is a valid UTF-8 sequence
    > (checking of course to make sure it doesn't go off the end of the
    > string), convert it and emit.

    With one addition: if the obtained value falls into the range of the escape
    codepoints (U+E080 to U+E0FF), jump to 3. Effectively, escape the escapes.
    Without this, the roundtrip is not achieved. My implementor made this
    oversight too, as did some other people in this thread.

    > 3. Otherwise take the byte B following the pointer, and emit
    > [E000 + B].

    Assuming by 'following the pointer' you meant 'at the pointer'.
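
    The three steps, together with the 'escape the escapes' addition, can be
    sketched in Python as follows. This is only an illustration under stated
    assumptions: it uses Mark's range U+E080..U+E0FF, it relies on Python's
    strict UTF-8 decoder to decide what a valid sequence is, and the names
    (u8_to_utf32, _seq_len, ESC_*) are made up for this example rather than
    taken from any implementation mentioned in this thread.

```python
# Assumed constants: Mark's range, escape = E000 + B for bytes 0x80..0xFF.
ESC_BASE = 0xE000
ESC_LO, ESC_HI = 0xE080, 0xE0FF


def _seq_len(lead):
    """Length of the UTF-8 sequence announced by a lead byte, 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that can never start a sequence


def u8_to_utf32(data):
    """Convert an arbitrary byte string to a list of code points."""
    out = []
    i = 0                                   # step 1: pointer at the start
    while i < len(data):
        n = _seq_len(data[i])
        cp = None
        if n and i + n <= len(data):        # do not run off the end
            try:
                # step 2: strict decode rejects overlongs and surrogates
                cp = ord(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                cp = None
        if cp is not None and not (ESC_LO <= cp <= ESC_HI):
            out.append(cp)
            i += n
        else:
            # step 3: escape the byte AT the pointer; this branch is also
            # taken for a validly encoded escape code point (the addition).
            out.append(ESC_BASE + data[i])
            i += 1
    return out


print([hex(c) for c in u8_to_utf32(b"a\xff")])  # ['0x61', '0xe0ff']
```

    With this sketch, the bytes EE 82 80 (the UTF-8 form of U+E080) come out
    as the three escapes U+E0EE, U+E082, U+E080 instead of a single U+E080.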

    > Of course, one could apply this process between the Unicode bit
    > strings and UTFs of other widths. And the same thing applies; one
    > direction would roundtrip and the other wouldn't.

    Yes. I have analyzed the consequences and the risks involved and concluded
    that they are either irrelevant or acceptable (or can be dealt with), so I
    have decided to use this approach. It suits my needs, but I think it would
    suit other people's needs as well.

    After a conversion to U8, it is possible to 'validate' the result (convert
    back and compare with the original). Any sequence of escape codepoints that
    does not roundtrip in the UTF-U8-UTF direction can be declared an 'invalid'
    or 'ill-formed' sequence of codepoints (in this context, not in the Unicode
    context). Note that all sequences obtained by U8-UTF conversion are 'valid'
    in this sense (and I think the valid sequences are precisely those). Hence,
    any 'invalid' sequence can be seen as malicious. Admittedly, an 'invalid'
    sequence can also result from concatenation, but this does not apply to
    typical scenarios, at least not to those that need to worry about it.

    Such 'validation' could be used in places where security concerns apply,
    but it is not required in all security scenarios. On the contrary, I think
    it applies to very few, and only if they actually perform the conversion
    themselves. If they simply process such sequences entirely in UTF (or any
    combination of UTFs), they remain as solid as they already are.
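
    The 'validation' described above can be sketched as a plain roundtrip
    check. Again, this is an illustration only: it assumes the range
    U+E080..U+E0FF, assumes the input holds Unicode scalar values (no
    surrogates), and the names utf32_to_u8, u8_to_utf32 and is_valid are
    hypothetical. The decoder is repeated in compact form so that the
    example runs on its own.

```python
# Assumed constants: escape code point = E000 + B for bytes 0x80..0xFF.
ESC_BASE, ESC_LO, ESC_HI = 0xE000, 0xE080, 0xE0FF


def utf32_to_u8(cps):
    """UTF -> U8: escapes turn back into raw bytes, the rest into UTF-8."""
    out = bytearray()
    for cp in cps:
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - ESC_BASE)
        else:
            out += chr(cp).encode("utf-8")  # raises for surrogates
    return bytes(out)


def _seq_len(lead):
    """Length of the UTF-8 sequence announced by a lead byte, 0 if none."""
    for lo, hi, n in ((0x00, 0x7F, 1), (0xC2, 0xDF, 2),
                      (0xE0, 0xEF, 3), (0xF0, 0xF4, 4)):
        if lo <= lead <= hi:
            return n
    return 0


def u8_to_utf32(data):
    """U8 -> UTF: invalid bytes and encoded escapes are escaped bytewise."""
    out, i = [], 0
    while i < len(data):
        n, cp = _seq_len(data[i]), None
        if n and i + n <= len(data):
            try:
                cp = ord(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                cp = None
        if cp is not None and not (ESC_LO <= cp <= ESC_HI):
            out.append(cp)
            i += n
        else:
            out.append(ESC_BASE + data[i])
            i += 1
    return out


def is_valid(cps):
    """'Valid' in this context: survives the UTF -> U8 -> UTF roundtrip."""
    cps = list(cps)
    return u8_to_utf32(utf32_to_u8(cps)) == cps


print(is_valid([0xE0FF]))          # True
print(is_valid([0xE0C3, 0xE0A9]))  # False
```

    The second call shows the concatenation case: the escapes for bytes C3
    and A9 sit next to each other, the bytes C3 A9 form a valid UTF-8
    sequence, and converting back yields U+00E9 rather than the two escapes.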

    > (I realize that some of this may duplicate what others have said

    Not really. There is a lot of confusion even about the algorithm itself and
    what it achieves. Right from the start I was assuming everyone was familiar
    with UTF-8B and therefore didn't want to start from scratch. But perhaps we
    should.

    Lars


