From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Dec 16 2004 - 10:23:09 CST
Mark Davis wrote:
> I see more of what you are trying to do; let me try to be more clear.
> Suppose that the conversion is defined in the following way, between
> Unicode strings (D29a-d, page 74) and UTFs using your proposed new
> characters, for now with private use code points U+E080..U+E0FF.
U+E080 is the first choice anyone (including my implementor) would make for
anything, and is therefore not very suitable. Also, AFAIK, U+E000..U+EDFF
are used for the EUDCs of some MBCS encodings. For the record, my choice was
U+EE80..U+EEFF.
But I'll keep the rest of the response in line with your range.
>
> U8-UTF32. To convert a Unicode 8-bit string to UTF-32:
> 1. Set the pointer to the start
> 2. If the sequence starting at the pointer is a valid UTF-8 sequence
> (checking of course to make sure it doesn't go off the end of the
> string), convert it and emit.
With one addition: if the obtained value falls into the range of the escape
codepoints (U+E080 to U+E0FF), jump to step 3. Effectively, escape the
escapes. Without this, the roundtrip is not achieved. My implementor made the
same oversight, as did some other people in this thread.
> 3. Otherwise take the byte B following the pointer, and emit
> [E000 + B].
Assuming by 'following the pointer' you meant 'at the pointer'.
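
The steps above, including the escape-the-escapes addition, can be sketched
in Python roughly as follows. This is only an illustration under stated
assumptions: the function names are hypothetical and do not come from this
thread, Python's strict UTF-8 decoder stands in for "a valid UTF-8 sequence",
and the range U+E080..U+E0FF is the one used in this discussion.

```python
ESC_BASE = 0xE000                 # escape codepoint = ESC_BASE + byte value
ESC_LO, ESC_HI = 0xE080, 0xE0FF   # escape range assumed in this thread


def _utf8_length(lead):
    """Length implied by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def u8_to_codepoints(data):
    """U8-UTF32: map arbitrary bytes to a list of codepoints (hypothetical name)."""
    out, i = [], 0
    while i < len(data):
        n = _utf8_length(data[i])
        cp = None
        if n and i + n <= len(data):          # do not run off the end
            try:
                cp = ord(data[i:i + n].decode("utf-8"))   # step 2
            except UnicodeDecodeError:
                cp = None
        if cp is not None and not (ESC_LO <= cp <= ESC_HI):
            out.append(cp)
            i += n
        else:
            # Step 3, also taken when step 2 yields an escape codepoint
            # ("escape the escapes"): emit E000 + B for the byte at the pointer.
            out.append(ESC_BASE + data[i])
            i += 1
    return out


def codepoints_to_u8(cps):
    """Reverse direction: escape codepoints become single bytes again."""
    out = bytearray()
    for cp in cps:
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - ESC_BASE)
        else:
            out += chr(cp).encode("utf-8")    # raises for lone surrogates
    return bytes(out)


# The last three bytes are the UTF-8 form of U+E080 itself.
raw = b"abc\xff\xee\x82\x80"
assert codepoints_to_u8(u8_to_codepoints(raw)) == raw
```

In this sketch, dropping the range test in step 2 would decode the last three
bytes to U+E080, which converts back to the single byte 0x80, so the U8-UTF-U8
roundtrip would be lost.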
> Of course, one could apply this process between the Unicode bit
> strings and UTFs of other widths. And the same thing applies; one
> direction would roundtrip and the other wouldn't.
Yes. I have analyzed the consequences and the risks involved and reached the
conclusion that they are either irrelevant or acceptable (or can be dealt
with), and I have decided to use this approach. It suits my needs, and I
think it would suit other people's needs as well.
After a conversion to U8, it is possible to 'validate' the result (convert it
back and compare it with the original). Any sequence of escape codepoints
that does not roundtrip in the UTF-U8-UTF direction can be declared an
'invalid' or 'ill-formed' sequence of codepoints (in this context, not in the
Unicode sense). Note that all sequences obtained by U8-UTF conversion are
'valid' in this context (and I think they are precisely the valid ones).
Hence, any 'invalid' sequence can be treated as malicious. Admittedly, an
'invalid' sequence can also result from concatenation, but that does not
arise in typical scenarios, at least not in those that need to worry about
it. Such 'validation' could be used where security concerns apply, but it is
not required in all security scenarios. On the contrary, I think it applies
to very few, and only to those that actually perform the conversion
themselves. If they simply process such sequences entirely in UTF (or any
combination of UTFs), they remain as solid as they already are.
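
One way to picture such 'validation' without running both conversions is to
look at each run of consecutive escape codepoints and check whether the bytes
it stands for would be read as a real character on the way back. The Python
sketch below is only an illustration: the function names are hypothetical,
the E000 + B mapping over U+E080..U+E0FF is the one assumed in this thread,
and Python's strict UTF-8 decoder is taken as the definition of a valid
sequence. Under those assumptions it is meant to agree with "convert back and
compare", but it is not presented as anyone's actual implementation.

```python
ESC_BASE = 0xE000                 # escape codepoint = ESC_BASE + byte value
ESC_LO, ESC_HI = 0xE080, 0xE0FF   # escape range assumed in this thread


def _run_hides_character(run):
    """True if the escaped bytes contain a valid UTF-8 sequence that would
    decode to something other than an escape codepoint."""
    for i in range(len(run)):
        for n in (2, 3, 4):       # escaped bytes are all >= 0x80, so no ASCII
            chunk = run[i:i + n]
            if len(chunk) < n:
                break
            try:
                text = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            if not all(ESC_LO <= ord(ch) <= ESC_HI for ch in text):
                return True
    return False


def is_valid_escape_usage(cps):
    """Check every maximal run of escape codepoints in a codepoint sequence."""
    run = bytearray()
    for cp in list(cps) + [None]:             # None flushes the final run
        if cp is not None and ESC_LO <= cp <= ESC_HI:
            run.append(cp - ESC_BASE)
            continue
        if _run_hides_character(bytes(run)):
            return False
        run.clear()
    return True


# Each half is fine alone; concatenated, the bytes C3 A9 read back as U+00E9.
left, right = [0xE0C3], [0xE0A9]
assert is_valid_escape_usage(left) and is_valid_escape_usage(right)
assert not is_valid_escape_usage(left + right)
```

The closing lines show the concatenation case mentioned above: two sequences
that each pass the check can fail it once joined.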
> (I realize that some of this may duplicate what others have said
Not really. There is a lot of confusion even about the algorithm itself and
what it achieves. Right from the start I was assuming everyone was familiar
with UTF-8B and therefore didn't want to start from scratch. But perhaps we
should.
Lars