From: Hans Aberg (haberg@math.su.se)
Date: Thu Jun 04 2009 - 02:51:00 CDT
On 1 Jun 2009, at 17:46, Mark Crispin wrote:
> I think that there are two obvious implementation choices:
>
> [1] Recognize the sequences for the 0x110000 - 0x7fffffff ranges,
> never generate them, and if a value in that range is encountered
> treat it as an "error" or "not in Unicode" value. This is the
> traditional IETF philosophy.
>
> [2] Strictly enforce the rules for "well formed UTF-8 byte
> sequences" on page 104 of Unicode 5.0, and reject any string which
> fails to comply (note in particular the requirements of the second
> byte).
>
> In all cases, what is generated must strictly comply with "well
> formed UTF-8 byte sequences".
>
> I have little doubt that Unicode would tend to advocate choice [2],
> but as noted above the "IETF way" would be choice [1].
>
> As a practical matter, it should not make any difference. You
> should never expect anything other than a well-formed sequence to
> work.
In the end, I decided to make my own integer-to-byte encoding, wanting
to cover negative and larger integers while keeping some fundamental
UTF-8 properties: the range 1-127 is encoded unchanged, and the sets of
leading and trailing bytes are disjoint, which admits resynchronization.
(And 0 is not mapped to '\0', though it could be.)
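
To illustrate the resynchronization property, here is a minimal Python
sketch for ordinary UTF-8 (not for my own encoding, whose details are
not given here). The function names is_trailing and resync are made up
for this example. Since trailing bytes always have the form 10xxxxxx
and leading bytes never do, a reader dropped into the middle of a
stream only needs to skip trailing bytes to find the next boundary.

```python
def is_trailing(b: int) -> bool:
    # UTF-8 trailing (continuation) bytes have the bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that does not hold a trailing byte."""
    while pos < len(data) and is_trailing(data[pos]):
        pos += 1
    return pos


# 'a' is 1 byte, the euro sign is 3 bytes (e2 82 ac), 'b' is 1 byte.
data = "a\u20acb".encode("utf-8")

# Index 2 falls inside the euro sign; skipping trailing bytes lands on 'b'.
start = resync(data, 2)
print(start, data[start:].decode("utf-8"))  # prints: 4 b
```

Any encoding that keeps the two byte sets disjoint can be resynchronized
the same way; only the test in is_trailing would change.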
But if one makes a byte code by first making an integer code and
translating it using the UTF-8 method, then it has the property that
embedded strings appear as normal UTF-8 (assuming their integer
representation is by their code points). If, further, an editor can
show all the invalid code points (invalid as Unicode, but still legal
in the extended UTF-8 sense), say by escape sequences, then all of the
byte code can be seen, which is perhaps useful for debugging purposes.
(Editors may instead simply report that they cannot parse the code as
UTF-8 and show nothing.)
So that is one possible use for an extended UTF-8 format, beyond the
Unicode range.
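
As a rough sketch of the idea, the following Python applies the UTF-8
method to integers up to 0x7fffffff, using the older 1-to-6-byte layout.
The name encode_extended and the sample "opcode" values are hypothetical,
chosen only for illustration. The sketch does not treat the surrogate
range specially, so its output is not well-formed UTF-8 in general.

```python
def encode_extended(n: int) -> bytes:
    """Encode 0 <= n <= 0x7fffffff with the 1-to-6-byte UTF-8 method."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if n < 0x80:
        return bytes([n])
    # (exclusive upper limit, sequence length, leading-byte marker)
    for limit, length, marker in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if n < limit:
            break
    out = bytearray(length)
    # Fill trailing bytes from the right, 6 payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (n & 0x3F)
        n >>= 6
    out[0] = marker | n  # the remaining high bits go into the leading byte
    return bytes(out)


# Hypothetical byte code: two "opcodes" beyond Unicode around the text "hi".
code = [0x110000, ord("h"), ord("i"), 0x7FFFFFFF]
encoded = b"".join(encode_extended(n) for n in code)

print(encoded.hex(" "))  # prints: f4 90 80 80 68 69 fd bf bf bf bf bf

# A strict decoder (choice [2] above) rejects the sequence as a whole.
try:
    encoded.decode("utf-8")
except UnicodeDecodeError:
    print("not well-formed UTF-8")

# A lenient view escapes the bytes it cannot decode and keeps the text.
print(encoded.decode("utf-8", errors="backslashreplace"))
```

The embedded "hi" stays readable as plain text, while the values beyond
0x10ffff show up only as escaped bytes, which is about what one would
want from an editor when debugging such a byte code.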
Hans