Re: Invalid code points

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jun 04 2009 - 02:51:00 CDT

Next message: William_J_G Overington: "Re: Invalid code points"

Previous message: Damon Anderson: "Re: Fonts across platforms...."
In reply to: Mark Crispin: "RE: Invalid code points"
Next in thread: Hans Aberg: "Re: Invalid code points"

On 1 Jun 2009, at 17:46, Mark Crispin wrote:

> I think that there are two obvious implementation choices:
>
> [1] Recognize the sequences for the 0x110000 - 0x7fffffff ranges,
> never generate them, and if a value in that range is encountered
> treat it as an "error" or "not in Unicode" value. This is the
> traditional IETF philosophy.
>
> [2] Strictly enforce the rules for "well formed UTF-8 byte
> sequences" on page 104 of Unicode 5.0, and reject any string which
> fails to comply (note in particular the requirements of the second
> byte).
>
> In all cases, what is generated must strictly comply with "well
> formed UTF-8 byte sequences".
>
> I have little doubt that Unicode would tend to advocate choice [2],
> but as noted above the "IETF way" would be choice [1].
>
> As a practical matter, it should not make any difference. You
> should never expect anything other than a well-formed sequence to
> work.
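
The two choices quoted above can be contrasted in a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the 5- and 6-byte forms follow the older 31-bit UTF-8 bit layout, and flagging surrogates as "not in Unicode" goes slightly beyond what the quoted text says.

```python
# Illustrative sketch only; all names below are hypothetical.
MAX_UNICODE = 0x10FFFF

# Smallest value each sequence length may encode (anything lower is overlong).
_MIN_FOR_LEN = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def _sequence_length(lead):
    """Return the sequence length implied by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0          # trailing byte found in lead position
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0              # 0xFE and 0xFF are never lead bytes


def decode_lenient(data):
    """Choice [1]: parse 1..6 byte forms and flag values outside Unicode."""
    out = []
    i = 0
    while i < len(data):
        n = _sequence_length(data[i])
        if n == 0 or i + n > len(data):
            raise ValueError(f"bad or truncated sequence at byte {i}")
        if n == 1:
            value = data[i]
        else:
            value = data[i] & (0x7F >> n)      # payload bits of the lead byte
            for b in data[i + 1:i + n]:
                if b & 0xC0 != 0x80:
                    raise ValueError(f"bad trailing byte in sequence at byte {i}")
                value = (value << 6) | (b & 0x3F)
        if value < _MIN_FOR_LEN[n]:
            raise ValueError(f"overlong form at byte {i}")
        # Assumption: surrogates are flagged together with values > 0x10FFFF.
        in_unicode = value <= MAX_UNICODE and not 0xD800 <= value <= 0xDFFF
        out.append((value, in_unicode))
        i += n
    return out


def decode_strict(data):
    """Choice [2]: reject anything that is not well-formed UTF-8."""
    return [ord(c) for c in data.decode("utf-8")]
```

For example, the bytes F4 90 80 80 (the value 0x110000) come back from decode_lenient as (0x110000, False), while decode_strict raises UnicodeDecodeError for the same input. Neither function ever produces such bytes, in line with the rule that generated output must be well formed.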

In the end, I decided to make my own integer-to-byte encoding, because I
wanted to cover negative and larger integers while keeping some
fundamental UTF-8 properties: the range 1-127 is encoded the same way,
and the sets of leading and trailing bytes are disjoint, which admits
resynchronization. (And 0 is not mapped to '\0', though it could be.)
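
The resynchronization property can be sketched for ordinary UTF-8, where trailing bytes always lie in 0x80..0xBF and leading bytes never do. This is a minimal illustration with hypothetical function names; it shows the standard UTF-8 byte classes, not the custom encoding mentioned above.

```python
def is_trailing(byte):
    """True for UTF-8 trailing (continuation) bytes, 0x80..0xBF."""
    return byte & 0xC0 == 0x80


def resync(data, pos):
    """Return the first index >= pos that does not hold a trailing byte.

    A reader dropped at an arbitrary offset can skip forward this way
    to the start of the next complete sequence (or to the end of data).
    """
    while pos < len(data) and is_trailing(data[pos]):
        pos += 1
    return pos


# 'a', U+00E9 (2 bytes), U+20AC (3 bytes)
data = b"a\xc3\xa9\xe2\x82\xac"
print(resync(data, 2))  # index 2 is a trailing byte; prints 3
```

Because the two byte classes are disjoint, the scan needs no state and never looks backwards.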

But if one makes a byte code by first making an integer code and
translating it using the UTF-8 method, then it has the property that
embedded strings appear as normal UTF-8 (assuming their integer
representation is by their code points). If, further, an editor can
show all the invalid code points (invalid, but still legal UTF-8 in the
extended sense), say by escape sequences, then all of the byte code can
be seen, which is perhaps useful for debugging purposes. (Editors may
instead simply report that they cannot parse the code as UTF-8 and
show nothing.)
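
A rough sketch of this idea, under assumptions: integers in 0..0x7FFFFFFF are written with the 1..6 byte UTF-8 bit layout, and a hypothetical show() helper stands in for the editor, escaping anything that is not printable Unicode. The names and the \x{...} escape syntax are illustrative choices, and show() trusts its input to come from encode_extended().

```python
def encode_extended(value):
    """Encode 0..0x7FFFFFFF using the 1..6 byte UTF-8 bit layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead-byte marker) for the 2..6 byte forms
    forms = [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
             (0x4000000, 0xF8), (0x80000000, 0xFC)]
    for n, (limit, marker) in enumerate(forms, start=2):
        if value < limit:
            break
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))   # low 6 bits per trailing byte
        value >>= 6
    return bytes([marker | value] + tail[::-1])


def show(data):
    """Render trusted encode_extended() output, escaping non-text values."""
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        # Sequence length is the count of leading 1 bits (1 for ASCII).
        n = 1 if lead < 0x80 else 8 - (lead ^ 0xFF).bit_length()
        value = lead if n == 1 else lead & (0x7F >> n)
        for b in data[i + 1:i + n]:
            value = (value << 6) | (b & 0x3F)
        printable = value <= 0x10FFFF and chr(value).isprintable()
        out.append(chr(value) if printable else f"\\x{{{value:X}}}")
        i += n
    return "".join(out)


# A made-up integer code with the string "hi" embedded by code point.
code = [0x110005] + [ord(c) for c in "hi"] + [0x7FFFFFFF]
data = b"".join(encode_extended(v) for v in code)
print(b"hi" in data)   # the embedded string survives as plain UTF-8
print(show(data))
```

Here the out-of-range values 0x110005 and 0x7FFFFFFF appear as escapes while the embedded text stays readable.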

So that is one possible use for an extended UTF-8 format, beyond the
Unicode range.

Hans

This archive was generated by hypermail 2.1.5 : Thu Jun 04 2009 - 02:54:08 CDT