Re: Unicode String Models

From: Hans Åberg via Unicode <unicode_at_unicode.org>
Date: Wed, 12 Sep 2018 10:37:00 +0200

> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode <unicode_at_unicode.org> wrote:
>
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode_at_unicode.org
>> From: Hans Åberg via Unicode <unicode_at_unicode.org>
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.
>
> You must use a codepoint that is not defined by Unicode, and never
> will. That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.

The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. U+0080 has seen some use in other encodings, but apparently not in Unicode itself. Alternatively, one could pick some other value or values and reserve them for this special purpose.
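
To make the proposal concrete, here is a minimal Python sketch of the mapping described above: each byte that is not part of valid UTF-8 becomes a marker codepoint followed by the byte with its high bit cleared. The names MARKER, bytes_to_text, text_to_bytes and the "marker_escape" handler are made up for illustration, and U+0080 is used as the marker only because it is the value floated in this thread.

```python
import codecs

# Assumed marker codepoint, taken from the suggestion in this thread.
MARKER = "\u0080"

def _escape_invalid(err):
    # Called by the UTF-8 decoder for each run of undecodable bytes.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Each offending byte becomes MARKER + (byte with its high bit cleared).
    return "".join(MARKER + chr(b & 0x7F) for b in bad), err.end

# "marker_escape" is a hypothetical handler name registered for this sketch.
codecs.register_error("marker_escape", _escape_invalid)

def bytes_to_text(data: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes instead of failing."""
    return data.decode("utf-8", errors="marker_escape")

def text_to_bytes(text: str) -> bytes:
    """Inverse mapping: MARKER + ASCII character turns back into one high byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == MARKER and i + 1 < len(text) and ord(text[i + 1]) < 0x80:
            out.append(ord(text[i + 1]) | 0x80)  # restore the high bit
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)
```

For example, text_to_bytes(bytes_to_text(b"abc\xffz")) gives back the original bytes. The sketch is ambiguous when the input already contains a genuine U+0080 followed by an ASCII character, which is the kind of collision the quoted reply is warning about when it asks for a codepoint Unicode will never define.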

A number of other byte sequences are in use as well, such as overlong UTF-8. The original UTF-8 could then also be extended to handle all 32-bit words, including those with the high bit set.
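
The message does not say how such an extension would look, so the following Python sketch is only one possible reading. It follows the lead-byte pattern of the original UTF-8 definition (up to six bytes for 31-bit values) and assumes, purely for illustration, a seven-byte form with lead byte 0xFE for values that have bit 31 set. The function name encode_extended is hypothetical.

```python
def encode_extended(value: int) -> bytes:
    """Encode a 32-bit word using the original UTF-8 byte patterns."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, total length in bytes, lead-byte prefix)
    forms = [
        (1 << 11, 2, 0xC0),
        (1 << 16, 3, 0xE0),
        (1 << 21, 4, 0xF0),
        (1 << 26, 5, 0xF8),
        (1 << 31, 6, 0xFC),
        (1 << 36, 7, 0xFE),  # assumption: not part of any UTF-8 definition
    ]
    for limit, length, prefix in forms:
        if value < limit:
            break
    # Peel off six payload bits per continuation byte, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([prefix | value] + tail[::-1])
```

With these assumptions, encode_extended(0xE5) yields the usual two bytes C3 A5, while encode_extended(0xFFFFFFFF) yields the seven bytes FE 83 BF BF BF BF BF. Overlong UTF-8 arises from the same patterns when a value is encoded with a longer form than the shortest one that fits.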

Received on Wed Sep 12 2018 - 03:37:27 CDT
