Re: Unicode String Models

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Wed, 12 Sep 2018 01:41:03 +0200

No 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).

The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e by using a
"UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!)

This is what Java does for representing U+0000 by (0xC0,0x80) in the
compiled Bytecode or via the C/C++ interface for JNI when converting the
java string buffer into a C/C++ string terminated by a NULL byte (not part
of the Java string content itself). That special sequence however is really
exposed in the Java API as a true unsigned 16-bit code unit (char) with
value 0x0000, and a valid single code point.

The same can be done for reencoding each invalid byte in non-UTF-8
conforming texts using sequences with a "UTF-8-like" scheme (still
compatible with plain UTF-8 for every valid UTF-8 texts): you may either:
  * (a) encode each invalid byte separately (using two bytes for each), or
by encoding them by groups of 3-bits (represented using bytes 0xF8..0FF)
and then needing 3 bytes in the encoding.
  * (b) encode a private starter (e.g. 0xFF), followed by a byte for the
length of the raw bytes sequence that follows, and then the raw bytes
sequence of that length without any reencoding: this will never be confused
with other valid codepoints (however this scheme may no longer be directly
indexable from arbitrary random positions, unlike scheme a which may be
marginally longer longer)
But both schemes (a) or (b) would be useful in editors allowing to edit
arbitrary binary files as if they were plain-text, even if they contain
null bytes, or invalid UTF-8 sequences (it's up to these editors to find a
way to distinctively represent these bytes, and a way to enter/change them
reliably.

There's also a possibility of extension if the backing store uses UTF-16,
as all code units 0x0000.0xFFFF are used, but one scheme is possible by
using unpaired surrogates (notably a low surrogate NOT prefixed by a high
surrogate: the low surrogate already has 10 useful bits that can store any
raw byte value in its lowest bits): this scheme allows indexing from random
position and reliable sequencial traversal in both directions (backward or
forward)...

... But the presence of such extension of UTF-16 means that all the
implementation code handling standard text has to detect unpaired
surrogates, and can no longer assume that a low surrogate necessarily has a
high surrogate encoded just before it: it must be tested and that previous
position may be before the buffer start, causing a possibly buffer overrun
in backward direction (so the code will need to also know the start
position of the buffer and check it, or know the index which cannot be
negative), possibly exposing unrelated data and causing some security
risks, unless the backing store always adds a leading "guard" code unit set
arbitrarily to 0x0000.

Le mer. 12 sept. 2018 à 00:48, J Decker via Unicode <unicode_at_unicode.org> a
écrit :

>
>
> On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <
> unicode_at_unicode.org> wrote:
>
>>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
>> unicode_at_unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans Åberg via Unicode <unicode_at_unicode.org> wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>>
> it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80
> (I'm probably off a bit in the leading byte)
> UTF-8 can represent from 0 to 0x200000 every value; (which is all defined
> codepoints) early varients can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> wihout one bit off... 0xFF would be the leadin)
>
> 0xF8-FF are unused byte values; but those can all be encoded into utf-8.
>
Received on Tue Sep 11 2018 - 18:41:56 CDT

This archive was generated by hypermail 2.2.0 : Tue Sep 11 2018 - 18:41:57 CDT