From: Hans Aberg (haberg@math.su.se)
Date: Sun May 31 2009 - 15:39:28 CDT
On 31 May 2009, at 22:18, Doug Ewell wrote:
>>>> In particular, it would be great to know if the range
>>>> U+0080..U+009F is invalid.
>>>
>>> That bit is especially wrong. I can at least imagine why there
>>> might be confusion about the noncharacters and surrogate code
>>> points, but not the C1 controls.
>>
>> It is a bit disappointing: I was looking for a leading (escape)
>> byte sequence that signals a string isn't UTF-8, distinguishing it
>> from otherwise valid strings. But perhaps it does not matter.
>
> If you're thinking about inventing one, for your own use, then any
> byte sequence that is not valid UTF-8 should do the job. One
> possibility would be {0xA0}.
Thank you for the suggestion.
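As a sketch of Doug's suggestion: 0xA0 is a continuation byte, so it can never begin a well-formed UTF-8 sequence, and prefixing it reliably marks a string as non-UTF-8. The function and variable names below are illustrative, not from the thread:

```python
# 0xA0 is a UTF-8 continuation byte (10xxxxxx); a string beginning
# with it can never be well-formed UTF-8, so it works as a marker.
marker = b"\xa0"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Tagging any payload with the marker guarantees the whole string
# is rejected by a strict UTF-8 decoder.
tagged = marker + "hello".encode("utf-8")
```

A strict decoder rejects `tagged` immediately at the first byte, so no valid UTF-8 text can collide with a tagged string.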
> Be sure you understand the difference between an invalid *byte
> sequence* and an invalid *code point*. There are many invalid byte
> sequences in UTF-8. As Mark pointed out, the only invalid code
> points are the surrogates.
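Doug's distinction can be illustrated concretely. An invalid *byte sequence* is one no UTF-8 decoder accepts (a stray continuation byte, an overlong form); an invalid *code point* is a surrogate, which a strict decoder rejects even when it arrives in the "correct" three-byte pattern. A small check, using Python's strict codec:

```python
def decodes(data: bytes) -> bool:
    """True if data is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Invalid byte sequences: the code points they hint at are fine,
# but the bytes themselves are malformed.
assert not decodes(b"\xa0")        # lone continuation byte
assert not decodes(b"\xc0\x80")    # overlong encoding of U+0000

# Invalid code point: U+D800 is a surrogate, so even its regular
# three-byte pattern is rejected.
assert not decodes(b"\xed\xa0\x80")
```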
Yes, I am thinking about both possibilities. The idea is, in an
environment of '\0'-terminated C strings, to also pass some byte-code
objects to those programs that can parse them.
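One hypothetical way to realize this (my sketch, not a scheme from the thread): tag byte-code payloads with the invalid prefix, and require that the payload itself contains no NUL bytes, since 0x00 would truncate a C string at the terminator:

```python
# Hypothetical tagging scheme: payloads prefixed with an invalid
# UTF-8 lead byte share a '\0'-terminated channel with UTF-8 text.
MARKER = b"\xa0"

def tag_payload(payload: bytes) -> bytes:
    """Prefix a byte-code payload with the non-UTF-8 marker."""
    # A NUL inside the payload would end the C string early.
    if b"\x00" in payload:
        raise ValueError("payload may not contain NUL bytes")
    return MARKER + payload

def is_tagged(s: bytes) -> bool:
    """Distinguish tagged byte-code objects from plain text."""
    return s.startswith(MARKER)
```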
> The section of the Wikipedia article you cited actually contains
> quite a concentration of misleading information:
Yes, that is quite a mess. Strictly speaking, I think there are two
UTF-8s: one of which does not have the integer limitations imposed by
Unicode. That one could be used to convert integer sequences into
byte sequences which carry no Unicode character interpretation. So I
like to think of Unicode's UTF-8 as composed of two parts: a
natural-number-to-byte-sequence conversion, which is the real UTF-8,
and on top of that an interpretation of the natural numbers as
Unicode characters, which as such has nothing to do with the
natural-number-to-byte conversion.
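The lower layer of this two-part view can be sketched as a plain integer encoder using the UTF-8 byte layout but none of Unicode's restrictions (the original scheme reaches 31 bits in 6 bytes, and happily encodes surrogates). The function name is mine, for illustration:

```python
def encode_uint(n: int) -> bytes:
    """Encode a natural number with the UTF-8 byte layout, ignoring
    Unicode's range and well-formedness restrictions."""
    if n < 0x80:
        return bytes([n])            # 7-bit values: one byte, as-is
    for k in range(2, 7):            # try 2..6-byte sequences
        if n < 1 << (5 * k + 1):     # a k-byte form holds 5k+1 bits
            out = []
            for _ in range(k - 1):   # low bits into continuation bytes
                out.append(0x80 | (n & 0x3F))
                n >>= 6
            lead = (0xFF << (8 - k)) & 0xFF  # lead byte: k high 1-bits
            out.append(lead | n)
            return bytes(reversed(out))
    raise ValueError("value exceeds 31 bits")
```

For values Unicode admits this agrees with ordinary UTF-8 (e.g. `encode_uint(0x20AC)` matches the encoding of the euro sign), but it also produces byte sequences for surrogates and for integers beyond U+10FFFF that a strict Unicode decoder must reject — which is exactly the separation between the number-to-byte layer and the character interpretation.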
Hans
This archive was generated by hypermail 2.1.5 : Sun May 31 2009 - 15:41:44 CDT