Marco Mussini wrote on 1998-08-31 06:38 UTC:
> The first byte of any sequence_that_represents_a_character_in_UTF8
> always has the most significant bit set to zero. This makes it perfectly
> compatible with and indistinguishable from 7-bit ASCII when it is encoding
> "regular" US ASCII data.
> The second byte (if any) has the most significant bit set to 1 and the
> next N most significant bits set to 1 where N is the number of other
> bytes that will follow to end the current
> sequence_that_represents_a_character_in_UTF8.
>
> For example, if we have a two byte sequence to represent a character, we
> will have the bits as follows:
>
> 0xxxxxxx 1xxxxxxx
>
> Three-byte sequence:
>
> 0xxxxxxx 11xxxxxx 1xxxxxxx

I think you completely misunderstood UTF-8. UTF-8 looks like this
(copied from the Linux utf-8 man page):

ENCODING
       The following byte sequences are used to represent a
       character. The sequence to be used depends on the UCS code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       The xxx bit positions are filled with the bits of the
       character code number in binary representation. Only the
       shortest possible multibyte sequence which can represent
       the code number of the character can be used.

EXAMPLES
       The Unicode character 0xa9 = 1010 1001 (the copyright
       sign) is encoded in UTF-8 as

           11000010 10101001 = 0xc2 0xa9

       and character 0x2260 = 0010 0010 0110 0000 (the "not
       equal" symbol) is encoded as:

           11100010 10001001 10100000 = 0xe2 0x89 0xa0
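
To make the table above concrete, here is a minimal C sketch of an encoder.
The name utf8_encode and its interface are invented for this illustration
and are not taken from the man page. It follows the six-row table exactly
as quoted, picks the shortest sequence for each code number, and assumes
the caller supplies a buffer of at least six bytes.

```c
#include <stddef.h>

/* Hypothetical helper: encode one UCS code number (0 .. 0x7FFFFFFF)
 * into out[], using the shortest sequence from the table above.
 * Returns the number of bytes written (1..6), or 0 if ucs is out of
 * range. out must have room for 6 bytes. */
size_t utf8_encode(unsigned long ucs, unsigned char *out)
{
    size_t len, i;

    if (ucs < 0x80UL) {               /* 0xxxxxxx: plain ASCII */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    else if (ucs < 0x800UL)       len = 2;
    else if (ucs < 0x10000UL)     len = 3;
    else if (ucs < 0x200000UL)    len = 4;
    else if (ucs < 0x4000000UL)   len = 5;
    else if (ucs <= 0x7FFFFFFFUL) len = 6;
    else return 0;

    /* Fill the 10xxxxxx bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (ucs & 0x3F));
        ucs >>= 6;
    }

    /* First byte: len one-bits, a zero bit, then the remaining bits.
     * (0xFF00 >> len) & 0xFF yields 0xC0, 0xE0, 0xF0, 0xF8, 0xFC. */
    out[0] = (unsigned char)(((0xFF00 >> len) & 0xFF) | ucs);
    return len;
}
```

Calling utf8_encode(0xa9, buf) should return 2 and leave 0xc2 0xa9 in buf,
and utf8_encode(0x2260, buf) should return 3 with 0xe2 0x89 0xa0, matching
the two examples from the man page.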

The algorithms for forward and (if necessary) backward scanning are
rather obvious: The first byte of any sequence always fulfils the
condition ((c & 0xc0) != 0x80), and the last byte is identified by
having another first byte as its successor. That's all. UTF-8 is
really incredibly simple and easy to handle. Whoever looks for
alternative encodings just hasn't seen the light yet, IMHO.
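
The scanning rule just described can be sketched in a few lines of C. The
function names utf8_is_first, utf8_next and utf8_prev are made up for this
example, and the code assumes a well-formed UTF-8 buffer of known length.
It only locates character boundaries and does not validate anything.

```c
#include <stddef.h>

/* True if c is the first byte of a sequence, i.e. it is not a
 * 10xxxxxx continuation byte. */
static int utf8_is_first(unsigned char c)
{
    return (c & 0xc0) != 0x80;
}

/* Forward scan: return the index of the first byte of the character
 * that follows the one starting at index i, or len if there is none. */
size_t utf8_next(const unsigned char *s, size_t len, size_t i)
{
    if (i < len)
        i++;
    while (i < len && !utf8_is_first(s[i]))
        i++;
    return i;
}

/* Backward scan: return the index of the first byte of the character
 * that precedes index i. Returns 0 if i is already 0. */
size_t utf8_prev(const unsigned char *s, size_t i)
{
    if (i > 0)
        i--;
    while (i > 0 && !utf8_is_first(s[i]))
        i--;
    return i;
}
```

For the seven bytes 0x41 0xc2 0xa9 0xe2 0x89 0xa0 0x42, stepping forward
from index 0 with utf8_next visits indices 1, 3 and 6 before reaching the
end, and utf8_prev walks the same boundaries in reverse.
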
Markus
--
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org, home page: <http://www.cl.cam.ac.uk/~mgk25/>