Hi,
Some people here told me that thanks to the particular structure of the
UTF-8 encoding, you can look at any byte and immediately know where you
are.
A single-byte sequence_that_represents_a_character_in_UTF8 always has
the most significant bit set to zero. This makes it perfectly
compatible with, and indistinguishable from, 7-bit ASCII when it is
encoding "regular" US ASCII data.
The first byte of a multi-byte sequence has its N most significant bits
set to 1, followed by a 0, where N is the total number of bytes in the
current sequence_that_represents_a_character_in_UTF8. Every byte that
follows the first one starts with the bits 10.
For example, if we have a two-byte sequence to represent a character, we
will have the bits as follows:
110xxxxx 10xxxxxx
Three-byte sequence:
1110xxxx 10xxxxxx 10xxxxxx
Four-byte sequence:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Single-byte character:
0xxxxxxx
So if you look at a byte you can immediately tell where you are.
Going backwards one character simply requires stepping back until you
reach a byte that does not start with the bits 10.
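To make the backward scan concrete, here is a minimal sketch in Python.
It assumes standard UTF-8, where a single-byte character is 0xxxxxxx,
lead bytes are 110xxxxx, 1110xxxx or 11110xxx, and continuation bytes
are 10xxxxxx. The names classify and previous_char_start are made up
for illustration, and the code does not check that the input is
well-formed.

```python
def classify(byte):
    """Return the role of one UTF-8 byte, judged only from its top bits."""
    if (byte & 0x80) == 0x00:
        return "single"        # 0xxxxxxx: plain ASCII
    if (byte & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: second or later byte
    if (byte & 0xE0) == 0xC0:
        return "lead2"         # 110xxxxx: starts a two-byte sequence
    if (byte & 0xF0) == 0xE0:
        return "lead3"         # 1110xxxx: starts a three-byte sequence
    if (byte & 0xF8) == 0xF0:
        return "lead4"         # 11110xxx: starts a four-byte sequence
    return "invalid"           # not covered by the patterns above


def previous_char_start(data, pos):
    """Return the index where the character containing byte pos-1 starts.

    data is a bytes object and pos is an index with 0 < pos <= len(data).
    """
    if pos <= 0 or pos > len(data):
        raise ValueError("pos must satisfy 0 < pos <= len(data)")
    i = pos - 1
    # Skip backwards over continuation bytes (10xxxxxx).
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# Hypothetical sample: one-, two-, three- and four-byte characters.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
pos = len(sample)
starts = []
while pos > 0:
    pos = previous_char_start(sample, pos)
    starts.append(pos)
```

Walking the sample from the end visits the start of each character in
reverse order, and at most three continuation bytes have to be skipped
per step, so there is no need to rewind to the start of the text.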
Can you confirm this?
I read someone on this list claiming that UTF-8x (note the "x") is not
backwards scannable unless it is rewound to the start. What is UTF-8x,
and why is it not backwards scannable?
--M