Re: (M) Scanning UTF-8 backwards is possible?

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Mon Aug 31 1998 - 07:00:45 EDT


Marco Mussini wrote on 1998-08-31 06:38 UTC:
> The first byte of any sequence_that_represents_a_character_in_UTF8 has
> always the most significant bit set to zero. This makes it perfectly
> compatible with and indistinguishable from 7-bit ASCII when it is encoding
> "regular" US ASCII data.
> The second byte (if any) has the most significant bit set to 1 and the
> next N most significant bits set to 1 where N is the number of other
> bytes that will follow to end the current
> sequence_that_represents_a_character_in_UTF8.
>
> For example, if we have a two byte sequence to represent a character, we
> will have the bits as follows:
>
> 0xxxxxxx 1xxxxxxx
>
> Three-byte sequence:
>
> 0xxxxxxx 11xxxxxx 1xxxxxxx

I think you completely misunderstood UTF-8. UTF-8 looks like this
(copied from the Linux utf-8 man page):

ENCODING
       The following byte sequences are used to represent a char-
       acter. The sequence to be used depends on the UCS code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       The xxx bit positions are filled with the bits of the
       character code number in binary representation. Only the
       shortest possible multibyte sequence which can represent
       the code number of the character can be used.

EXAMPLES
       The Unicode character 0xa9 = 1010 1001 (the copyright
       sign) is encoded in UTF-8 as

              11000010 10101001 = 0xc2 0xa9

       and character 0x2260 = 0010 0010 0110 0000 (the "not
       equal" symbol) is encoded as:

              11100010 10001001 10100000 = 0xe2 0x89 0xa0
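
To make the table concrete, here is a minimal C sketch of an encoder that
follows it row by row. It is not part of the man page; the function name
utf8_encode and its interface are made up for this illustration, and the
caller is assumed to supply a buffer of at least 6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the man page: encode the UCS code c
 * (0 .. 0x7FFFFFFF) into out, which must hold at least 6 bytes.
 * Returns the number of bytes written; the range checks below pick
 * the shortest sequence, as the table requires. */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    size_t n, i;

    assert(out != NULL && c <= 0x7FFFFFFFUL);

    if (c < 0x80UL) {                 /* 0xxxxxxx: plain ASCII */
        out[0] = (unsigned char)c;
        return 1;
    }
    if      (c < 0x800UL)     n = 2;
    else if (c < 0x10000UL)   n = 3;
    else if (c < 0x200000UL)  n = 4;
    else if (c < 0x4000000UL) n = 5;
    else                      n = 6;

    /* Fill the 10xxxxxx bytes from the end, 6 bits at a time. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* First byte: n one-bits, a zero bit, then the remaining bits of c. */
    out[0] = (unsigned char)((0xFF << (8 - n)) | c);
    return n;
}
```

Called with 0xa9 this should produce the two bytes 0xc2 0xa9, and with
0x2260 the three bytes 0xe2 0x89 0xa0, matching the examples above.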

The algorithms for forward and (if necessary) backward scanning are
rather obvious: the first byte of any sequence always fulfils the
condition ((c & 0xc0) != 0x80), and the last byte is identified by
having the first byte of another sequence (or the end of the text) as
its successor. To scan backward, step back one byte at a time until
you reach a byte that fulfils this condition. That's all. UTF-8 is
really incredibly simple and easy to handle. Whoever looks for
alternative encodings just hasn't seen the light yet, IMHO.
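
For illustration, the scanning could be sketched in C roughly as follows.
The function names (utf8_is_first, utf8_next, utf8_prev) are invented for
this example, and the sketch assumes the buffer holds well-formed UTF-8;
it makes no attempt to detect malformed sequences.

```c
#include <assert.h>
#include <stddef.h>

/* True if c starts a sequence, i.e. it is not of the form 10xxxxxx. */
static int utf8_is_first(unsigned char c)
{
    return (c & 0xc0) != 0x80;
}

/* Forward: given index i of some byte in s (i < len), return the index
 * of the first byte of the following character, or len at the end. */
size_t utf8_next(const unsigned char *s, size_t len, size_t i)
{
    assert(s != NULL && i < len);
    do {
        i++;
    } while (i < len && !utf8_is_first(s[i]));
    return i;
}

/* Backward: given index i (i > 0), return the index of the first byte
 * of the character that ends just before position i. */
size_t utf8_prev(const unsigned char *s, size_t i)
{
    assert(s != NULL && i > 0);
    do {
        i--;
    } while (i > 0 && !utf8_is_first(s[i]));
    return i;
}
```

On the six bytes 'a' 0xc2 0xa9 0xe2 0x89 0xa0, stepping forward from
index 0 visits 1, 3 and 6, and stepping backward from 6 visits 3, 1
and 0, so both directions land on the same character boundaries.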

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>


