Re: UTF-8 validation rules

From: Misha.Wolf@reuters.com
Date: Mon Sep 10 2001 - 14:50:29 EDT


Carl,

You seem to be using the word "character" in some places where
you (probably) mean "byte", e.g.:

> All UTF-8 characters must be followed by the proper number of valid
> continuation characters, if any.

Misha

On 10/09/2001 18:21:48 Carl W. Brown wrote:
> I am checking out my UTF-8 validation rules to see if they are correct.
>
> Check each character to be a valid UTF-8 initial character.
>
> \x00 to \x7f or \xC2 to \xF4
>
> Allow invalid forms such as \xC0 & \xC1 to decode but consider them invalid.
>
> A first byte of \xE0 or \xF0 with a second byte less than \xA0 is also an
> invalid form.
>
> \xED followed by anything >= \xA0 is an encoded surrogate and not a valid
> character.
>
> \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
>
> Anything greater than \xF4\x80\xBF\xBF is beyond the Unicode range.
>
> All UTF-8 characters must be followed by the proper number of valid
> continuation characters, if any.
>
> Carl
>
>
>
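
To make the byte/character distinction concrete, here is a minimal sketch
of a byte-level check along the lines of the quoted rules. It is not
Carl's code. The function name is_valid_utf8 and the reject_noncharacters
flag are hypothetical, and the second-byte ranges are taken from the
Unicode Standard's table of well-formed UTF-8 byte sequences, which is an
assumption beyond what the message states. That table differs from the
quoted rules in two places: after a first byte of \xF0 the second byte
must be at least \x90 (not \xA0), and the highest valid sequence is
\xF4\x8F\xBF\xBF. The sketch also rejects ill-formed sequences outright
rather than decoding them first.

```python
def is_valid_utf8(data: bytes, reject_noncharacters: bool = False) -> bool:
    """Return True if data is well-formed UTF-8, checked byte by byte."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:              # single-byte (ASCII) sequence
            i += 1
            continue
        # need = number of continuation bytes;
        # lo..hi = allowed range of the byte right after the lead byte.
        if 0xC2 <= lead <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:            # below 0xA0 would be an overlong form
            need, lo, hi = 2, 0xA0, 0xBF
        elif lead == 0xED:            # 0xA0 and up would encode a surrogate
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:            # below 0x90 would be an overlong form
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:            # above 0x8F would exceed U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:
            # 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
            return False
        if i + need >= n:             # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):  # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        # Optional stricter check for U+FFFE and U+FFFF, which are
        # well-formed UTF-8 but are Unicode noncharacters.
        if reject_noncharacters and data[i:i + 3] in (b"\xEF\xBF\xBE",
                                                      b"\xEF\xBF\xBF"):
            return False
        i += need + 1
    return True
```

With this sketch, is_valid_utf8(b"\xC0\x80") and
is_valid_utf8(b"\xED\xA0\x80") both return False, while
is_valid_utf8(b"\xF4\x8F\xBF\xBF") returns True. Every comparison is made
on a byte, not on a character.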



