RE: UTF-8 validation rules

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 10 2001 - 15:25:52 EDT


Misha,

> You seem to be using the word "character" in some places where
> you (probably) mean "byte", eg:
>

I am getting fuzzy-headed these days. Thanks for pointing it out. It
should read:

> > I am checking out my UTF-8 validation rules to see if they are correct.
> >
> > Check each character to be a valid UTF-8 initial character.
Check each initial character byte to be a valid UTF-8 initial byte.
> >
> > \x00 to \x7f or \xC2 to \xF4
> >
> > Allow invalid forms such as \xC0 & \xC1 to decode but consider
> > them invalid.
> >
> > A first byte of \xE0 with a second byte less than \xA0, or of \xF0 with
> > a second byte less than \x90, is also an invalid form.
> >
> > \xED followed by anything >= \xA0 is an encoded surrogate and not a
> > valid character.
> >
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
> >
> > Anything greater than \xF4\x8F\xBF\xBF is beyond the Unicode range.
> >
> > All UTF-8 characters must be followed by the proper number of valid
> > continuation characters, if any.
All UTF-8 initial character bytes must be followed by the proper number of
valid continuation bytes, if any.
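
As a rough illustration, the rules above could be sketched in Python along
these lines. The function name is_valid_utf8 and the reject_noncharacters
flag are made up for this example. The second-byte bounds are taken from the
Unicode Standard's table of well-formed UTF-8 byte sequences (\xA0..\xBF
after \xE0, \x80..\x9F after \xED, \x90..\xBF after \xF0, \x80..\x8F after
\xF4). The sketch only answers valid or not; it does not decode invalid
forms such as \xC0 and \xC1, it simply rejects them.

```python
def is_valid_utf8(data: bytes, reject_noncharacters: bool = True) -> bool:
    """Hypothetical validator: True if data passes the byte-level checks."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                      # single-byte (ASCII) form
            i += 1
            continue
        # The initial byte fixes the sequence length and the range
        # allowed for the second byte (lo..hi).
        if 0xC2 <= b0 <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # below \xA0 is an overlong form
        elif b0 == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # \xA0 and up encodes a surrogate
        elif 0xE1 <= b0 <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # below \x90 is an overlong form
        elif 0xF1 <= b0 <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # above \x8F is beyond U+10FFFF
        else:
            return False                    # \x80..\xC1 or \xF5..\xFF
        if i + length > n:
            return False                    # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, length):          # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        # U+FFFE and U+FFFF, the two noncharacters named in the rules.
        if reject_noncharacters and data[i:i + 3] in (b"\xEF\xBF\xBE",
                                                      b"\xEF\xBF\xBF"):
            return False
        i += length
    return True
```

For example, is_valid_utf8(b"\xED\xA0\x80") and is_valid_utf8(b"\xC0\x80")
both return False, while is_valid_utf8(b"\xF4\x8F\xBF\xBF") returns True.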

Carl


