UTF-8 validation rules

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 10 2001 - 13:21:48 EDT


I am checking out my UTF-8 validation rules to see if they are correct.

Check that each byte that starts a character is a valid UTF-8 initial byte:

\x00 to \x7f or \xC2 to \xF4

Allow non-shortest forms such as those starting with \xC0 or \xC1 to decode,
but consider them invalid.
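
As a rough illustration of the initial-byte check, here is a minimal Python
sketch. The function name classify_lead and its (length, valid) return shape
are made up for this example. It lets \xC0 and \xC1 report a two-byte length
so a decoder can still consume the sequence, while flagging them as invalid.

```python
def classify_lead(byte):
    """Return (sequence_length, is_valid) for a candidate initial byte.

    A length of 0 means the byte cannot start a sequence at all.
    """
    if byte <= 0x7F:
        return 1, True
    if 0xC0 <= byte <= 0xDF:
        # 0xC0 and 0xC1 imply a 2-byte sequence, so it can be decoded,
        # but they can only produce non-shortest (overlong) forms.
        return 2, byte >= 0xC2
    if 0xE0 <= byte <= 0xEF:
        return 3, True
    if 0xF0 <= byte <= 0xF4:
        return 4, True
    # Continuation bytes (0x80-0xBF) and 0xF5-0xFF never start a character.
    return 0, False
```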

A first byte of \xE0 with a second byte less than \xA0, or a first byte of
\xF0 with a second byte less than \x90, is also an invalid form.
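
A small sketch of the non-shortest-form check for three- and four-byte
sequences, in Python. The helper name is hypothetical. The bounds follow from
the smallest values that need each length: U+0800 encodes as \xE0\xA0\x80 and
U+10000 encodes as \xF0\x90\x80\x80, so the lower limit on the second byte is
\xA0 after \xE0 and \x90 after \xF0.

```python
def is_overlong_start(first, second):
    """True if (first, second) can only begin a non-shortest form."""
    if first == 0xE0:
        return second < 0xA0  # would decode below U+0800
    if first == 0xF0:
        return second < 0x90  # would decode below U+10000
    return False
```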

\xED followed by anything >= \xA0 is an encoded surrogate and not a valid
character.
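
The surrogate rule can be sketched the same way. This is only an illustration
with a made-up function name. It relies on \xED\x9F\xBF being U+D7FF and
\xED\xA0\x80 being U+D800, the first surrogate.

```python
def is_encoded_surrogate(first, second):
    """True if (first, second) begins an encoded surrogate (U+D800-U+DFFF)."""
    # A valid continuation byte is at most 0xBF, so the upper bound
    # is explicit here rather than left to a later check.
    return first == 0xED and 0xA0 <= second <= 0xBF
```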

\xEF\xBF\xBE and \xEF\xBF\xBF (U+FFFE and U+FFFF) are noncharacters, not valid
Unicode characters.

Anything greater than \xF4\x8F\xBF\xBF (U+10FFFF) is beyond the Unicode range.
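
For the upper limit, note that the highest code point, U+10FFFF, encodes as
\xF4\x8F\xBF\xBF. A hedged Python sketch of the range check follows. The
function name is an assumption, and it expects a complete four-byte sequence.

```python
def exceeds_unicode_range(seq):
    """True if a 4-byte sequence would decode above U+10FFFF."""
    if len(seq) != 4:
        return False  # shorter sequences cannot reach past U+FFFF
    # Anything after F4 8F BF BF is out of range: either the lead byte
    # is above 0xF4, or it is 0xF4 with a second byte above 0x8F.
    return seq[0] > 0xF4 or (seq[0] == 0xF4 and seq[1] > 0x8F)
```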

Every UTF-8 initial byte must be followed by the proper number of valid
continuation bytes (\x80 to \xBF), if any.
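
Putting the rules together, one possible validator looks like the Python
sketch below. It is not a reference implementation. The name validate_utf8
and the reject_noncharacters flag are assumptions made for this example, and
unlike the decode-then-flag approach for \xC0 and \xC1 it simply rejects those
bytes outright.

```python
def validate_utf8(data, reject_noncharacters=True):
    """Return True if data (bytes) passes all of the rules listed above."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        # Rule: the initial byte must be 0x00-0x7F or 0xC2-0xF4.
        if lead <= 0x7F:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return False
        if i + length > n:
            return False  # sequence is truncated
        seq = data[i:i + length]
        # Rule: every trailing byte must be a continuation byte.
        if any(not 0x80 <= b <= 0xBF for b in seq[1:]):
            return False
        if length >= 3:
            second = seq[1]
            if lead == 0xE0 and second < 0xA0:
                return False  # non-shortest 3-byte form
            if lead == 0xF0 and second < 0x90:
                return False  # non-shortest 4-byte form
            if lead == 0xED and second >= 0xA0:
                return False  # encoded surrogate
            if lead == 0xF4 and second > 0x8F:
                return False  # above U+10FFFF
        # Rule: U+FFFE and U+FFFF (optional in this sketch).
        if reject_noncharacters and seq in (b"\xef\xbf\xbe", b"\xef\xbf\xbf"):
            return False
        i += length
    return True
```

For example, validate_utf8(b"\xc0\x80") and validate_utf8(b"\xed\xa0\x80")
both return False, while validate_utf8(b"\xf4\x8f\xbf\xbd") returns True.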

Carl


