RE: UTF-8 validation rules

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 10 2001 - 16:43:03 EDT


Ken,

> -----Original Message-----
> From: Kenneth Whistler [mailto:kenw@sybase.com]
> Sent: Monday, September 10, 2001 12:48 PM
> To: cbrown@xnetinc.com
> Cc: unicode@unicode.org
> Subject: Re: UTF-8 validation rules
>
>
> Carl,
>
> >
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
>
> In current parlance (see Unicode 3.1, UAX #27), these are
> "noncharacters", and you must account for the fact that
> U+1FFFE..U+1FFFF
> U+2FFFE..U+2FFFF
> U+10FFFE..U+10FFFF
>

Based on http://www.unicode.org/unicode/reports/tr27/ I added the check for
4-byte codes:

        if ((ch[1] & 0x0F) == 0x0F) /* U+nFFFE & U+nFFFF are invalid; parens needed since == binds tighter than & */
        {
                if (ch[2] == 0xBF && ch[3] >= 0xBE)
                {
                        curr_thread->status = U_ILLEGAL_CHAR_FOUND;
                        return ch - source;
                }
        }
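
For reference, the same test can be written as a standalone predicate. This is only a sketch: the function name is hypothetical, and it assumes the lead byte (F0..F4) and the three continuation bytes (0x80..0xBF) have already been validated, as the surrounding converter presumably does. In a 4-byte sequence the low four bits of the second byte carry code point bits 15..12 and the last two bytes carry bits 11..0, so U+nFFFE and U+nFFFF always end in xF BF BE or xF BF BF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original converter.
 * Returns 1 if the 4-byte UTF-8 sequence at s encodes a noncharacter of
 * the form U+nFFFE or U+nFFFF (n = 1..0x10), otherwise 0.
 * Assumes s[0] is a lead byte in F0..F4 and s[1..3] are already known
 * to be continuation bytes in 0x80..0xBF. */
static int is_plane_end_noncharacter4(const unsigned char *s)
{
    return (s[1] & 0x0F) == 0x0F   /* code point bits 15..12 are 1111   */
        && s[2] == 0xBF            /* code point bits 11..6 are 111111  */
        && s[3] >= 0xBE;           /* bits 5..0 are 111110 or 111111    */
}
```

For example, U+1FFFE is F0 9F BF BE and U+10FFFF is F4 8F BF BF, and both are flagged, while U+10000 (F0 90 80 80) is not.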

I also used the handy charts to see that I had made a calculation error. I
found that the shortest form for 4-byte codes starts at \xF0\x90\x80\x80
(U+10000) instead of \xF0\xA0\x80\x80.
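
The shortest-form boundary amounts to a range check on the second byte that depends on the lead byte. A rough sketch follows, with a hypothetical function name. The upper bound for an F4 lead byte (which caps the range at U+10FFFF) is an assumption added here for completeness; it is not something discussed above.

```c
/* Hypothetical helper, not part of the original converter.
 * Returns 1 if lead byte b0 and second byte b1 can begin a shortest-form
 * 4-byte UTF-8 sequence in the range U+10000..U+10FFFF, otherwise 0.
 * The remaining two bytes still need the usual 0x80..0xBF check. */
static int valid_4byte_prefix(unsigned char b0, unsigned char b1)
{
    switch (b0) {
    case 0xF0:
        return b1 >= 0x90 && b1 <= 0xBF; /* below 0x90 would be overlong  */
    case 0xF1: case 0xF2: case 0xF3:
        return b1 >= 0x80 && b1 <= 0xBF; /* any continuation byte is fine */
    case 0xF4:
        return b1 >= 0x80 && b1 <= 0x8F; /* above 0x8F exceeds U+10FFFF   */
    default:
        return 0;                        /* not a 4-byte lead byte        */
    }
}
```

With this check, F0 90 is accepted as the first valid prefix, F0 8F is rejected as a non-shortest form, and F0 A0 is accepted but is no longer treated as the lower bound.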

Thanks,

Carl


