Re: please review the paper for me

From: Yung-Fong Tang (ftang@netscape.com)
Date: Tue Feb 25 2003 - 17:46:52 EST

  • Next message: Francois Yergeau: "RE: please review the paper for me"

    ok, I find some problem and rewirte a little bit. Here is the rev 0.4 .
    Please give me your comment.


    The Algorithm to Valide an UTF-8 String

    The Algorithm to Validate an UTF-8 String

    Frank Yung-Fong Tang <ftang@netscape.com>
    Draft: 0.4
    Status of this document: DRAFT. NEED TO BE CHECKED and REVIEWED

    Background Information

    UTF-8 is widly used by many software these days. Many Internet Draft, RFC and W3C standards/recommendation refers to it. Many public or private software start to support it. However, a lot of people do not know how to perform a common task- how to validate an UTF-8 string.

    Applications of this Algorithm

    The implemtation of this UTF-8 validation process/function could be used to

    Consideration

    In order to consider a string is a valide UTF-8 string. The following thing need to be considered:
    1. The octects sequences should fit into the UTF-8 pattern
    2. No five and sex octets UTF-8 sequence and no unused byte
    3. No non-shortest form
    4. No suurogate characters encoded directly in 3-bytes UTF-8 sequence
    5. No more than pattern represent UCS4 value larger than U+10FFFF

    The octects sequences should fit into the UTF-8 pattern

    UTF-8 have the following byte patterns according to RFC 2279:
    Number 
    of Bytes
    UCS-4 range (hex.)
    UTF-8 octet sequence (binary)
    UTF-8 octet range (hex)
    1
    0000 0000-0000 007F
    0xxxxxxx
    0x00-0x7f
    2
    0000 0080-0000 07FF
    110xxxxx 10xxxxxx
    0xc0-0xdf 0x80-0xbf
    3
    0000 0800-0000 FFFF
    1110xxxx 10xxxxxx 10xxxxxx
    0xe0-0xef 0x80-0xbf 0x80-0xbf
    4
    0001 0000-001F FFFF
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0xf0-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf

    See the next session about the five and 6 octets sequences mentioned in RFC 2279

    No five and six octets UTF-8 sequence and no unused byte 

    Unicode 3.1 ruled out the five and six octets UTF-8 sequence as illegal although previous standard / specification such as Unciode 3.0 and RFC 2279 allow the five and six octets UTF-8 sequence. Therefore, we need to make sure those value are not in the UTF-8

    The checking is easy. If you just allow the patten in the previous session in the sequence, then these five and six octets squence won't be allowed. Another way to look at it is in the old UTF-8 definitation, the first octet of the five octect sequence is start with 0xf8-0xfb, the first octet of the five octect sequence is start with 0xfc-0xfd, and these bytes never appeared in any other places in the UTF-8 sequence. The byte 0xfe and 0xff is also never used in UTF-8 neither. Therefore, we now can conclude any sequence have a byte range 0xf8-0xfd is not a valid UTF-8 sequence.

    This was introduced into Unicode 3.1 dated May 6, 2001

    No non-shortest form

    With the byte pattern in UTF-8, it is possible to use non shortest form the encode UCS4. However, Unicode 3.1 already rule out such sequence is legal UTF-8 sequence due to the concern of security. Therefore, we could check the sequence to see any of the following pattern exist, if so, then it is not a valid UTF-8 sequence:

    Number 
    of Bytes
    UCS-4 range (hex.)
    Non-shortest form (which is illegal) UTF-8 octet sequence (binary)
    Non-shortest form (which is illegal) UTF-8 octet range (hex)
    2
    0000 0000-0000 007F
    1100000x 10xxxxxx
    0xc0-0xc1 0x80-0xbf
    3
    0000 0080-0000 07FF
    11100000 100xxxxx 10xxxxxx
    0xe0      0x80-0x9f 0x80-0xbf
    4
    0000 0800-0000 FFFF
    11110000 1000xxxx 10xxxxxx 10xxxxxx
    0xf0      0x80-0x8f 0x80-0xbf 0x80-0xbf

    Since 0xc0 and 0xc1 in UTF-8 should only be used as the first octet of a 2 octet sequence, and that always lead to a 2 octect non-shortest form of UCS4 range in 0x00-0x7f, we can conclude in a legal UTF-8 sequence, there should have no appearance of octet 0xc0-0xc1.
    Similar to the byte 0xe0, we can conclude there are no UTF-8 sequence have the sequence 0xe0 followed by a octect in the range of 0x80-0x9f. For the 4 octect sequence, we can conclude there are no UTF-8 sequence have the sequence 0xf0 followed by a octects in the range of 0x80-0x8f.

    This was introduced into Unicode 3.1 dated May 6, 2001

    No suurogate characters encoded directly in 3-bytes UTF-8 sequence

    This is similar to non-shortest form. In UTF-16, the UCS4 U+10000 - U+10FFFF is encoded with a surrogate pair - 0xD800-0xDBFF then 0xDC00-0xDFFF. In the UTF-8, they should be encoded in the 4 octects ranges instead of two three-octect UTF-8 sequence which represent the surrogate pair directly with the bit pattern shifting. Therefore, the patten in the three octect UTF-8 seqeunce which represent value 0xD800-0xDFFF should be consider illegal UTF-8

    Number 
    of Bytes
    UCS-4 range (hex.)
    Surrogate high and surrogate low range directly map to UTF-8 octet sequence (binary)
    Surrogate high and surrogate low range directly map to UTF-8 octet range (hex)
    3
    0000 D800-0000 DFFF
    11101101 101xxxxx 10xxxxxx
    0xed 0xa0-0xbf 0x80-0xbf

    Therefore, we can conclude any octet 0xed followed by a octect ranged in 0xa0-0xbf is illegal UTF-8.

    This was introduced into Unicode 3.2 dated March 27, 2002


    No more than pattern represent UCS4 value larger than U+10FFFF

    Unicode defined BMP and only plane 1 to plane 16. Therefore, there are no Unicode could be greater than U+10FFFF. Therefore, the combination of UTF-8 which lead to a UCS4 value larger than 10ffff should be consider illegal unicode. Since 4 octets UTF-8 sequence origionally could encode UCS4 value up to 0x1ffff, we need to make sure UTF-8 pattern represent 0x110000 to 0x1fffff be considered as illegal. This need to be break down to 4 sub range
    Number 
    of Bytes
    UCS-4 range (hex.)
    Illegal UTF-8 octet sequence (binary) represent UCS4 value greater than 0x10FFFF
    Illegal UTF-8 octet range (hex) represent UCS4 value greater than 0x10FFFF
    4
    0011 0000-0011 FFFF
    11110100 1001xxxx 10xxxxxx 10xxxxxx
    0xf4      0x90-0x9f 0x80-0xbf 0x80-0xbf
    4
    0012 0000-0013 FFFF
    11110100 101xxxxx 10xxxxxx 10xxxxxx
    0xf4      0xa0-0xbf 0x80-0xbf 0x80-0xbf
    4
    0014 0000-0017 FFFF
    11110101 10xxxxxx 10xxxxxx 10xxxxxx
    0xf5      0x80-0xbf 0x80-0xbf 0x80-0xbf
    4
    0018 0000-001F FFFF
    1111011x 10xxxxxx 10xxxxxx 10xxxxxx
    0xf6-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf

    Therefore, we can conclude a sequence include byte range in 0xf5-0xf7 and a two octets sequence 0xf4 followed by 0x90-0xbf always should be treated as illegal UTF-8.

    This was introduced into Unicode 3.1 dated May 6, 2001

    Summary:

    Below are the byte values which never should be seen in a valid UTF-8 string

    Bytes never used by valid UTF-8:
    Reason
    0xFE-0xFF
    Is not used by any UTF-8 octet pattern, not leading octet, neither trial octet
    0xF8-0xFB
    Represent 5 octets UTF-8 sequence which the definitation obsoleted by Unicode 3.1
    0xFC-0xFD
    Represent 6 octets UTF-8 sequence which the definitation obsoleted by Unicode 3.1
    0xC0-0xC1
    Represent non-shortest form in 2 octet sequence if followed by one 0x80-0xbf
    0xF5-0xF7
    Represent value larger than U+10FFFF in 4 octet sequence if followed by three 0x80-0xbf

    Below are the two-bytes paris which never should be seen in a valid UTF-8 string

    Two byte paris never used in UTF-8:
    Reason
    0xc0-0xf7 0x00-0x7f
    0xc0-0xf7 0xc0-0xff
    Does not match with UTF-8 pattern
    0xe0      0x80-0x9f
    Represent non-shortest form in 3 octet sequence if followed by one 0x80-0xbf
    0xf0      0x80-0x8f

    Represent non-shortest form in 4 octet sequence if followed by two 0x80-0xbf

    0xed      0xa0-0xbf

    Represent range 0xd800-0xdfff in 3 octet sequence if followed by one 0x80-0xbf

    0xf4      0x90-0xbf
    Represent value larger than U+10FFFF in 4 octet sequence if followed by one 0x80-0xbf

    Algorithm

    A regular expression algorithm

    The validation code could be implemented by regular expression if environment permit:

    We can say a sequence is UTF-8 if all the bytes match
    /^(([\0-\x7F])|
    ([\xC0-\xDF][\x80-\xBF])|
    ([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|
    ([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/
    But in the mean time it does NOT match any of the following
     ([\xC0-\xC1])|
    ([\xE0][\x80-\x9F])|
    ([\xF0][\x80-\x8F])|
    ([\xED][\xA0-\xBF])|
    ([\xF4][\x90-\xBF])|
    ([\xF5-\xF7])

    This regular expression can be simplified to
    /^(([\0-\x7F])
    |([\xC2-\xDF][\x80-\xBF])
    |((([\xE0][\xA0-\xBF])
    |([\xE1-\xEC\xEE-\xEF][\x80-\xBF])
    |([\xED][\x80-\x9F])
    )[\x80-\xBF])
    |((([\xF0][\x90-\xBF])
    |([\xF1-\xF3][\x80-\xBF])
    |([\xF4][\x80-\x8F])
    )[\x80-\xBF][\x80-\xBF])
    )*$/

    A state machine algorithm

    The validation process could also be implemented by a statemachine if the environment does not support regular expression or the regluar expression implementation is not effective enough.

    Below is the state machine diagram
    Change the state to
    Current State
    Input
    START
    A
    B
    C
    D
    E
    F
    G
    0x00-7F
    START
    ERROR
    ERROR
    ERROR
    ERROR
    ERROR
    ERROR
    ERROR
    0x80-0x8F
    ERROR
    START
    A
    A

    B
    B
    0x90-0x9F
    B
    ERROR
    0xA0-0xBF
    A
    ERROR
    0xC0-0xC1,0xF5-0xFF
    ERROR
    ERROR
    ERROR
    ERROR
    ERROR
    0xC2-0xDF
    A
    0xE0
    C
    0xE1-0xEC, 0xEE-0xEF
    B
    0xED
    D
    0xF0
    F
    0xF1-0xF3
    E
    0xF4
    G

    Sample Implementations

    A Perl Implementation of the Regular Expression Algorithm:

    #
    # The contents of this file are subject to the Netscape Public License
    # Version 1.0 (the "NPL"); you may not use this file except in
    # compliance with the NPL. You may obtain a copy of the NPL at
    # http://www.mozilla.org/NPL/
    #
    # Software distributed under the NPL is distributed on an "AS IS" basis,
    # WITHOUT WARRANTY OF ANY KIND, either express or implied. See the NPL
    # for the specific language governing rights and limitations under the
    # NPL.
    #
    # The Initial Developer of this code under the NPL is Netscape
    # Communications Corporation. Portions created by Netscape are
    # Copyright (C) 1998 Netscape Communications Corporation. All Rights
    # Reserved.
    #

    sub ValidUTF8 {
    local ( $utf8 ) = pop (@_);
    return ($utf8 =~ /^(([\0-\x7F])|([\xC2-\xDF][\x80-\xBF])|
    ((([\xE0][\xA0-\xBF])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF])|([\xED][\x80-\x9F]))[\x80-\xBF])|
    ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))[\x80-\xBF][\x80-\xBF]))*$/);
    }

    A C Implementation of the State Machine Algorithm:

    /* -*- Mode: C; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 2 -*-
    *
    * The contents of this file are subject to the Netscape Public License
    * Version 1.0 (the "NPL"); you may not use this file except in
    * compliance with the NPL. You may obtain a copy of the NPL at
    * http://www.mozilla.org/NPL/
    *
    * Software distributed under the NPL is distributed on an "AS IS" basis,
    * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the NPL
    * for the specific language governing rights and limitations under the
    * NPL.
    *
    * The Initial Developer of this code under the NPL is Netscape
    * Communications Corporation. Portions created by Netscape are
    * Copyright (C) 1998 Netscape Communications Corporation. All Rights
    * Reserved.
    */

    /*
    * Character class
    * 0x00-0x7F: 0
    * 0x80-0x8F: 1
    * 0x90-0x9f: 2
    * 0xa0-0xbf: 3
    * 0xc0-0xc1, 0xf5-0xff: 4
    * 0xc2-0xdf: 5
    * 0xe0: 6
    * 0xe1-0xec, 0xee-0xef: 7
    * 0xed: 8
    * 0xf0: 9
    * 0xf1-0xf3: 10
    * 0xf4: 11
    */
    static int byte_class_table[256] =
    {
    /* 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F */
    /* 00 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 30 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 40 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 50 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 60 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 70 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    /* 90 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    /* A0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    /* B0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    /* C0 */ 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
    /* D0 */ 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
    /* E0 */ 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 7, 7,
    /* F0 */ 9,10,10,10,11, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
    /* 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F */
    };
    /* state table */
    typedef enum {
    kSTART = 0,
    kA,
    kB,
    kC,
    kD,
    kE,
    kF,
    kG,
    kERROR,
    kNumOfStates
    } utf8_state;

    static utf8_state state_table[] = {
    /* kSTART, kA, kB, kC, kD, kE, kF, kG, kERROR */
    /* 0x00-0x7F: 0 */ kSTART, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0x80-0x8F: 1 */ kERROR, kSTART, kA, kERROR, kA, kB, kERROR, kB, kERROR,
    /* 0x90-0x9f: 2 */ kERROR, kSTART, kA, kERROR, kA, kB, kB, kERROR, kERROR,
    /* 0xa0-0xbf: 3 */ kERROR, kSTART, kA, kA, kERROR, kB, kB, kERROR, kERROR,
    /* 0xc0-0xc1, 0xf5-0xff: 4 */ kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xc2-0xdf: 5 */ kA, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xe0: 6 */ kC, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xe1-0xec, 0xee-0xef: 7 */ kB, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xed: 8 */ kD, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xf0: 9 */ kF, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xf1-0xf3: 10 */ kE, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
    /* 0xf4: 11 */ kG, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR
    };

    #define BYTE_CLASS(b) (byte_class_table[(unsigned char)b])
    #define NEXT_STATE( b, cur) (state_table[ (BYTE_CLASS(b) * kNumOfStates) + (cur)])


    int IsUTF8Text(const char* utf8, int len)
    {
    utf8_state current = kSTART;
    int i;
    const char* pt = utf8;
    for(i = 0; i < len ; i++, pt++)
    {
    current = NEXT_STATE(*pt, current);
    if (kERROR == current)
    return 0;
    }
    return (current == kSTART) ? -1 : 0;
    }

    References

    Change History:





    This archive was generated by hypermail 2.1.5 : Tue Feb 25 2003 - 18:33:58 EST