| Number of octets | UCS-4 range (hex) | UTF-8 octet sequence (binary) | UTF-8 octet range (hex) |
|---|---|---|---|
| 1 | 0000 0000-0000 007F | 0xxxxxxx | 0x00-0x7f |
| 2 | 0000 0080-0000 07FF | 110xxxxx 10xxxxxx | 0xc0-0xdf 0x80-0xbf |
| 3 | 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 0xe0-0xef 0x80-0xbf 0x80-0xbf |
| 4 | 0001 0000-001F FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 0xf0-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf |
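The bit patterns in the table above can be turned into a small encoder. The following is a minimal sketch, not a reference implementation: the function name `encode_utf8` is illustrative, and it deliberately applies only the raw patterns, so it does not yet reject surrogates or values above U+10FFFF.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one UCS-4 value using only the raw bit patterns of the table.

    Assumption: no validity checks beyond the 21-bit limit of the
    4-octet pattern (surrogates and values above U+10FFFF pass through).
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x1FFFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("value does not fit in a 4-octet sequence")
```

For example, `encode_utf8(0x20AC)` yields the three octets `e2 82 ac`, which fall in the ranges given in the last column of the 3-octet row.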
| Number of octets | UCS-4 range (hex) | Non-shortest form (illegal) UTF-8 octet sequence (binary) | Non-shortest form (illegal) UTF-8 octet range (hex) |
|---|---|---|---|
| 2 | 0000 0000-0000 007F | 1100000x 10xxxxxx | 0xc0-0xc1 0x80-0xbf |
| 3 | 0000 0080-0000 07FF | 11100000 100xxxxx 10xxxxxx | 0xe0 0x80-0x9f 0x80-0xbf |
| 4 | 0000 0800-0000 FFFF | 11110000 1000xxxx 10xxxxxx 10xxxxxx | 0xf0 0x80-0x8f 0x80-0xbf 0x80-0xbf |
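To see why non-shortest forms matter, consider a decoder that looks only at the bit patterns. The sketch below (the name `naive_decode` is illustrative) maps three different octet sequences to the same value U+002F, which is why a validator has to reject the longer ones.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one sequence by bit pattern only, with no validity checks."""
    lead = seq[0]
    if lead < 0x80:                 # 0xxxxxxx
        return lead
    if lead >> 5 == 0b110:          # 110xxxxx: one trailing octet
        cp, trailing = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:       # 1110xxxx: two trailing octets
        cp, trailing = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:      # 11110xxx: three trailing octets
        cp, trailing = lead & 0x07, 3
    else:
        raise ValueError("not a leading octet")
    for octet in seq[1:1 + trailing]:
        cp = (cp << 6) | (octet & 0x3F)
    return cp


# All of these decode to "/" (U+002F) when non-shortest forms are accepted.
assert naive_decode(b"\x2f") == 0x2F
assert naive_decode(b"\xc0\xaf") == 0x2F
assert naive_decode(b"\xe0\x80\xaf") == 0x2F
assert naive_decode(b"\xf0\x80\x80\xaf") == 0x2F
```

Only the first of these four sequences is legal UTF-8; the others match the rows of the table above.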
| Number of octets | UCS-4 range (hex) | Surrogate (high and low) range mapped directly to UTF-8 octet sequence (binary) | Surrogate (high and low) range mapped directly to UTF-8 octet range (hex) |
|---|---|---|---|
| 3 | 0000 D800-0000 DFFF | 11101101 101xxxxx 10xxxxxx | 0xed 0xa0-0xbf 0x80-0xbf |
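As an illustration, Python's built-in strict UTF-8 decoder rejects the surrogate range in the table above, while the neighbouring values U+D7FF and U+E000 are accepted. The helper name `is_strict_utf8` is an assumption made for this sketch; `"surrogatepass"` is a standard Python error handler that permits the direct mapping.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


high_surrogate = b"\xed\xa0\x80"   # direct mapping of U+D800
low_surrogate = b"\xed\xbf\xbf"    # direct mapping of U+DFFF

assert not is_strict_utf8(high_surrogate)
assert not is_strict_utf8(low_surrogate)
assert is_strict_utf8(b"\xed\x9f\xbf")   # U+D7FF, just below the range
assert is_strict_utf8(b"\xee\x80\x80")   # U+E000, just above the range

# The lenient handler shows what the rejected octets would have meant.
assert high_surrogate.decode("utf-8", "surrogatepass") == "\ud800"
```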
| Number of octets | UCS-4 range (hex) | Illegal UTF-8 octet sequence (binary) representing a UCS-4 value greater than 0x10FFFF | Illegal UTF-8 octet range (hex) representing a UCS-4 value greater than 0x10FFFF |
|---|---|---|---|
| 4 | 0011 0000-0011 FFFF | 11110100 1001xxxx 10xxxxxx 10xxxxxx | 0xf4 0x90-0x9f 0x80-0xbf 0x80-0xbf |
| 4 | 0012 0000-0013 FFFF | 11110100 101xxxxx 10xxxxxx 10xxxxxx | 0xf4 0xa0-0xbf 0x80-0xbf 0x80-0xbf |
| 4 | 0014 0000-0017 FFFF | 11110101 10xxxxxx 10xxxxxx 10xxxxxx | 0xf5 0x80-0xbf 0x80-0xbf 0x80-0xbf |
| 4 | 0018 0000-001F FFFF | 1111011x 10xxxxxx 10xxxxxx 10xxxxxx | 0xf6-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf |
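The boundaries in the table above follow directly from the 4-octet bit pattern. A short arithmetic sketch (the function name `four_octet_value` is illustrative) shows that `f4 8f bf bf` is the last legal sequence and that the first octets of each row produce the listed lower bounds.

```python
def four_octet_value(b0: int, b1: int, b2: int, b3: int) -> int:
    """Value carried by 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (no checks)."""
    return (((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))


assert four_octet_value(0xF4, 0x8F, 0xBF, 0xBF) == 0x10FFFF  # last legal value
assert four_octet_value(0xF4, 0x90, 0x80, 0x80) == 0x110000  # first illegal value
assert four_octet_value(0xF4, 0xA0, 0x80, 0x80) == 0x120000
assert four_octet_value(0xF5, 0x80, 0x80, 0x80) == 0x140000
assert four_octet_value(0xF6, 0x80, 0x80, 0x80) == 0x180000
assert four_octet_value(0xF7, 0xBF, 0xBF, 0xBF) == 0x1FFFFF  # pattern maximum
```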
| Bytes never used by valid UTF-8 | Reason |
|---|---|
| 0xFE-0xFF | Not used by any UTF-8 octet pattern, neither as a leading octet nor as a trailing octet |
| 0xF8-0xFB | Leading octet of a 5-octet UTF-8 sequence, a form whose definition was made obsolete by Unicode 3.1 |
| 0xFC-0xFD | Leading octet of a 6-octet UTF-8 sequence, a form whose definition was made obsolete by Unicode 3.1 |
| 0xC0-0xC1 | Represents a non-shortest form in a 2-octet sequence if followed by one 0x80-0xbf |
| 0xF5-0xF7 | Represents a value larger than U+10FFFF in a 4-octet sequence if followed by three 0x80-0xbf |
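Because these octets can never occur in valid UTF-8, a validator may reject input as soon as it sees one of them, regardless of context. A minimal sketch follows; the names `INVALID_OCTETS` and `first_invalid_octet` are illustrative.

```python
# 0xC0-0xC1 plus everything from 0xF5 to 0xFF, as listed in the table above.
INVALID_OCTETS = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def first_invalid_octet(data: bytes) -> int:
    """Return the index of the first never-valid octet, or -1 if none."""
    for index, octet in enumerate(data):
        if octet in INVALID_OCTETS:
            return index
    return -1
```

Note that a result of -1 does not prove the input is valid UTF-8; it only means that none of the 13 always-illegal octets is present.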
| Two-byte pairs never used in UTF-8 | Reason |
|---|---|
| 0xc0-0xf7 0x00-0x7f | Does not match any UTF-8 pattern |
| 0xe0 0x80-0x9f | Represents a non-shortest form in a 3-octet sequence if followed by one 0x80-0xbf |
| 0xf0 0x80-0x8f | Represents a non-shortest form in a 4-octet sequence if followed by two 0x80-0xbf |
| 0xed 0xa0-0xbf | Represents the range 0xd800-0xdfff in a 3-octet sequence if followed by one 0x80-0xbf |
| 0xf4 0x90-0xbf | Represents a value larger than U+10FFFF in a 4-octet sequence if followed by two 0x80-0xbf |
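Read the other way round, the forbidden pairs above say which second octets remain legal after each leading octet. The following sketch expresses that as a lookup; the function name `second_octet_range` is hypothetical, and leading octets that are never valid return `None`.

```python
from typing import Optional, Tuple


def second_octet_range(lead: int) -> Optional[Tuple[int, int]]:
    """Inclusive range allowed for the octet that follows a leading octet."""
    if lead == 0xE0:
        return (0xA0, 0xBF)   # excludes 0x80-0x9f (non-shortest form)
    if lead == 0xED:
        return (0x80, 0x9F)   # excludes 0xa0-0xbf (surrogates)
    if lead == 0xF0:
        return (0x90, 0xBF)   # excludes 0x80-0x8f (non-shortest form)
    if lead == 0xF4:
        return (0x80, 0x8F)   # excludes 0x90-0xbf (above U+10FFFF)
    if 0xC2 <= lead <= 0xF3:
        return (0x80, 0xBF)   # any trailing octet
    return None               # single octet, trailing octet, or never valid
```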
A valid UTF-8 octet sequence matches the following regular expression:
/^(([\0-\x7F])|
([\xC0-\xDF][\x80-\xBF])|
([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|
([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/
At the same time, it must NOT contain a match for any of the following:
([\xC0-\xC1])|
([\xE0][\x80-\x9F])|
([\xF0][\x80-\x8F])|
([\xED][\xA0-\xBF])|
([\xF4][\x90-\xBF])|
([\xF5-\xF7])
These two conditions can be combined and simplified to a single regular expression:
/^(([\0-\x7F])
|([\xC2-\xDF][\x80-\xBF])
|((([\xE0][\xA0-\xBF])
|([\xE1-\xEC\xEE-\xEF][\x80-\xBF])
|([\xED][\x80-\x9F])
)[\x80-\xBF])
|((([\xF0][\x90-\xBF])
|([\xF1-\xF3][\x80-\xBF])
|([\xF4][\x80-\x8F])
)[\x80-\xBF][\x80-\xBF])
)*$/
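The simplified expression can be tried out directly. The sketch below transcribes it into Python's `re` module as a bytes pattern; the names `VALID_UTF8` and `matches_utf8` are illustrative, and `fullmatch` is used in place of the `^...$` anchors because `$` in Python also matches before a trailing newline.

```python
import re

# Transcription of the simplified expression; re.VERBOSE ignores the layout
# whitespace, and non-capturing groups replace the original parentheses.
VALID_UTF8 = re.compile(
    rb"""(?:
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | (?: \xE0[\xA0-\xBF]
          | [\xE1-\xEC\xEE-\xEF][\x80-\xBF]
          | \xED[\x80-\x9F]
        )[\x80-\xBF]
      | (?: \xF0[\x90-\xBF]
          | [\xF1-\xF3][\x80-\xBF]
          | \xF4[\x80-\x8F]
        )[\x80-\xBF][\x80-\xBF]
    )*""",
    re.VERBOSE,
)


def matches_utf8(data: bytes) -> bool:
    """Return True if the whole input matches the simplified expression."""
    return VALID_UTF8.fullmatch(data) is not None
```

For instance, `matches_utf8(b"\xe2\x82\xac")` is true, while `matches_utf8(b"\xc0\x80")` and `matches_utf8(b"\xed\xa0\x80")` are both false.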
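State transition table. Each cell gives the state to change to for the given input octet (row) and current state (column).

| Input | START | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|---|
| 0x00-0x7F | START | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0x80-0x8F | ERROR | START | A | ERROR | A | B | ERROR | B |
| 0x90-0x9F | ERROR | START | A | ERROR | A | B | B | ERROR |
| 0xA0-0xBF | ERROR | START | A | A | ERROR | B | B | ERROR |
| 0xC0-0xC1, 0xF5-0xFF | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xC2-0xDF | A | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xE0 | C | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xE1-0xEC, 0xEE-0xEF | B | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xED | D | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xF0 | F | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xF1-0xF3 | E | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |
| 0xF4 | G | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR | ERROR |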
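The state table above can be sketched as a small table-driven validator. This is an illustrative sketch under two assumptions not spelled out in the table: every combination without a listed transition leads to ERROR, and input is accepted only if the machine is back in START after the last octet. The names `next_state`, `TRAILING` and `is_valid_utf8` are made up for this example.

```python
START, A, B, C, D, E, F, G, ERROR = range(9)

# For each non-START state: (lowest octet, highest octet, next state).
TRAILING = {
    A: (0x80, 0xBF, START),  # last trailing octet
    B: (0x80, 0xBF, A),      # two trailing octets left
    C: (0xA0, 0xBF, A),      # after 0xE0: reject non-shortest forms
    D: (0x80, 0x9F, A),      # after 0xED: reject surrogates
    E: (0x80, 0xBF, B),      # after 0xF1-0xF3
    F: (0x90, 0xBF, B),      # after 0xF0: reject non-shortest forms
    G: (0x80, 0x8F, B),      # after 0xF4: reject values above U+10FFFF
}


def next_state(state: int, octet: int) -> int:
    """Look up one transition of the state table."""
    if state == ERROR:
        return ERROR
    if state == START:
        if octet <= 0x7F:
            return START
        if 0xC2 <= octet <= 0xDF:
            return A
        if octet == 0xE0:
            return C
        if octet == 0xED:
            return D
        if 0xE1 <= octet <= 0xEF:   # 0xED was handled above
            return B
        if octet == 0xF0:
            return F
        if 0xF1 <= octet <= 0xF3:
            return E
        if octet == 0xF4:
            return G
        return ERROR                # 0x80-0xC1 and 0xF5-0xFF
    low, high, target = TRAILING[state]
    return target if low <= octet <= high else ERROR


def is_valid_utf8(data: bytes) -> bool:
    """Run the state machine over the input, one octet at a time."""
    state = START
    for octet in data:
        state = next_state(state, octet)
        if state == ERROR:
            return False
    return state == START           # a truncated sequence is not valid
```

Unlike the regular expression, this form processes one octet at a time with a single integer of state, so it also suits streaming input.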