Re: UTF-8 ill-formed question from Ian Clifton on 2012-12-11 (Unicode Mail List Archive)

From: Ian Clifton <ian.clifton_at_chem.ox.ac.uk>
Date: Tue, 11 Dec 2012 20:59:20 +0000

From: James Lin <James_Lin_at_symantec.com>
> Hi
> Does anyone know why ill-form occurred on the UTF-8? besides it
> doesn't follow the pattern of UTF-8 byte-sequences, i just
> wondering how or why?

There’s a lot about the conditions for the well‐formedness of UTF-8
sequences in Chapter 3 of the Standard:

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

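Basically, a header byte starting with 𝑛 1-bits (2 ≤ 𝑛 ≤ 4) and a 0-bit
must be followed by 𝑛−1 trailer bytes starting 10…, and that’s the only
place such trailer bytes should occur. Even if these conditions hold,
however, a UTF-8 sequence might still be ill‐formed. Overlong encodings
(header bytes C0 and C1, E0 followed by 80..9F, F0 followed by 80..8F),
encoded surrogates (ED followed by A0..BF) and values above U+10FFFF
(F4 followed by 90..BF, header bytes F5..F7) are all excluded. Table 3-7
exhaustively lists the well‐formed byte sequences; anything not in the
table is ill‐formed.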
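
A minimal sketch of how the rows of Table 3-7 could be turned into a
checker, written in Python. The names is_well_formed_utf8 and _ROWS are
illustrative only and do not come from the Standard; the byte ranges are
the ones given in the table.

```python
# Each row: (lead byte low, lead byte high, allowed range of the
# second byte, number of trailer bytes). Trailer bytes after the
# second one are always in 80..BF.
_ROWS = [
    (0x00, 0x7F, None,         0),  # U+0000..U+007F
    (0xC2, 0xDF, (0x80, 0xBF), 1),  # U+0080..U+07FF
    (0xE0, 0xE0, (0xA0, 0xBF), 2),  # U+0800..U+0FFF (no overlongs)
    (0xE1, 0xEC, (0x80, 0xBF), 2),  # U+1000..U+CFFF
    (0xED, 0xED, (0x80, 0x9F), 2),  # U+D000..U+D7FF (no surrogates)
    (0xEE, 0xEF, (0x80, 0xBF), 2),  # U+E000..U+FFFF
    (0xF0, 0xF0, (0x90, 0xBF), 3),  # U+10000..U+3FFFF (no overlongs)
    (0xF1, 0xF3, (0x80, 0xBF), 3),  # U+40000..U+FFFFF
    (0xF4, 0xF4, (0x80, 0x8F), 3),  # U+100000..U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences from Table 3-7."""
    i = 0
    while i < len(data):
        lead = data[i]
        for lo, hi, second, ntrail in _ROWS:
            if lo <= lead <= hi:
                break
        else:
            # C0, C1, F5..FF, or a trailer byte 80..BF with no header.
            return False
        trail = data[i + 1 : i + 1 + ntrail]
        if len(trail) != ntrail:
            return False  # sequence is truncated
        for k, b in enumerate(trail):
            blo, bhi = second if k == 0 else (0x80, 0xBF)
            if not blo <= b <= bhi:
                return False
        i += 1 + ntrail
    return True
```

For example, is_well_formed_utf8(b"\xe2\x82\xac") accepts the encoding
of U+20AC, while b"\xc0\xaf" (an overlong "/") and b"\xed\xa0\x80" (an
encoded surrogate) are rejected even though both follow the header and
trailer bit pattern.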

-- 
Ian ◎

Received on Tue Dec 11 2012 - 15:01:13 CST
