You are correct: there is a typo in those lines. The boundary should be 0x800
instead of 0x400, so the 2-byte range ends at 0x7FF rather than 0x3FF. Thank
you for bringing it to our attention.
Mark
P.S. A nice way to remember the number of payload bits in each form of UTF-8
is that it is 5 bits per byte plus 1, plus another 1 in the case of the
single-byte form. That is, the 1-byte form gives you 7 bits, the 2-byte form
gives you 11 bits, the 3-byte form gives 16, and the 4-byte form gives 21.
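
The rule of thumb above can be checked with a short sketch. The function
names utf8_payload_bits and utf8_length are hypothetical, chosen only for
this illustration, and the code assumes the 1- to 4-byte forms with the
surrogate range excluded, as in the ranges quoted below.

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte UTF-8 form: 5 per byte, plus 1,
    plus 1 more for the single-byte form."""
    if not 1 <= n_bytes <= 4:
        raise ValueError("this sketch covers only the 1- to 4-byte forms")
    return 5 * n_bytes + 1 + (1 if n_bytes == 1 else 0)


def utf8_length(code_point: int) -> int:
    """Number of bytes needed to encode a code point, derived from the
    payload sizes above (hypothetical helper, not a library API)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are excluded")
    for n_bytes in range(1, 5):
        # A value fits in the n-byte form if it needs no more bits
        # than that form carries.
        if code_point < (1 << utf8_payload_bits(n_bytes)):
            return n_bytes
    raise AssertionError("unreachable for values up to 0x10FFFF")


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "byte(s):", utf8_payload_bits(n), "bits")
    # The boundaries land at 0x80, 0x800 and 0x10000.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(hex(cp), "->", utf8_length(cp), "byte(s)")
```

Because 2 to the 11th power is 0x800, the 2-byte form ends at 0x7FF and the
3-byte form starts at 0x800, which matches the corrected ranges.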
Masahiko Maedera wrote:
> Dear, Mr. Mark Davis.
>
> Now I have found something wrong in the technical report 17.
>
> http://www.unicode.org/unicode/reports/tr17/
>
> > UTF-8 provides a good example:
> > ...
> > 0x80..0x3FF ---> 2 bytes
> > 0x400..0xD7FF, 0xE000..0xFFFF ---> 3 bytes
> > ...
>
> but, in the RFC 2279 UTF-8, the below is described.
>
> > 0000 0080-0000 07FF 110xxxxx 10xxxxxx
> > 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx ( excluding surrogate )
>
> Should it be modified as the following?
>
> > 0x80..0x7FF ---> 2 bytes
> > 0x800..0xD7FF, 0xE000..0xFFFF ---> 3 bytes
>
> Best regards,
> Masahiko