Re: UTF-8 validation rules

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Sun Sep 09 2001 - 19:22:20 EDT


-----BEGIN PGP SIGNED MESSAGE-----

Kenneth Whistler wrote:
> Carl,
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
>
> In current parlance (see Unicode 3.1, UAX #27), these are
> "noncharacters", and you must account for the fact that
> U+1FFFE..U+1FFFF
> U+2FFFE..U+2FFFF
> ...
> U+10FFFE..U+10FFFF
>
> all have the same status as noncharacters.
>
> With Unicode 3.2 (in the works), the 32 additional code points
> at U+FDD0..U+FDEF go from unallocated status to noncharacters
> as well.

U+FDD0..U+FDEF are already non-characters in Unicode 3.1 (see D7b in UAX #27).
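
For concreteness, the full set discussed above (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) can be tested with a few comparisons. This is only a sketch; the function name is_noncharacter is made up for illustration and is not taken from any standard or library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the noncharacter code points.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF: the 32 contiguous noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the last two code points of every plane
    # (n = 0..0x10), so only the low 16 bits need checking.
    return (cp & 0xFFFE) == 0xFFFE
```

Counting over the whole code space gives 32 + 2 * 17 = 66 such code points.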

Carl W. Brown wrote:
| ... It seems like an interesting range for non-characters.

It's for Arabic presentation forms internal to a rendering implementation,
I assume (although it's not clear why existing private-use characters
couldn't have been used for that).

Kenneth Whistler wrote:
> UTF-8 (and UTF-16 and UTF-32) convertors must allow the conversion
> of noncharacter code points, but may then allow the detection of
> their noncharacter status.

Where does the standard say that conversion of these code points must
be allowed? That would make it impossible to strictly comply with both
Unicode 3.1 and ISO/IEC 10646-1:2000, since the latter says that U+FFFE
and U+FFFF (but not other non-characters) are illegal in UTF-8 and must
be rejected.

As far as I understand, according to Unicode 3.1, non-characters may be
*either* converted or rejected.
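
To make the difference between the readings concrete, here is a rough sketch of a decoder with a selectable policy. The function name decode_utf8 and the policy names are invented for this example, and the "reject_ffff" policy reflects only my reading of ISO/IEC 10646-1:2000 as described above, not a quotation from it. The sketch relies on Python 3's built-in UTF-8 codec, which rejects ill-formed sequences but passes non-characters through.

```python
POLICIES = ("allow", "reject_ffff", "reject_all")  # hypothetical names

def decode_utf8(data: bytes, policy: str = "allow") -> str:
    """Decode UTF-8, optionally rejecting non-characters.

    allow       : convert non-characters like any other code point
    reject_ffff : reject only U+FFFE and U+FFFF (assumed 10646 reading)
    reject_all  : reject every non-character
    """
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    # Ill-formed UTF-8 raises UnicodeDecodeError here, under any policy.
    text = data.decode("utf-8")
    if policy == "allow":
        return text
    for ch in text:
        cp = ord(ch)
        if policy == "reject_ffff":
            bad = cp in (0xFFFE, 0xFFFF)
        else:
            # U+FDD0..U+FDEF, or the last two code points of any plane.
            bad = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        if bad:
            raise ValueError("non-character U+%04X rejected" % cp)
    return text
```

With this sketch, the bytes \xEF\xBF\xBE decode to U+FFFE under "allow" and raise ValueError under either of the other two policies, while the encoding of U+1FFFE is rejected only under "reject_all".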

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBO5v5gTkCAxeYt5gVAQGuKwf/QIrfzIcrbxhUiTH3MTZVIn92UfXv6g7L
HNXdK7Dt4eBauBNf8Dx3d9ZfLIEBFL2BobMoSbclLPyyWv/5tVKc4W1U3TOXvc9m
xxAEVEgaW4pJKG63TKERANaf1xDfIlyIQk+APNMxLzlwUN9I0ENKV5d91BHp8F9y
lj5OGBWHRzjZwbtPT+Y9/Bx5/8l9+6jp4ZtFPrqFe9q7QCAg9+WTY1L3FdYgQiDK
/jtl8y2cPG0jHQ/DQul6spnZPZqEItDbfLeaDCu9minCcQ4Lscb9n+kayOQV/S0D
kVQbgIB9q7KXmYlY0CsYtNnRfARFS59yGwYnoVc352ZPS8OALoE12g==
=tVxi
-----END PGP SIGNATURE-----


