From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 20 2005 - 10:04:26 CDT
From: "Dean Snyder" <dean.snyder@jhu.edu>
> By the way, can you indeed tell us what the "unique status" of the code
> unit 0xDF02 is? And if it has one, why it is not spelled out in the
> standard?
It is in the standard:
* the code unit 0xDF02 is a surrogate.
* the codepoint U+DF02 is permanently a non-character:
* there's no assigned character at U+DF02, and it will never be assigned to
a character by ISO/IEC 10646-1 or Unicode, because it is already bound to a
non-character.
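
As a rough illustration of the points above, here is a small Python sketch
that classifies the value 0xDF02. The helper names is_high_surrogate and
is_low_surrogate are made up for this example; the range boundaries are the
surrogate ranges from the standard, and unicodedata reports the general
category that the Unicode Character Database gives the codepoint (Cs,
"Surrogate").

```python
import unicodedata

# Hypothetical helpers, named for this example only.
def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is in the high (lead) surrogate range."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is in the low (trail) surrogate range."""
    return 0xDC00 <= unit <= 0xDFFF

unit = 0xDF02
print(is_high_surrogate(unit), is_low_surrogate(unit))

# General category of the codepoint with the same numeric value.
print(unicodedata.category(chr(unit)))
```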
Unicode works at the character level only, and only for plain text. Code
units are only part of serialization mechanisms to interchange text data in
memory or across systems. Code units are not plain-text, and even a file
encoded with UTF-16 code units is not necessarily plain-text, as it may
decode into a stream of codepoints not assigned to characters (i.e.
<reserved> until further assignment, or <non-character> like the surrogates
or U+FFFE and U+FFFF).
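
To make this concrete, a minimal Python sketch follows, assuming
little-endian UTF-16 without a BOM. The two code units are chosen for
illustration only: 0x0041 decodes to an assigned character, while 0xFFFF
decodes without error to a codepoint that is not assigned to any character.

```python
import unicodedata

# Two 16-bit code units, serialized as UTF-16LE bytes.
units = [0x0041, 0xFFFF]
data = b"".join(u.to_bytes(2, "little") for u in units)

# The byte stream is well-formed UTF-16, so decoding succeeds.
text = data.decode("utf-16-le")

# "Lu" is an uppercase letter; "Cn" means not assigned to a character.
categories = [unicodedata.category(ch) for ch in text]
print(categories)
```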
An application handling plain-text at the codepoint level will then never
see any codepoint whose value is 0xDF02. If this happens, there's a serious
bug in the (de)serialization routines that perform I/O over streams of code
units or of bytes (with encoding schemes): these routines are then
non-conforming.
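
A sketch of what such a deserialization routine might look like, in Python.
The function name decode_utf16_units is hypothetical, and this is only one
possible policy (raising an error); it is not taken from the standard's
text. It combines surrogate pairs into supplementary codepoints and refuses
to hand an unpaired surrogate such as 0xDF02 to the caller.

```python
def decode_utf16_units(units):
    """Decode an iterable of 16-bit code units into a list of codepoints.

    Raises ValueError on an unpaired surrogate, so the caller never sees
    a codepoint in the range 0xD800..0xDFFF.
    """
    out = []
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            low = next(it, None)
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate 0x{u:04X}")
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate 0x{u:04X}")
        else:
            out.append(u)
    return out

# As the second half of a pair, 0xDF02 contributes to U+10302.
print([f"U+{cp:04X}" for cp in decode_utf16_units([0xD800, 0xDF02])])
```

On its own, as in decode_utf16_units([0xDF02]), the same code unit raises
ValueError instead of surfacing as a codepoint.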
(By contrast, PUA codepoints are assigned as Unicode characters.)