From: Dean Snyder (dean.snyder@jhu.edu)
Date: Fri May 20 2005 - 10:32:58 CDT
Philippe Verdy wrote at 5:04 PM on Friday, May 20, 2005:
>From: "Dean Snyder" <dean.snyder@jhu.edu>
>> By the way, can you indeed tell us what the "unique status" of the code
>> unit 0xDF02 is? And if it has one, why it is not spelled out in the
>> standard?
>
>It is in the standard:
>* the code unit 0xDF02 is a surrogate.
>* the codepoint U+DF02 is permanently a non-character:
>* there's no assigned character on U+DF02, it will never be assigned to
>a character by ISO/IEC 10646-1 or Unicode, because it is already bound to a
>non-character.
This does not define any unique status for 0xDF02; instead it defines a
status that 0xDF02 shares with all the other 1023 low surrogates. A
strange definition of "unique" indeed.
The interpretation of 0xDF02 is context-bound and that, by definition,
makes its "status" multiple, and therefore non-unique. Contrary to what
Ken has implied ["In UTF-16, 0xD800 does not set a "state" which then
alters the interpretation of a subsequent code unit"], the
interpretation of 0xDF02 IS directly influenced by its preceding high
surrogate. To put it another way, it is only the COMBINATIONS of high
and low surrogates that yield unique results.
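To illustrate, here is a minimal Python sketch of the standard UTF-16 surrogate-pair arithmetic. The function name combine_surrogates is hypothetical, chosen only for this example. It shows the same low surrogate 0xDF02 yielding different code points depending on the high surrogate that precedes it.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration; it applies the usual formula
    0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The same low surrogate, preceded by two different high surrogates.
print(hex(combine_surrogates(0xD800, 0xDF02)))  # 0x10302
print(hex(combine_surrogates(0xD801, 0xDF02)))  # 0x10702
```

Taken alone, 0xDF02 contributes only the low ten bits (0x302); the rest of the code point comes from the high surrogate.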
Leaving out the BOM, the interpretations of all non-surrogate code units
in a UTF-16 text stream are context-free; the interpretations of all
surrogate code units in the same stream are context-bound. That is why I
am referring to surrogates as a stateful encoding mechanism, and subject
to fragment fragility.
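What I mean by fragment fragility can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample text, the split offsets, and the helper name decode_fragment are all hypothetical, and the stream is taken to be big-endian UTF-16 without a BOM. A fragment that begins on a non-surrogate code unit or on a high surrogate decodes on its own; a fragment that begins on the low surrogate does not.

```python
def decode_fragment(fragment):
    """Try to decode a fragment of a UTF-16BE byte stream by itself.

    Hypothetical helper: returns the decoded text, or None if the
    fragment cannot be interpreted without its missing context.
    """
    try:
        return fragment.decode("utf-16-be")
    except UnicodeDecodeError:
        return None


# "a", then U+10302 (code units 0xD800 0xDF02), then "b".
stream = "a\U00010302b".encode("utf-16-be")

# Split before the surrogate pair: the tail still decodes.
print(decode_fragment(stream[2:]) is not None)  # True

# Split between the high and the low surrogate: the tail begins with
# 0xDF02, which has no interpretation without its high surrogate.
print(decode_fragment(stream[4:]) is None)  # True
```

The head of the second split, which ends on the lone high surrogate 0xD800, fails to decode in the same way.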
Dean A. Snyder
Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218
office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/