From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 07 2008 - 15:02:04 CDT
Doug Ewell responded:
> But they do exist as ***code points***. TUS is clear there too, in
> definitions D9 and D10.
Correct.
>
> I'd like to wait for Ken or Mark or somebody to issue a bull on this.
*hehe*
> I
> think I gave the correct answer to the question Jeroen asked, and you
> are giving the correct answer for the question you think Jeroen really
> meant to ask.
I think Addison answered the question well, actually. There isn't
a whole lot to add to that, but I'll maunder on, anyway...
Jeroen's follow-up question was:
> OK, and when you have them together in a surrogate pair, do you call it a
> pair of code units or can you also call them a pair of code points?
The way to think about this clearly is to specify the *context* in
which you "have a surrogate pair".
If you are talking about a UTF-16 string, then what that string consists
of (if well-formed) is a sequence of UTF-16 code *units*. In that
context:
<0041 D840 DC45 0041>
is a sequence of 4 UTF-16 code units, two of which constitute a
well-formed surrogate pair.
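
To make the count concrete, here is a minimal sketch in Python (the language is chosen only for illustration, and the helper name `utf16_code_units` is made up). It encodes a three-character string as big-endian UTF-16 and lists the resulting 16-bit code units:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units that encode `text` (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    # Every UTF-16 code unit is exactly two bytes wide.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

units = utf16_code_units("A\U00020045A")
print([f"{u:04X}" for u in units])  # ['0041', 'D840', 'DC45', '0041']
print(len(units))                   # 4
```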
That UTF-16 string can be *interpreted* as a sequence of Unicode
code points, namely:
<0041, 20045, 0041>
and from the standard, we know that the code point value 0041
(or U+0041) represents LATIN CAPITAL LETTER A and the code point
value 20045 (or U+20045) represents CJK UNIFIED IDEOGRAPH-20045.
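
The interpretation step can be sketched by hand. The following Python is an illustration only (the function name `code_points_from_utf16` is an assumption, not anything from the standard); it applies the usual surrogate-pair arithmetic and assumes the caller wants an error on ill-formed input:

```python
import unicodedata

def code_points_from_utf16(units):
    """Interpret a sequence of UTF-16 code units as Unicode code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only meaningful together with a following low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate {unit:04X}")
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate {unit:04X}")
        else:
            result.append(unit)  # a BMP code unit maps to itself
            i += 1
    return result

cps = code_points_from_utf16([0x0041, 0xD840, 0xDC45, 0x0041])
print([f"{cp:04X}" for cp in cps])      # ['0041', '20045', '0041']
print(unicodedata.name(chr(0x20045)))   # CJK UNIFIED IDEOGRAPH-20045
```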
In the context of your UTF-16 string, by the definition of UTF-16
(D91), the isolated code unit value of D840 by itself can*not*
be interpreted as a Unicode code point -- it is only part of the
surrogate pair, i.e. part of a sequence of bits that together
represent U+20045.
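
One way to see this in practice, as a hedged sketch: Python's built-in UTF-16 decoder refuses an isolated surrogate code unit. The helper name `is_well_formed_utf16` is hypothetical, and it simply defers to the codec rather than implementing the check itself:

```python
def is_well_formed_utf16(units):
    """Ask Python's big-endian UTF-16 decoder whether the code units are well-formed."""
    data = b"".join(u.to_bytes(2, "big") for u in units)
    try:
        data.decode("utf-16-be")
    except UnicodeDecodeError:
        return False
    return True

print(is_well_formed_utf16([0xD840, 0xDC45]))  # True: a complete surrogate pair
print(is_well_formed_utf16([0xD840]))          # False: D840 by itself
print(is_well_formed_utf16([0xD840, 0x0041]))  # False: D840 not followed by a low surrogate
```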
However, outside of the context of your UTF-16 string, and
considered in the context of the architecture of the overall
standard, U+D840 certainly *is* a code point. It has a designated
function, but that function requires that it never be assigned
an abstract character.
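
As a small illustration of that status (a sketch using Python's `unicodedata` module, not anything the standard prescribes), the character database reports U+D840 with the General_Category value reserved for surrogate code points:

```python
import unicodedata

lone = chr(0xD840)  # Python lets you refer to the code point itself
category = unicodedata.category(lone)
print(f"U+{ord(lone):04X}", category)  # U+D840 Cs
```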
When you are talking about UTF-16 strings, however, you are
best off simply ignoring the status of U+D840 in the overall
architecture. In UTF-16, D840 by itself is no more meaningful than
would be a BF byte value by itself in UTF-8.
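
The UTF-8 side of that comparison can be sketched the same way. This is an illustration only, with a made-up helper name `decodes_as_utf8`; it relies on Python's built-in UTF-8 decoder:

```python
def decodes_as_utf8(data):
    """Return True if Python's UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(decodes_as_utf8(b"\xC3\xBF"))  # True: C3 BF together encode U+00FF
print(decodes_as_utf8(b"\xBF"))      # False: a trail byte with no lead byte
```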
--Ken