From: Phillips, Addison (addison@amazon.com)
Date: Fri Jul 04 2008 - 10:31:43 CDT
See Section 3.8 in the standard:
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G2212
In my experience, it is a lot clearer to folks if you do not refer to surrogate code points as anything other than reserved. UTF-16 uses code units to encode Unicode code points.
Formally, the code points in Unicode run from 0 through 0x10FFFF, so the surrogate code points are code points. However, the code points from U+D800 through U+DFFF are reserved and do not encode characters. Section 3.9 says:
"Each encoding form maps the Unicode code points U+0000..U+D7FF and
U+E000..U+10FFFF to unique code unit sequences."
So, the surrogate pair (of code units) encodes a code point (U+20045 in your example).
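To make the arithmetic concrete, here is a rough sketch in Python of how a code point maps to UTF-16 code units. The function names are made up for illustration and are not from the standard; the sketch assumes the input is a plain integer code point.

```python
from typing import List


def is_surrogate(cp: int) -> bool:
    """True for the reserved surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate code point."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def to_utf16_code_units(cp: int) -> List[int]:
    """Map one Unicode scalar value to its UTF-16 code unit sequence."""
    if not is_scalar_value(cp):
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        # BMP code points are encoded as a single 16-bit code unit.
        return [cp]
    # Supplementary code points are encoded as a surrogate pair.
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return [high, low]


units = to_utf16_code_units(0x20045)
print([f"{u:04X}" for u in units])      # ['D840', 'DC45']
```

The two values printed are code units. Together they encode the single code point U+20045, and passing either of them back in as a code point is rejected because surrogate code points are not scalar values.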
Addison
Addison Phillips
Globalization Architect -- Lab126
Internationalization is not a feature.
It is an architecture.
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Jeroen Ruigrok van der Werven
> Sent: Friday, July 04, 2008 12:09 AM
> To: Doug Ewell
> Cc: Unicode Mailing List
> Subject: Re: UTF-16 clarification needed
>
> -On [20080704 08:47], Doug Ewell (dewell@roadrunner.com) wrote:
> >They are both UTF-16 code units and code points. They are not Unicode
> >scalar values.
>
> OK, and when you have them together in a surrogate pair, do you call it a
> pair of code units or can you also call them a pair of code points?
>
> --
> Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
> イェルーン ラウフロック ヴァン デル ウェルヴェン
> http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
> A wise man that walks in the dark with a blindfold on, is not much of a
> wise man...