I like Ken's definition. It does no good to introduce another concept such as "code unit", since most
developers and users are used to "code point" meaning a 16-bit code value.
Regards,
Jianping.
Mark Davis wrote:
> One of the very few times I have to correct Ken:
>
> D841 is a code unit in UTF-16
> DF00 is a code unit in UTF-16
> 10300 is a code point (aka scalar value) in the Unicode codespace. It is represented by the code units:
>
>   F0 90 8C 80 in UTF-8 (four 8-bit units)
>   D800 DF00 in UTF-16 (two 16-bit units)
>   00010300 in UTF-32 (one 32-bit unit)
>
> [from my handy dandy code converter at http://www.macchiato.com/mark/UnicodeConverter]
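
The three representations listed above can be reproduced with a short sketch. It uses only Python's built-in codecs; the helper name code_units is an assumption made for this illustration and is not part of any converter mentioned in the thread.

```python
def code_units(cp):
    """Return the code units for one code point in each encoding form, as hex strings."""
    ch = chr(cp)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
    utf32 = ch.encode("utf-32-be")   # big-endian, no byte order mark
    return {
        # UTF-8 code units are 8 bits wide: one byte each.
        "UTF-8": [f"{b:02X}" for b in utf8],
        # UTF-16 code units are 16 bits wide: two bytes each.
        "UTF-16": [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)],
        # UTF-32 code units are 32 bits wide: four bytes each.
        "UTF-32": [utf32[i:i + 4].hex().upper() for i in range(0, len(utf32), 4)],
    }


if __name__ == "__main__":
    for form, units in code_units(0x10300).items():
        print(form, " ".join(units))
```

The point of the sketch is the distinction drawn in this message: 10300 is a single code point, while the number and width of the code units depend on the encoding form.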
>
> Ken is right that a code point will only have a name if it is assigned.
>
> Mark
>
> Kenneth Whistler wrote:
>
> > Viranga asked:
> >
> > >       I have 4 questions about character names:
> >
> > Mark Davis, John Jenkins, and Markus Scherer addressed many of these
> > questions. And I do suggest you take a look at the ICU implementations,
> > so you don't have to reinvent the wheel here.
> >
> > I just have a couple clarifications of terminology for you.
> >
> > >
> > >       (1) how does one figure out the character names of the code points
> > >           (in ranges in the UnicodeData.txt file)?
> >
> > "code points" do not have character names in the Unicode Standard.
> >
> > The thing that gets an associated character name is an "encoded character."
> >
> > This may seem like a quibble, but it actually becomes important when you
> > consider surrogate code points.
> >
> > 00C0 is a code point in the Unicode codespace.
> >
> > The abstract character "capital A with a grave accent" is encoded at
> > that code point (00C0).
> >
> > The encoded character U+00C0 has the normative character name "LATIN CAPITAL
> > LETTER A WITH GRAVE".
> >
> > Now for surrogates:
> >
> > D841 is a code point in the Unicode codespace.
> > DF00 is a code point in the Unicode codespace.
> > 10300 is a code point in the Unicode codespace.
> >
> > D841 and DF00 are surrogate Unicode values. They cannot be assigned to
> > abstract characters (individually), and because no encoded character is
> > ever associated with them (individually), they also have no character
> > names.
> >
> > The abstract character "the first letter of the Etruscan alphabet" will soon
> > be encoded at the code point, 10300.
> >
> > That encoded character U-00010300 will have the normative character name
> > "ETRUSCAN LETTER A".
> >
> > In the encoding form, UTF-16, U-00010300 ETRUSCAN LETTER A is represented
> > by the surrogate pair D800 DF00 (a sequence of two 16-bit Unicode values).
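
The mapping between a supplementary code point and its UTF-16 surrogate pair is plain arithmetic, sketched below. The function names are assumptions chosen for this illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(scalar):
    """Split a code point in 10000..10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only code points above FFFF use surrogate pairs")
    offset = scalar - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10300)
    print(f"{high:04X} {low:04X}")
    print(f"{from_surrogate_pair(0xD841, 0xDF00):X}")
```

For 10300 the first function yields D800 and DF00, matching the UTF-16 code units Mark Davis lists above. Going the other way, the pair D841 DF00 decodes to 20700.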
> >
> > >
> > >           ...and also for the private use ranges
> > >               (which we'll probably be needing).
> >
> > As John Jenkins pointed out, private use code points also have no
> > character names.
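
The point that an encoded character has a name while surrogate and private use code points do not can be checked with a minimal sketch. It assumes Python's standard unicodedata module, which reflects whatever version of the Unicode Character Database ships with the interpreter; the helper name character_name is made up for this illustration.

```python
import unicodedata


def character_name(cp):
    """Return the character name at a code point, or None if it has none."""
    # unicodedata.name() returns the supplied default when no name exists.
    return unicodedata.name(chr(cp), None)


if __name__ == "__main__":
    for cp in (0x00C0, 0xD841, 0xDF00, 0xE000):
        print(f"{cp:04X}", character_name(cp))
```

Here 00C0 reports LATIN CAPITAL LETTER A WITH GRAVE, while the surrogate code points D841 and DF00 and the private use code point E000 report None.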
> >
> > >       (2) how do I locate the ISO/IEC character naming guidelines?
> > >           I looked in "The Unicode Standard Version 3.0" and it refers
> > >           me to Informative Annex K of ISO/IEC 10646.  Is the information
> > >           available electronically?  I looked at the ISO site and it said
> > >           that "there is no electronic access to the contents of ISO
> > >           standards" (http://www.iso.ch/infoe/faq.htm#Standards).  It did
> > >           mention that this was in the pipeline, but didn't say when.
> >
> > You have to buy the standard from ISO or a national standards body to
> > get the official thing. SC2 is working on getting an online version
> > available, but there are problems regarding which version of the standard
> > it will be.
> >
> > >       (3) when surrogates are introduced, will there be mappings from
> > >           surrogate pairs to character names?   Will they be included
> > >           in later versions of UnicodeData.txt?
> >
> > I concur with Mark Davis here. It is most likely that UnicodeData.txt will
> > simply be extended to use 5-digit Unicode scalar value representations of
> > encoded characters from Planes 1, 2, and 14, once they are added to the
> > standard.
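
A reader of UnicodeData.txt that parses the first field as a hexadecimal number handles 4-digit and 5-digit values alike, and can also collapse the First/Last line pairs that mark ranges. The following is only a sketch under assumptions: the SAMPLE lines are illustrative and should be checked against the real file, and parse_unicode_data is a hypothetical helper, not the ICU implementation recommended earlier.

```python
# Illustrative lines in the semicolon-separated layout of UnicodeData.txt.
SAMPLE = """\
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
"""


def parse_unicode_data(text):
    """Return (first, last, name_or_range_label) triples from UnicodeData-style text."""
    entries = []
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)          # works for 4-, 5-, or 6-digit values
        name = fields[1]
        if name.endswith(", First>"):
            range_start = cp             # remember where the range begins
        elif name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            entries.append((range_start, cp, label))
            range_start = None
        else:
            entries.append((cp, cp, name))
    return entries


if __name__ == "__main__":
    for first, last, name in parse_unicode_data(SAMPLE):
        print(f"{first:04X}..{last:04X} {name}")
```

A range entry such as the private use block carries only a label for the whole range, not a character name for each code point in it.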
> >
> > >       (4) why are they called "character names" and not "code point names"?
> >
> > See the explanation above.
> >
> > --Ken Whistler