Mike,
>
> Here is a way of representing the abstract character itself, using its
> scalar value:
> * in Unicode notation: U-00212B
ISO/IEC 10646-1:2000, Clause 6.5 Identifiers for characters (derived from
Amendment 9), specifies the following syntax for the "short identifier":
{ U | u }[ {+}xxxx | {-}xxxxxxxx ]
That implies the following options:
212B U212B u212B +212B U+212B u+212B
0000212B U0000212B u0000212B -0000212B U-0000212B u-0000212B
The editors of the Unicode Standard chose not to make use of all of those
options -- particularly the forms prefixed merely with "+" or "-", which
look confusingly like signed integers. The array of options used in
the Unicode Standard, as documented in the Notations section, is:
212B U+212B
0000212B U-0000212B
Note that the "U-" notation of (what will come to be called) the UTF-32
form always uses 8 hex digits. It is conceivable that five and/or six
digit hex forms will be introduced in the near future, since nobody really
wants to keep writing all the extra leading zeroes. But as it stands
currently, 5- or 6-digit shortened forms are not officially used in the
documentation for the standard.
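
For anyone who wants the two prefixed forms spelled out mechanically, here
is a minimal Python sketch. The function name short_identifier and its
eight_digit flag are invented for illustration and appear in neither
standard; the upper bound 0x7FFFFFFF is my assumption for the UCS-4 range.

```python
def short_identifier(scalar: int, eight_digit: bool = False) -> str:
    """Format a scalar value as U+xxxx or U-xxxxxxxx.

    Hypothetical helper: uses the 4-digit "U+" form when the value fits
    in 16 bits, otherwise (or on request) the 8-digit "U-" form.
    """
    # Assumed UCS-4 range; not taken from the message above.
    if not 0 <= scalar <= 0x7FFFFFFF:
        raise ValueError("scalar value out of range")
    if eight_digit or scalar > 0xFFFF:
        return f"U-{scalar:08X}"   # always 8 hex digits
    return f"U+{scalar:04X}"       # always 4 hex digits


print(short_identifier(0x212B))                    # U+212B
print(short_identifier(0x212B, eight_digit=True))  # U-0000212B
print(short_identifier(0x10335))                   # U-00010335
```

Note that the sketch never emits a 5- or 6-digit form, matching current
practice as described above.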
>
> In UTF-16, each 16-bit code value in the 0x0..0xC7FF range and the
                                                ^
                                                0xD7FF
> 0xD800..0xFFFF range directly corresponds to the same scalar value, while a
  ^
  0xE000
> "surrogate" pair of 16-bit code values algorithmically represents a single
> scalar value in the range 0x010000..0x10FFFF. The first half of the pair is
> always in the 0xD000..0xD7FF range, and the second half of the pair is in
                ^
                0xD800..0xDBFF
> the 0x0..0xFFFF range. Unicode 3.0 and ISO/IEC 10646-1:2000 have adopted the
      ^
      0xDC00..0xDFFF
> UTF-16 mechanism as the only official usage of the 0xD000..0xD7FF scalar
                                                     ^
                                                     0xD800..0xDFFF
> range.
>
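
With the corrected ranges (first half of the pair in 0xD800..0xDBFF, second
half in 0xDC00..0xDFFF), the surrogate arithmetic can be sketched in Python
as below. The function names to_utf16 and from_surrogates are mine, chosen
for illustration only.

```python
from typing import List


def to_utf16(scalar: int) -> List[int]:
    """Return the UTF-16 code value sequence for one scalar value."""
    if scalar < 0 or scalar > 0x10FFFF:
        raise ValueError("not representable in UTF-16")
    if 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("surrogate code values are not scalar values")
    if scalar <= 0xFFFF:
        return [scalar]                  # maps directly to one code value
    offset = scalar - 0x10000            # 20-bit offset above the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # low 10 bits -> 0xDC00..0xDFFF
    return [high, low]


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into its scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([f"0x{v:04X}" for v in to_utf16(0x10335)])  # ['0xD800', '0xDF35']
print(hex(from_surrogates(0xD800, 0xDF35)))       # 0x10335
```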
>
> Here are various ways of representing the proposed abstract character named
> "GOTHIC LETTER Q" (which will probably be assigned to the Unicode scalar
   ^
   GOTHIC LETTER QAITHRA (=Q)
> value 0x10335):
> * in Unicode notation, by its Unicode scalar value: U-010335
                                                      ^
                                                      U-00010335
> * as a UCS-4 code value sequence, in hex notation: 0x00010335
> * as a UCS-2 code value sequence: illegal; out of range
> * as a UTF-16 code value sequence, in hex notation: 0xD800 0x0336
                                                      ^
                                                      0xD800 0xDF35
> * in Unicode notation, by its Unicode value pair: U+D800 U+0336
                                                    ^
                                                    U+D800 U+DF35
> * in EBNF notation: \u212B \u0336
                      ^
                      \uD800 \uDF35
> * as a UTF-8 code value sequence, in hex notation: 0xF0 0x90 0x8C 0xB5
>
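
The corrected values in the list above can be cross-checked against the
codecs built into Python 3. This is only a quick sanity check, assuming
Python's "utf-8", "utf-16-be" and "utf-32-be" codecs; the variable names
are arbitrary.

```python
import struct

ch = chr(0x10335)  # the scalar value proposed for the Gothic letter

utf8 = ch.encode("utf-8")                              # raw UTF-8 bytes
utf16 = struct.unpack(">2H", ch.encode("utf-16-be"))   # two 16-bit code values
ucs4 = struct.unpack(">I", ch.encode("utf-32-be"))     # one 32-bit code value

print(" ".join(f"0x{b:02X}" for b in utf8))    # 0xF0 0x90 0x8C 0xB5
print(" ".join(f"0x{v:04X}" for v in utf16))   # 0xD800 0xDF35
print(" ".join(f"0x{v:08X}" for v in ucs4))    # 0x00010335
```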
Other than these fixes, this text looked quite accurate to me.
--Ken