I have a few comments on the discussion.
Surrogates (UTF-16)
- Since only private-use characters are currently encoded with
surrogates, there is no market pressure to implement them yet. It will
probably be a couple of years before surrogates are fully supported by
a variety of platforms.
- As stated, the goal of Unicode is not to encode glyphs, but
characters. Over a million possible codes is far more than enough for
this goal. Unicode is *not* designed to encode arbitrary data. If you
wanted, for example, to give each "instance of a character on paper
throughout history" its own code, you might need trillions or
quadrillions of such codes; noble as this effort might be, you would not
use Unicode for such an encoding.
- No proposed extension of UTF-16 to more than two surrogates has a
chance of being accepted into the Unicode Standard or ISO/IEC 10646.
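
To make the "over a million" figure concrete, here is a minimal Python
sketch of the surrogate-pair arithmetic. The function names are
illustrative only, not taken from any library.

```python
# Sketch of UTF-16 surrogate-pair arithmetic. Names are hypothetical.
HIGH_START = 0xD800   # high surrogates: D800..DBFF (1024 values)
LOW_START = 0xDC00    # low surrogates:  DC00..DFFF (1024 values)


def to_surrogates(cp):
    """Split a supplementary code point (10000..10FFFF) into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1024 x 1024 pairs give 1,048,576 supplementary code points; together
# with the 65,536 BMP values that is 1,114,112 codes in all.
SUPPLEMENTARY_COUNT = 1024 * 1024
TOTAL_CODE_POINTS = 0x10000 + SUPPLEMENTARY_COUNT
```

A third surrogate is never needed to reach any of these values, which is
why two 16-bit units are the ceiling of the scheme.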
Private Use Zone
- No conformant Unicode implementation can use the unencoded values
outside of the private use area. Only the values in the private use
areas (E000..F8FF, F0000..FFFFD, 100000..10FFFD) are legal for private
assignment. However, this is over 137,000 code points, which should be
more than ample for the vast majority of implementations.
[The supplementary private-use ranges are represented by surrogate pairs
with private-use high surrogates (DB80..DBFF).]
- For a particular implementation, if someone really wanted a
representation that encoded more characters in a series of 16-bit code
units, then a series of private-use characters would work. For example,
suppose you use a representation that consists of one BMP private-use
character followed by one private-use surrogate pair (i.e., three 16-bit
units). With such a representation, you can encode 6400 x 131,072
(= 838,860,800) code points.
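
As a rough illustration of the three-unit idea, the Python sketch below
packs an index into one BMP private-use unit plus one private-use
surrogate pair. The function names and the packing order are assumptions
made for the example, not part of any standard. It also assumes the
supplementary private-use range starts at F0000, which is the value that
pairs led by high surrogates DB80..DBFF decode to.

```python
# Hypothetical private packing: index -> (BMP private-use unit, high, low).
BMP_PUA_START = 0xE000
BMP_PUA_COUNT = 6400        # E000..F8FF
SUPP_PUA_START = 0xF0000    # first value reachable from high surrogate DB80
# 128 private-use high surrogates x 1024 low surrogates. This follows the
# count used in the message; a stricter scheme would skip the
# noncharacters at the end of each plane.
SUPP_PUA_COUNT = 128 * 1024
CAPACITY = BMP_PUA_COUNT * SUPP_PUA_COUNT


def encode(index):
    """Map an index in 0..CAPACITY-1 to three 16-bit code units."""
    if not 0 <= index < CAPACITY:
        raise ValueError("index out of range")
    first, second = divmod(index, SUPP_PUA_COUNT)
    offset = SUPP_PUA_START + second - 0x10000
    return (BMP_PUA_START + first,
            0xD800 + (offset >> 10),
            0xDC00 + (offset & 0x3FF))


def decode(units):
    """Invert encode(): three 16-bit code units back to the index."""
    lead, high, low = units
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return (lead - BMP_PUA_START) * SUPP_PUA_COUNT + (cp - SUPP_PUA_START)
```

For instance, encode(0) yields the units E000 DB80 DC00, and the last
index, CAPACITY - 1, yields F8FF DBFF DFFF. Every unit stays inside a
private-use range, so a conformant process that does not know the
private agreement still sees only private-use characters.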
Mark
-- business: mark.davis@us.ibm.com, mark@unicode.org personal: mark@macchiato.com, http://www.macchiato.com --
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT