Bernard Miller scripsit:
> I’m afraid I have a little bit of a beef about the
> Unicode documentation here, forgive me if this has
> already been brought up. How come UAX #27 says that
> Unicode 3.0 had 34 non characters, 32 of which are in
> supplementary planes? First of all, there are no
> characters defined in supplementary planes in Unicode
> 3.0.
Correct. However, the codepoints FFFE and FFFF in
*every* plane have been non-characters since Unicode
2.0 or even earlier. They were mentioned in ISO 10646
if not in Unicode itself.
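To make the count of 34 concrete, here is a small Python
sketch that lists the FFFE/FFFF codepoints plane by plane.
The names are my own, chosen for illustration; nothing here
comes from the standard's text.

```python
# Illustrative sketch; the names are hypothetical, not from the standard.
PLANE_COUNT = 17  # plane 0 (the BMP) plus 16 supplementary planes

def plane_end_noncharacters():
    """Return every codepoint of the form xxFFFE or xxFFFF, plane by plane."""
    result = []
    for plane in range(PLANE_COUNT):
        base = plane << 16  # first codepoint of this plane
        result.append(base | 0xFFFE)
        result.append(base | 0xFFFF)
    return result

nonchars = plane_end_noncharacters()
supplementary = [cp for cp in nonchars if cp > 0xFFFF]
print(len(nonchars), len(supplementary))  # 34 32
```

That is 2 in the BMP plus 32 in the supplementary planes,
which is where the figures in UAX #27 come from.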
> How many planes are defined in Unicode 3.1? UAX #27
> seems to indicate that it depends on what
> transformation format is used (“A process shall
> interpret the Unicode code units in accordance with
> the Unicode Transformation Format used.”). UTF-8 seems
> to only define 17 planes but UTF-32 seems to have 128
> groups of 256 planes.
There are only 17 planes, period. Code units in UTF-32
greater than 0x10FFFF are not valid codepoints.
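As a rough sketch of what that means for a UTF-32 decoder,
the range check might look like this in Python. The function
name is hypothetical, and the check covers only the upper
bound; surrogate values (D800 to DFFF) would need to be
rejected separately.

```python
# Illustrative sketch; is_code_point is a made-up name.
MAX_CODE_POINT = 0x10FFFF  # last codepoint of plane 16

def is_code_point(unit):
    """True if a 32-bit code unit lies inside the 17-plane code space."""
    return 0 <= unit <= MAX_CODE_POINT

print(is_code_point(0x10FFFF))  # True
print(is_code_point(0x110000))  # False
```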
> UAX #27 says that Unicode 3.1
> defines 3 new supplementary planes... including plane
> 14. I have difficulty with that statement.. does that
> mean that there are only 3 new planes, or that there
> are (at least) 14 new planes, but only 3 of which have
> plane names and characters in them? At least 17 planes
> must be defined in order to define the 32 non
> characters in 16 supplementary planes, that’s what
> common sense would say anyway.
Unicode 3.1 defined characters in three of the
existing 16 supplementary planes. The planes themselves
have been here since 2.0.
> This whole “plane” business suffers from a lack of
> documentation. UAX #27 talks about planes as if it’s
> ancient history but the Unicode 3.0 book does not
> mention planes once (it’s not in the index anyway). I
> would like the Unicode documentation to explain
> exactly what a plane is without requiring the 10646
> documentation which is only available for a fee. In
> fact, according to UAX #27 the planes are defined in
> terms of what WILL be in 10646-2.
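A plane is a run of 65536 consecutive Unicode scalar values
(in the terminology of Unicode 2.0) that starts on a boundary
divisible by 65536. Plane 0 is 0000 to FFFF, plane 1 is
10000 to 1FFFF, and so on up to plane 16, 100000 to 10FFFF.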
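In code, the plane of a value is just the value divided by
65536. A minimal Python sketch, with helper names invented
for illustration:

```python
# Illustrative sketch; plane_of and plane_bounds are made-up names.
PLANE_SIZE = 0x10000  # 65536 values per plane

def plane_of(cp):
    """Plane number of a codepoint: the bits above the low 16."""
    return cp >> 16

def plane_bounds(plane):
    """First and last codepoint of the given plane."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1

print(plane_of(0x1D11E))                   # 1
print([hex(x) for x in plane_bounds(14)])  # ['0xe0000', '0xeffff']
```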
> I’m trying to get a grasp on exactly how many planes
> are defined in Unicode in part because it seems to
> affect the number of non characters that are defined.
> I also want to know the maximum number of characters
> that Unicode can encode. So far I reckon there are
> 1114112 (assuming 17 planes) minus 2048 (half
> surrogates) minus 2 (special non characters) minus 32
> (“hidden” non characters) minus 32 (non characters due
> to some arbitrary association between 16 higher planes
> code values and the special non characters code
> values) = 1111998 code positions available for
> characters.
Your reasoning is sound.
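Spelled out as arithmetic, under the same assumption of 17
planes (the variable names below are mine):

```python
# Illustrative sketch of the count; the variable names are made up.
PLANES = 17
code_space = PLANES * 0x10000           # 1114112 code positions in all
surrogates = 0xDFFF - 0xD800 + 1        # 2048 surrogate code values
bmp_ffxx = 2                            # FFFE and FFFF in the BMP
fdd0_range = 0xFDEF - 0xFDD0 + 1        # 32 non-characters FDD0..FDEF
supplementary_ffxx = 16 * 2             # xxFFFE and xxFFFF in planes 1..16

available = (code_space - surrogates - bmp_ffxx
             - fdd0_range - supplementary_ffxx)
print(available)  # 1111998
```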
> What’s with this 1114111 number I’ve seen
> on this list?
I have no clue.
> BTW, it doesn’t make sense for every code position
> ending in FFFF or FFFE to be a non character.
It doesn't make much sense, but it is the rule anyway.
> Why isn’t the same rule applied to the “hidden” non
> characters, so that every code value ending in FDD0 to
> FDEF is also a non character? Is it to contribute to
> their “hidden” nature?
No. There is simply no reason to reserve them on the other planes.
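Put together, the two rules give an asymmetric test: the
FDD0 to FDEF range applies to the BMP only, while the
FFFE/FFFF rule applies to every plane. A Python sketch,
again with a function name of my own choosing:

```python
# Illustrative sketch; is_noncharacter is a made-up name.
def is_noncharacter(cp):
    """True for the non-characters described in this thread."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True  # BMP-only range, not repeated on other planes
    # Low 16 bits equal to FFFE or FFFF, in any of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE

total = sum(is_noncharacter(cp) for cp in range(0x110000))
print(total)                     # 66
print(is_noncharacter(0x1FDD0))  # False
print(is_noncharacter(0x1FFFE))  # True
```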
-- 
John Cowan       http://www.ccil.org/~cowan       cowan@ccil.org
Please leave your values        |  Check your assumptions.  In fact,
   at the front desk.           |  check your assumptions at the door.
     --sign in Paris hotel      |    --Miles Vorkosigan