From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 02 2002 - 21:21:52 EST
Christian Wittern asked:
> Leaving aside the red light that flashed in my head on the notion of
> the W3C recommending PUA (for interchange?), I was wondering about the
> notion of PUA characters being by "Unicode defaults" treated as
> ideographs. Is there a canonical reference for this?
>
> Just wondering,
Many Unicode "character" properties are actually code point
properties. They must partition the entire Unicode codespace,
so that an API can return a meaningful value for any code
point, including PUA and unassigned code points, not just
for assigned characters.
Because of this, the Unicode Standard now has a concept of
a default property value, which applies to code points that
are not otherwise given an explicit value for that property.
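
As a rough illustration of the idea (a sketch, not how any particular library implements it), a property lookup can consult a table of explicit values and fall back to the default for everything else. The names EXPLICIT_LB, DEFAULT_LB and line_break are hypothetical, and the two-entry table is only an illustrative excerpt, not the real line break data:

```python
# Hypothetical excerpt of explicit line break assignments:
# (first code point, last code point, value).
EXPLICIT_LB = [
    (0x0041, 0x005A, "AL"),  # LATIN CAPITAL LETTER A..Z
    (0x4E00, 0x9FA5, "ID"),  # CJK unified ideographs (illustrative range)
]

# Default value for code points with no explicit assignment.
DEFAULT_LB = "XX"


def line_break(cp: int) -> str:
    """Return a line break value for any code point in the codespace."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    for first, last, value in EXPLICIT_LB:
        if first <= cp <= last:
            return value
    # No explicit value: the default property value applies.
    return DEFAULT_LB
```

With this table, line_break(0x4E00) yields "ID", while a private use code point such as 0xE000 falls through to the default "XX", so the function partitions the whole codespace.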
In the case of PUA characters, the Unicode Character Database
gives them all the same properties. Some of the most important of
those properties are:
gc=Co (general category = Private_Use)
ccc=0 (combining class = 0, i.e. Not_Reordered)
bc=L (bidi class = strong Left_To_Right)
sc=Zyyy (script = Common)
lb=XX (line break = Unknown)
ea=A (east asian width = Ambiguous)
For ideographs, which also all have the same properties, the
relevant, corresponding properties are:
gc=Lo (general category = Other_Letter)
ccc=0 (combining class = 0, i.e. Not_Reordered)
bc=L (bidi class = strong Left_To_Right)
sc=Hani (script = Han)
lb=ID (line break = Ideographic)
ea=W (east asian width = Wide)
Thus, while in some respects the PUA characters are, by default,
like ideographs (they are all base characters and are treated
as left-to-right for bidi purposes), in other respects, their
properties differ.
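
The comparison can be partly checked with Python's standard unicodedata module. This is only a sketch: unicodedata does not expose the script or line break properties, so just four of the six properties listed above are compared, and the helper name props is made up for the example. U+E000 and U+4E00 stand in for the PUA and the ideographs respectively.

```python
import unicodedata


def props(ch: str) -> dict:
    """Collect the properties that unicodedata exposes for one character."""
    return {
        "gc": unicodedata.category(ch),
        "ccc": unicodedata.combining(ch),
        "bc": unicodedata.bidirectional(ch),
        "ea": unicodedata.east_asian_width(ch),
    }


pua = props("\uE000")  # first code point of the BMP private use area
han = props("\u4E00")  # first code point of the CJK Unified Ideographs block

# Property names on which the two agree, and on which they differ.
same = sorted(k for k in pua if pua[k] == han[k])
differ = sorted(k for k in pua if pua[k] != han[k])

print("same:", same)
print("differ:", differ)
```

The two agree on combining class and bidi class, and differ on general category and East Asian width.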
In particular, with respect to line-breaking, UAX #14 currently
states for lb=XX:
"The default behavior for [XX] is identical to class AL.
[i.e. alphabetic characters] ... In addition, implementations
can override or tailor this default behavior, e.g. by
assigning characters the property ID or another class, if that
is likely to give the correct default behavior for their users,
or use other means to determine the correct behavior. For example,
one implementation might treat any private use character in
ideographic context as ID, while another implementation
might support a method for assigning specific properties to
specific definitions of private use characters. The details of
such use of private use characters are outside the scope of this
standard."
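
The first tailoring mentioned in that passage (treating a private use character in ideographic context as ID) might be sketched as follows. This is one possible reading, not something UAX #14 specifies: the function names are hypothetical, "ideographic context" is assumed to mean an immediately adjacent character in the U+4E00..U+9FFF block, and the fallback is AL because that is the stated default behavior for XX.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True for characters with general category Co."""
    return unicodedata.category(ch) == "Co"


def is_ideograph(ch: str) -> bool:
    # Assumption: only the main CJK Unified Ideographs block counts.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def pua_line_break_class(text: str, index: int) -> str:
    """Resolve a line break class for the private use character at index."""
    ch = text[index]
    if not is_private_use(ch):
        raise ValueError("character at index is not private use")
    # Immediate neighbours: at most one before and one after.
    neighbours = text[max(index - 1, 0):index] + text[index + 1:index + 2]
    if any(is_ideograph(n) for n in neighbours):
        return "ID"  # tailored: behave like an ideograph
    return "AL"      # default behaviour for lb=XX


print(pua_line_break_class("\u4E2D\uE000\u6587", 1))
print(pua_line_break_class("a\uE000b", 1))
```

The second approach in the quoted text, assigning specific properties to specific private use definitions, would instead need a per-character override table supplied by whoever defines the private use assignments.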
So I'd say that the XML Core WG has got the situation only
partially correct for Unicode PUA characters.
--Ken