From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Nov 14 2003 - 15:12:08 EST
Alexandre,
> Philippe Verdy wrote:
>
> > From: "Kent Karlsson" <kentk@cs.chalmers.se>
> >
> >>Philippe Verdy wrote:
> >>
> >>> (1) a singleton (example the Angström symbol, canonically
> >>>mapped to A with diaeresis,
> >>
> >>The Ångström (note spelling) sign is canonically mapped to
> >>capital a with ring.
>
> Thanks for all explanations,
Please disregard Philippe's misleading blatherings on this
topic.
The place to start is to read Unicode Technical Report #20,
Unicode in XML and other Markup Languages (despite Philippe's
disclaimers about it).
See, in particular, Section 5 of that report, "Characters
with Compatibility Mappings", which provides a series
of recommendations for things to do and not to do for
compatibility characters in an XML context.
>
> Keeping the A with ring example, does it mean that compatibility
> characters can be identified according to Unicode charts ?
See Section 2.3, Compatibility Characters, in the Unicode Standard:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
In general, compatibility characters cannot be identified
simply by looking at the Unicode code charts. The subset
of compatibility characters known as compatibility composite
characters *can* be identified by their decompositions listed
in the names list sections of the Unicode code chart. Or you
can parse them mechanically out of the UnicodeData.txt file
in the Unicode Character Database online.
U+212B ANGSTROM SIGN *is* a compatibility character in the
first sense defined in Section 2.3 of the standard. It is
not, however, a compatibility composite character.
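As a rough illustration of the mechanical check, here is a minimal sketch in Python. It assumes that the standard `unicodedata` module is an acceptable stand-in for parsing UnicodeData.txt yourself (it exposes the same decomposition field, though from whatever Unicode version your Python ships with, not necessarily 4.0). The helper name `decomposition_kind` is made up for this example.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify one character by its decomposition mapping field.

    Hypothetical helper: returns "none", "canonical", or "compatibility".
    """
    # Same content as the decomposition field of UnicodeData.txt.
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    # Compatibility decompositions carry a tag such as <compat> or <font>.
    if field.startswith("<"):
        return "compatibility"
    # An untagged mapping is a canonical decomposition.
    return "canonical"

for ch in ("\u212B", "\u00C5", "\uFB01"):
    print("U+%04X %s: %s [%s]" % (
        ord(ch),
        unicodedata.name(ch),
        decomposition_kind(ch),
        unicodedata.decomposition(ch),
    ))
```

With this check, U+212B shows up with an untagged, single-code-point mapping (a singleton canonical decomposition), whereas a compatibility composite such as U+FB01 LATIN SMALL LIGATURE FI shows up with a tagged one.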
> For example, in the case of \u212B ANGSTROM SIGN, it is documented :
> "preferred representation is 00C5 Å latin capital letter a with ring".
>
> Is that a clear indication that \u212B is actually a compatibility
> character
No, it is not. Such comments occur regarding other characters
which may or may not be compatibility characters.
> and then should be, according to XML 1.1 recommendation,
> replaced by the \u00C5 character ?
The reason has to do with normalization. U+212B *is* a
compatibility character. It is *not* a compatibility
composite character. But the crucial factor is that it
has a singleton canonical decomposition. If you normalize
text data using Unicode normalization form NFC, as recommended
by the W3C, then U+212B will be replaced by U+00C5, as
a result of the normalization.
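To see the effect concretely, here is a small sketch, again assuming Python's standard `unicodedata` module as the normalizer. The function name `to_nfc` is just a label for this example, not anything defined by the W3C or the Unicode Standard.

```python
import unicodedata

def to_nfc(text):
    """Return text in Unicode normalization form NFC (illustrative wrapper)."""
    return unicodedata.normalize("NFC", text)

angstrom_sign = "\u212B"   # U+212B ANGSTROM SIGN
a_with_ring = "\u00C5"     # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# The singleton canonical decomposition means NFC never yields U+212B.
print(to_nfc(angstrom_sign) == a_with_ring)

# NFD decomposes fully, to A followed by U+030A COMBINING RING ABOVE.
print(unicodedata.normalize("NFD", angstrom_sign) == "A\u030A")
```

Both comparisons should hold, which is why text that has passed through NFC contains U+00C5 wherever the source had U+212B.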
This stuff *is* rather confusing for people encountering it for the
first time. But the above sources should help. Also see
the W3C working draft for the Character Model for the World Wide Web
1.0:
--Ken