Your letter makes clear that Unicode needs to do a better job of
identifying the preferred character code for many situations. The
information is there to a large extent, but buried in the fine print or in
data tables.
You will see that there is a canonical decomposition from U+212B to U+00C5.
This means that once people use Normalization in a widespread fashion, it
will become practically impossible to maintain a distinction between these
two codes.
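
As a rough illustration of the point above, here is a minimal Python sketch
using the standard unicodedata module. The constant names are my own labels,
chosen for this example only, and the check assumes the Unicode data shipped
with your Python installation.

```python
import unicodedata

# Illustrative names only; they are not defined by Unicode or by Python.
ANGSTROM_SIGN = "\u212B"   # ANGSTROM SIGN
A_WITH_RING = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE

# Before normalization the two code points compare as different strings.
distinct_before = ANGSTROM_SIGN != A_WITH_RING

# U+212B carries a canonical (untagged) decomposition to U+00C5.
raw_mapping = unicodedata.decomposition(ANGSTROM_SIGN)

# Every normalization form maps both characters to the same result,
# so the distinction does not survive normalization.
same_after = all(
    unicodedata.normalize(form, ANGSTROM_SIGN)
    == unicodedata.normalize(form, A_WITH_RING)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(distinct_before, raw_mapping, same_after)
```

Any process that normalizes text on input or before comparison would
therefore treat the two codes as the same character.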
The inclusion of U+212B is due to historical reasons.
Many other characters have been included in Unicode over the years for
legitimate purposes as compatibility characters (to allow round trip
conversion to/from important legacy character sets).
These have all been given compatibility decompositions.
Unfortunately, many characters that have legitimate uses in a legacy-free
environment have also been given compatibility mappings at some time. This
makes it very hard to use this information in its current form to decide
when a distinction between characters should be kept and when it should not.
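
To see why the data is hard to use this way, one can look at the raw
decomposition fields. The sketch below is only an illustration: the function
name classify_decomposition is hypothetical, and it assumes Python's
unicodedata module, where a compatibility mapping is marked by a leading tag
in angle brackets and a canonical mapping has no tag.

```python
import unicodedata

def classify_decomposition(ch):
    """Return 'none', 'canonical', or 'compatibility' for one character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings begin with a tag such as <compat> or <noBreak>;
    # canonical mappings are a bare list of code points.
    if mapping.startswith("<"):
        return "compatibility"
    return "canonical"

# ANGSTROM SIGN, LATIN SMALL LIGATURE FI, NO-BREAK SPACE
for ch in ("\u212B", "\uFB01", "\u00A0"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), classify_decomposition(ch))
```

The fi ligature and the no-break space both come back as "compatibility",
although only the first is there purely for legacy round-tripping. The tag
alone does not say whether the distinction is worth keeping.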
There is some very explicit guidance, however, in Unicode TR#20 (Unicode and
XML). The information there is readily applicable to other environments, if
you pay attention to the rationale for each recommendation and evaluate
whether it applies in your specific case.
A./
PS:
>"Ångström" is spelled wrong on the code charts at Unicode's home page, BTW.
Can you cite the page number and approximate location on the page? (Please
send this information to me and kenw@sybase.com, not to the whole list.)