Re: compatibility characters (in XML context)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 14 2003 - 15:32:22 EST

    From: "Alexandre Arcouteil" <lex@free.fr>

    > Philippe Verdy wrote:
    >
    > > From: "Kent Karlsson" <kentk@cs.chalmers.se>
    > >
    > >>Philippe Verdy wrote:
    > >>
    > >>> (1) a singleton (example the Angström symbol, canonically
    > >>>mapped to A with diaeresis,
    > >>
    > >>The Ångström (note spelling) sign is canonically mapped to
    > >>capital a with ring.
    >
    > Thanks for all explanations,
    >
    > Keeping the A with ring example, does it mean that compatibility
    > characters can be identified from the Unicode charts?
    >
    > For example, in the case of \u212B ANGSTROM SIGN, it is documented:
    > "preferred representation is 00C5 Å latin capital letter a with ring".
    >
    > Is that a clear indication that \u212B is actually a compatibility
    > character and should then, according to the XML 1.1 recommendation, be
    > replaced by the \u00C5 character?

    You must not replace any character directly within an intermediate XML
    processing engine unless the replacement is clearly documented in its
    interface. XML-based interfaces will generally normalize their input
    strings (to NFC or NFD), but they are not required to. Doing so, however,
    lets the engine guarantee that its outputs for canonically equivalent
    inputs will also be canonically equivalent, because after normalization
    the engine sees identical input and so produces identical output.
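
    As a minimal sketch of the "normalize on input" idea, using Python's
    standard unicodedata module (the helper name normalize_input is only an
    illustration, not part of any XML API): the three canonically equivalent
    spellings of A with ring collapse to a single string once normalized.

```python
import unicodedata

angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_with_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def normalize_input(text, form="NFC"):
    """Hypothetical input stage: normalize before any further processing."""
    return unicodedata.normalize(form, text)


spellings = (angstrom_sign, a_with_ring, decomposed)

# After normalization, all three spellings are the same string.
nfc_forms = {normalize_input(s, "NFC") for s in spellings}
nfd_forms = {normalize_input(s, "NFD") for s in spellings}

print([f"U+{ord(c):04X}" for c in nfc_forms.pop()])
print([f"U+{ord(c):04X}" for c in nfd_forms.pop()])
```

    Whatever the engine does afterwards, it can no longer tell the three
    spellings apart, so its outputs for them are necessarily identical.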

    Unicode conformance for an algorithm P that processes a string and returns
    a string just means that,
        for every two inputs A and B:
            if NFC(A)=NFC(B)
                or NFD(A)=NFD(B)
            then NFC(P(A))=NFC(P(B))
                and NFD(P(A))=NFD(P(B))
    A conforming algorithm is not required to produce normalized output.
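
    The condition above can be spot-checked for a given pair of inputs. This
    is only a sketch: the function names are invented for illustration, and
    testing one pair can refute conformance but never prove it for all inputs.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same NFD form (equivalently, the same NFC form)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def respects_equivalence(process, a, b):
    """Check the conformance condition for one pair of inputs."""
    if not canonically_equivalent(a, b):
        return True  # the condition says nothing about non-equivalent inputs
    return canonically_equivalent(process(a), process(b))


def wrap_in_tag(text):
    """Hypothetical process: passes the text through without normalizing it."""
    return "<unit>" + text + "</unit>"


def count_code_points(text):
    """Hypothetical process whose result depends on the exact code points."""
    return str(len(text))


a, b = "\u212B", "A\u030A"  # ANGSTROM SIGN vs. A + COMBINING RING ABOVE
print(respects_equivalence(wrap_in_tag, a, b))
print(respects_equivalence(count_code_points, a, b))
```

    Note that wrap_in_tag passes the check even though its output for
    U+212B is not itself in NFC; only the equivalence of outputs matters.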

    These constraints do not apply for XML conformance (normalization to NFC is
    recommended, but not required).

    So for XML, if you choose to apply or require NFC or NFD normalization, the
    only compatibility characters will be those Unicode characters that are
    canonically mapped to a singleton, and those that are canonically mapped
    to a pair but excluded from recomposition.
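
    One rough way to find such characters, assuming Python's unicodedata
    module (whose data reflects whatever Unicode version ships with the
    interpreter), is to look for characters that have a canonical mapping
    but do not survive NFC. The classify helper and its labels are purely
    illustrative; the "excluded" label here also covers the few characters
    whose decomposition does not start with a base character.

```python
import unicodedata


def canonical_mapping(ch):
    """Return the canonical decomposition mapping of ch as a list of code points."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; skip those.
    if not decomp or decomp.startswith("<"):
        return []
    return [int(cp, 16) for cp in decomp.split()]


def classify(ch):
    """Illustrative classification of a single character with respect to NFC."""
    mapping = canonical_mapping(ch)
    if not mapping:
        return "no canonical mapping"
    if unicodedata.normalize("NFC", ch) == ch:
        return "stable under NFC"
    if len(mapping) == 1:
        return "singleton"
    return "excluded from recomposition"


for ch in ("\u212B", "\u00C5", "\u0958", "A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {classify(ch)}")
```

    Here U+212B ANGSTROM SIGN is reported as a singleton, U+00C5 as stable,
    and U+0958 DEVANAGARI LETTER QA as excluded from recomposition.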


