RE: Implementing NFC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Mar 17 2007 - 17:02:48 CST


    From Doug Ewell:
    > But that list of "other processes" includes most software products on
    > the market. Very few text processors really support normalization.
    > It's disappointing how many vendors, even today, think "Unicode support"
    > means the ability to read and write files in UTF-16 or UTF-8.

    And that's why normalization is useful: only for enhancing compatibility
    with non-Unicode-conforming processes. If there were only Unicode-conforming
    processes, none of the four standard normalization forms would be needed,
    because every application would produce canonically equivalent results for
    any canonically equivalent input, so (for example) the order in which
    accents are typed in Vietnamese would not matter, as every composition form
    and every input order would generate the same result and behave the same
    everywhere.
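
    As a minimal illustration of what canonical equivalence means in practice,
    here is a small Python sketch using the standard unicodedata module (the
    Vietnamese letter and the exact code point sequences are just illustrative
    choices):

        import unicodedata

        # Three canonically equivalent spellings of Vietnamese "ế":
        precomposed = "\u1ebf"          # U+1EBF E WITH CIRCUMFLEX AND ACUTE
        partly      = "\u00ea\u0301"    # ê + combining acute
        decomposed  = "e\u0302\u0301"   # e + combining circumflex + combining acute

        # As raw code point sequences they differ...
        print(precomposed == partly == decomposed)      # False

        # ...but after normalization to a single form they compare equal.
        nfc = [unicodedata.normalize("NFC", s) for s in (precomposed, partly, decomposed)]
        print(nfc[0] == nfc[1] == nfc[2])               # True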

    In reality, IDN (which restricts the set of characters it can accept
    internally) could live without normalizing to any particular form. If IDN
    references one normalization form (in addition to other rules, like those
    for the unification of dashes and spaces or the prohibition of most
    controls), it is only in order to compute a unique domain name from any
    canonically equivalent input: it is simpler to describe the algorithm if
    you fix an intermediate normalization form.
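
    As a small sketch of that, assuming Python's built-in 'idna' codec (which
    implements IDNA 2003 nameprep, including a normalization step); the domain
    name is only an example:

        # Two canonically equivalent spellings of the same label...
        composed   = "bücher.example"
        decomposed = "bu\u0308cher.example"     # u + combining diaeresis

        # ...map to the same unique ASCII-compatible (ACE) name, because the
        # codec normalizes its input before applying Punycode.
        print(composed.encode("idna"))                                  # b'xn--bcher-kva.example'
        print(composed.encode("idna") == decomposed.encode("idna"))    # True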

    Remember that IDN does not produce strings encoded in any UTF: the strings
    it generates are made to work within another standard, the DNS, which has
    its own requirements. If you decode an IDN name directly, the result is not
    even guaranteed to be in any normalization form standardized by Unicode,
    nor to be directly usable in other non-Unicode-conforming processes without
    a prior transformation into another normalization form.
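
    To illustrate that last point, here is a sketch using Python's 'punycode'
    codec directly (the label is made up, and unicodedata.is_normalized needs
    Python 3.8 or later): the DNS-side form is plain ASCII, and nothing in
    Punycode itself forces the decoded code points into NFC or any other
    normalization form.

        import unicodedata

        # A decomposed (non-NFC) label can be Punycode-encoded as-is...
        label = "u\u0308ber"                    # u + combining diaeresis
        ace = label.encode("punycode")
        print(ace)                              # plain ASCII bytes, fit for the DNS

        # ...and decoding restores exactly those code points, still not in NFC.
        decoded = ace.decode("punycode")
        print(unicodedata.is_normalized("NFC", decoded))    # False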

    Conclusion: normalization forms are not essential to the Unicode standard
    itself. They are just a CONVENIENT way to transform the problem of
    identifying canonically equivalent strings (required in all
    Unicode-conforming processes) into the much simpler problem of comparing
    binary-identical strings that are in the same normalization form. So
    normalization is only needed there as an intermediate result, and is not
    even needed for interchange with other conforming processes. This means
    that Unicode text can be stored in files without normalizing it.
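
    That reduction is easy to show in code; a minimal sketch, where the helper
    name canonically_equal is only an illustrative choice:

        import unicodedata

        def canonically_equal(a: str, b: str) -> bool:
            """Reduce canonical equivalence to binary equality of strings put
            into the same normalization form (NFD here, but any one form
            would do)."""
            return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

        # The stored inputs stay exactly as they are; the normalized copies
        # are only an intermediate result of the comparison.
        print(canonically_equal("\u00e9", "e\u0301"))   # True:  é vs e + combining acute
        print(canonically_equal("\u00e9", "e"))         # False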

    Note that today, modern text renderers should all render in the same way
    two canonically equivalent texts that are in different normalization forms,
    or even not normalized at all: the renderer will normalize its input as an
    intermediate format, but it does not even need to use one of the standard
    normalization forms if it prefers another one. As long as its internal form
    respects canonical equivalence, the renderer is conforming if it produces
    identical rendering for any canonically equivalent texts.
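
    As a sketch of that idea (the internal_form helper and its choice of NFD
    are only illustrative; a real renderer would feed such a sequence to its
    shaping engine):

        import unicodedata

        def internal_form(text: str) -> str:
            """A renderer's private intermediate form: NFD here, but any form
            would do, as long as canonically equivalent inputs always collapse
            to the same internal sequence."""
            return unicodedata.normalize("NFD", text)

        inputs = ["\u1ebf", "\u00ea\u0301", "e\u0302\u0301"]    # three spellings of ế
        print(len({internal_form(s) for s in inputs}))          # 1: all shaped and drawn alike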

    For renderers, note that the classes of texts that produce identical
    rendering will often be larger than the classes of canonically equivalent
    texts. This is notable for Indic renderers, but it also happens when
    rendering canonically distinct texts (like Latin A, Greek capital Alpha,
    and Cyrillic capital A). Such a rendering process is still conforming,
    because it does not produce text with a wrong interpretation for the reader
    (a Greek Alpha input never looks like a Greek Beta when rendered, which
    would lead the reader to think it cannot be an Alpha), and because each of
    its classes of texts with identical rendering contains *only* complete
    classes of canonically equivalent texts.
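
    A quick check that these three letters really are canonically distinct, so
    any unification happens only in the rendering and never in normalization
    (a small Python illustration):

        import unicodedata

        # Latin A, Greek capital Alpha, Cyrillic capital A: often pixel-identical
        # in a given font, yet never unified by any normalization form.
        for ch in ("\u0041", "\u0391", "\u0410"):
            print(hex(ord(ch)), unicodedata.name(ch),
                  unicodedata.normalize("NFC", ch) == "\u0041")
        # 0x41  LATIN CAPITAL LETTER A        True
        # 0x391 GREEK CAPITAL LETTER ALPHA    False
        # 0x410 CYRILLIC CAPITAL LETTER A     False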


