From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Mar 17 2007 - 17:02:48 CST
From Doug Ewell:
> But that list of "other processes" includes most software products on
> the market. Very few text processors really support normalization.
> It's disappointing how many vendors, even today, think "Unicode support"
> means the ability to read and write files in UTF-16 or UTF-8.
And that's why normalization is useful: only for enhancing compatibility
with non-Unicode-conforming processes. If there were only Unicode-conforming
processes, then none of the four standard normalization forms would be
needed, because these applications would all produce canonically equivalent
results for any canonically equivalent input. For example, the order in
which accents are typed in Vietnamese would not matter, as every composition
form and input order would generate the same result, behaving the same
everywhere.
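(A quick sketch of that idea, using Python's standard unicodedata module;
the sample strings are just illustrations. Three canonically equivalent
encodings of the Vietnamese letter "ệ" collapse to one binary form once
normalized:)

  import unicodedata

  # Three canonically equivalent spellings of Vietnamese "ệ":
  s1 = "e\u0302\u0323"   # e + combining circumflex + combining dot below
  s2 = "e\u0323\u0302"   # e + combining dot below + combining circumflex
  s3 = "\u1EC7"          # precomposed U+1EC7

  # The code point sequences differ, but one normalization form
  # maps all of them to the same binary string:
  print(len({s1, s2, s3}))                                             # 3
  print(len({unicodedata.normalize("NFC", s) for s in (s1, s2, s3)}))  # 1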
In reality, IDN (which restricts the set of characters it can accept
internally) could live without normalization to a particular form. If IDN
references one normalization form (in addition to other rules, like those
for the unification of dashes and spaces, or the prohibition of most
controls), it is only in order to be able to compute a unique domain name
from any canonically equivalent input: the algorithm is simpler to describe
if you fix an intermediate normalization form.
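(IDNA 2003's Nameprep, RFC 3491, fixes exactly such an intermediate form:
NFKC. A sketch using Python's built-in "idna" codec, which implements those
2003 rules; the label is just an illustration:)

  # Two canonically equivalent inputs for the label "bücher":
  composed   = "b\u00FCcher"    # precomposed u-with-diaeresis
  decomposed = "bu\u0308cher"   # u + combining diaeresis

  # Nameprep normalizes to NFKC before encoding, so both inputs
  # yield the same unique ASCII-compatible DNS label:
  print(composed.encode("idna"))    # b'xn--bcher-kva'
  print(decomposed.encode("idna"))  # b'xn--bcher-kva'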
Remember that IDN does not produce strings encoded in any UTF: the strings
it generates are made to work within another standard, the DNS, which has
its own requirements. If you decode an IDN name directly, the result is not
even guaranteed to be in any normalization form standardized by Unicode, or
even usable directly in non-Unicode-conforming processes without prior
transformation into another normalization form.
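(To illustrate that last point: the ASCII-compatible encoding itself is
normalization-agnostic, since the Punycode algorithm of RFC 3492 can carry
any code point sequence. A sketch using Python's "punycode" codec, with an
illustrative string:)

  import unicodedata

  raw = "cafe\u0301"              # 'e' + combining acute: not in NFC
  ace = raw.encode("punycode")    # Punycode accepts it as-is
  back = ace.decode("punycode")
  print(back == raw)                                 # True: round-trips
  print(back == unicodedata.normalize("NFC", back))  # False: not NFC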
Conclusion: normalization forms are not essential for the Unicode standard
itself. They are just a CONVENIENT way to transform the problem of
identifying canonically equivalent strings (required in all
Unicode-conforming processes) into the much simpler problem of comparing
binary-identical strings that are in the same normalization form. So
normalization is only needed there as an intermediate result, and is not
even needed for interchange with other conforming processes. This means that
Unicode texts can be stored in files without normalizing them.
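(That transformation is the usual normalize-then-compare pattern; a sketch
in Python, with illustrative sample strings:)

  import unicodedata

  def canonically_equivalent(a: str, b: str) -> bool:
      # Reduce canonical equivalence to binary equality by normalizing
      # both strings to the same (arbitrarily chosen) form:
      return (unicodedata.normalize("NFC", a)
              == unicodedata.normalize("NFC", b))

  print("Vi\u1EC7t" == "Vie\u0323\u0302t")                 # False: code points differ
  print(canonically_equivalent("Vi\u1EC7t", "Vie\u0323\u0302t"))  # True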
Note that today, modern text renderers should all render in the same way two
canonically equivalent texts that are in different normalization forms, or
even not normalized at all: the renderer will normalize its input as an
intermediate step, but it does not even need to use one of the standard
normalization forms if it prefers another one. As long as that modified
normalized form respects canonical equivalence, the renderer will be
conforming if it produces identical rendering from any canonically
equivalent texts.
For renderers, it is highly notable that the classes of texts that generate
identical rendering will be larger than the classes of Unicode canonically
equivalent texts. This is the case for Indic renderers, but also when
rendering canonically different texts (such as Latin A, Greek Α, and
Cyrillic А). Still, such a rendering process is conforming because it does
not produce text that has a wrong interpretation for the reader (a Greek Α
input does not come out looking like a Greek Β, which would make the reader
think it certainly cannot be an Α), and because its classes of texts with
equivalent rendering will contain *only* complete classes of canonically
equivalent texts.
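(To make that last distinction concrete: the three look-alike capital
letters below form one rendering class in many fonts, yet they remain three
distinct canonical classes, since no normalization form conflates them. A
sketch in Python:)

  import unicodedata

  # Latin, Greek and Cyrillic capital A often render identically, but
  # normalization keeps them apart: each is its own canonical class.
  for ch in ("\u0041", "\u0391", "\u0410"):
      print(f"U+{ord(ch):04X}", unicodedata.name(ch),
            unicodedata.normalize("NFD", ch) == "\u0041")
  # U+0041 LATIN CAPITAL LETTER A True
  # U+0391 GREEK CAPITAL LETTER ALPHA False
  # U+0410 CYRILLIC CAPITAL LETTER A False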