From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Feb 23 2009 - 01:55:11 CST
Mark,
I think Michael D. Adams has a point, which is worth considering.
NFC is clearly an important format, and part of the reason to have
this FAQ is to convince people that normalization to NFC is not
prohibitively expensive.
The full argument has two prongs. You've delivered the first one,
that is, "If I had to normalize the web, what's the complexity of the
task?". Your answer, of course, is that most of the web is close to
NFC, so on average, little work remains to be done beyond
verification (quick check).
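
To make the quick-check point concrete, here is a rough Python sketch
(my own illustration, not anything from your measurements) of the
"verify first, normalize only if needed" pattern. It relies only on the
standard library; unicodedata.is_normalized(), available since Python
3.8, applies the quick-check properties before falling back to a full
comparison.

import unicodedata

def to_nfc(text: str) -> str:
    """Return text in NFC, skipping the full normalization pass
    when the quick check already confirms the data is in NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text  # the common case on the web: nothing to do
    return unicodedata.normalize("NFC", text)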
The second question on which people want to be reassured is this:
"Is there some known data format that, for data in some language,
forces NFC to be unacceptably slow, if I have to predominantly
process data from that language?"
I believe that the answer to that question is also largely reassuring
(that is, no such format is known), because most languages (or data
formats) don't produce arbitrarily long runs of combining characters
that need composition or reordering.
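
One way to put a bound on that claim is to measure it directly. The
following small diagnostic (again just an illustrative sketch of mine)
reports the longest run of combining characters in a sample; for
decomposed European text the result is a very small number, never
anything unbounded.

import unicodedata

def longest_combining_run(text: str) -> int:
    """Length of the longest consecutive run of characters with a
    nonzero canonical combining class."""
    longest = run = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest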
European data in NFD, which I suspect is not even the actual worst
case in this respect, contains about 10% characters that need
composition, but, as doubly accented characters are not part of the
usual alphabets, there's little scope for reordering. Any
implementation that fast-tracks the remaining 90% of characters in
such data is still going to be fast. And any dilution of such data
with HTML/XML markup is going to improve matters.
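
For what it's worth, the fast-track idea can itself be sketched in a
few lines of Python (a hypothetical illustration, not any particular
implementation): spans of ASCII are already in NFC and can be copied
through untouched, provided the span boundary never separates a base
letter from a following combining mark that might still compose with
it; only the remaining spans are handed to the real normalizer. On
markup-heavy data most of the input takes the fast branch, which is
exactly the dilution effect mentioned above.

import unicodedata

def _passthrough_ok(text: str, k: int) -> bool:
    """True if text[k] is ASCII and not followed by a non-ASCII
    character, so it cannot take part in composition or reordering."""
    return ord(text[k]) < 128 and (k + 1 == len(text) or ord(text[k + 1]) < 128)

def normalize_nfc_fast(text: str) -> str:
    out = []
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        if _passthrough_ok(text, i):
            # Fast path: extend the run of safe ASCII characters and
            # copy it through unchanged (ASCII is invariant under NFC).
            while j < n and _passthrough_ok(text, j):
                j += 1
            out.append(text[i:j])
        else:
            # Slow path: extend up to the next safe ASCII character and
            # normalize just this span with the real normalizer.
            while j < n and not _passthrough_ok(text, j):
                j += 1
            out.append(unicodedata.normalize("NFC", text[i:j]))
        i = j
    return "".join(out)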
However, in order to win over people who harbor doubts, it would be
useful if you (or people with experience of challenging combinations
of language and data format) could describe what "realistic" worst
cases might look like and how they would affect performance in
situations where such data were to dominate.
I suspect that the answers would still point to encouragingly low
upper bounds, but at the moment the argument's second prong is not
finished.
A./