From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 23 2009 - 13:04:37 CST
> I think the point that David is making is that your numbers only show
> "optimized performance for the overwhelming majority" and show nothing
> about "acceptable performance for everything". Since your two sample
> texts don't test out the badly performing areas of "everything", using
> only the data presented on your page the reader can not conclude the
> latter.
One thing folks concerned about this could do is run benchmarks
with various implementations over the well-known data set available
in:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
That file's content is *deliberately* very, very far from typical
web content, of which Mark has empirically measured ~99.98% of
HTML page content to be in NFC for the web as a whole.
NormalizationTest.txt contains lots of bizarre edge cases and
lots of non-NFC data, specifically to ensure that implementations
of Unicode Normalization catch the corner cases.
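
As a rough sketch of what such a benchmark harness might look like,
here is a small Python example using the standard unicodedata module.
The SAMPLE rows are illustrative lines written in the file's column
format (source; NFC; NFD; NFKC; NFKD), not a verbatim excerpt, and
the names parse_rows and time_form are hypothetical helpers made up
for this sketch. A real run would read the full file instead.

```python
import time
import unicodedata

# Illustrative rows in the NormalizationTest.txt column format
# (source; NFC; NFD; NFKC; NFKD). Assumption: these stand in for
# the real file, which would normally be read from disk.
SAMPLE = """\
# comment line
@Part1 # Character by character test
1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE
212B;00C5;0041 030A;00C5;0041 030A; # ANGSTROM SIGN
"""


def parse_rows(text):
    """Return a list of [c1, c2, c3, c4, c5] strings, one list per data row."""
    rows = []
    for line in text.splitlines():
        # Drop trailing comments, then skip blank lines and @Part headers.
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("@"):
            continue
        fields = line.split(";")[:5]
        rows.append(
            ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]
        )
    return rows


def time_form(form, strings, repeat=1000):
    """Return elapsed seconds for normalizing every string `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        for s in strings:
            unicodedata.normalize(form, s)
    return time.perf_counter() - start


if __name__ == "__main__":
    sources = [row[0] for row in parse_rows(SAMPLE)]
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, time_form(form, sources))
```

Timing the same harness over ordinary web text would give the
"overwhelming majority" number to compare against.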
As for Asmus' call to "pick one of the languages and one of the
data formats that give the most scope to actually exercise the
normalization part of the implementation algorithm", one
suggestion I would have would be to try focussing on Vietnamese.
Vietnamese now has a significant (and growing) web presence,
much of it in UTF-8 (cf. http://www.sgtt.com/vn/), and
Vietnamese is one of the few widely implemented major languages
that make significant use of multiple combining marks with
a single base character. Furthermore, while opinions vary,
the preferred representation of Vietnamese is often taken as
using precomposed characters for all of the basic vowels,
but then combining marks for the tones -- in that format,
Vietnamese data would be neither in NFC nor in NFD. So it
may be possible to turn up significant data corpora for Vietnamese
which are not in a Unicode normalization form, although the
impetus for most web data to be in NFC anyway might mean that
the Vietnamese websites are already skewed that way, despite
any a priori preferences for text representation.
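
To make the "neither NFC nor NFD" point concrete, here is a minimal
Python sketch using the standard unicodedata module. The word and
the helper name classify are just illustrative choices for this
example. The mixed spelling uses a precomposed vowel (U+00EA) plus
a combining tone mark (U+0323).

```python
import unicodedata


def classify(s):
    """Report which canonical normalization forms the string is already in."""
    is_nfc = unicodedata.normalize("NFC", s) == s
    is_nfd = unicodedata.normalize("NFD", s) == s
    if is_nfc and is_nfd:
        return "both"
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"


# Three spellings of the same word:
mixed = "Vi\u00EA\u0323t"   # precomposed e-circumflex + combining dot below
nfc = "Vi\u1EC7t"           # fully precomposed
nfd = "Vie\u0323\u0302t"    # base e + dot below + circumflex (canonical order)

if __name__ == "__main__":
    for label, s in (("mixed", mixed), ("nfc", nfc), ("nfd", nfd)):
        print(label, classify(s))
```

Running a check like this over a Vietnamese corpus would show how
much of it actually falls outside both forms.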
--Ken