From: Mark Davis (mark.edward.davis@gmail.com)
Date: Mon Feb 23 2009 - 15:18:24 CST
That's actually what I had thought as well, until I looked at the data. As
it turns out, Vietnamese in HTML docs is essentially all NFC: 99.9% of
characters are definitely NFC (NFC_QC=Yes). Bengali, with about one sixth
as many characters in HTML files on the web, has the lowest proportion: by
the NFC Quick-Check property, about 1.5% of characters come out as No and
about 8.75% as Maybe.
(Normal caveats apply: sample space, changes over time, algorithmically
detected language, etc.)
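
For anyone who wants to reproduce that kind of per-character count, here's
a minimal sketch (not the actual tooling behind the figures above). It
classifies each character by its NFC Quick-Check value using the UCD file
DerivedNormalizationProps.txt, which lists the code points with NFC_QC=N
or NFC_QC=M; anything unlisted defaults to Yes.

from collections import Counter

def load_nfc_qc(path):
    """Map code point -> 'N' or 'M' from the NFC_QC lines of
    DerivedNormalizationProps.txt; code points not listed are Yes."""
    qc = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()      # drop comments
            fields = [x.strip() for x in line.split(';')]
            if len(fields) < 3 or fields[1] != 'NFC_QC':
                continue
            lo, _, hi = fields[0].partition('..')     # "0340..0341" or "0958"
            for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                qc[cp] = fields[2]                    # 'N' or 'M'
    return qc

def nfc_qc_profile(text, qc):
    """Fraction of characters whose NFC_QC is Yes ('Y'), No ('N'), Maybe ('M')."""
    counts = Counter(qc.get(ord(ch), 'Y') for ch in text)
    total = len(text) or 1
    return {v: counts[v] / total for v in ('Y', 'N', 'M')}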
Mark
On Mon, Feb 23, 2009 at 11:04, Kenneth Whistler <kenw@sybase.com> wrote:
>
> > I think the point that David is making is that your numbers only show
> > "optimized performance for the overwhelming majority" and show nothing
> > about "acceptable performance for everything". Since your two sample
> > texts don't exercise the badly performing areas of "everything", the
> > reader cannot conclude the latter from the data presented on your page
> > alone.
>
> One thing folks concerned about this could do is run benchmarks
> with various implementations over the well-known data set available
> in:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
> That file's content is *deliberately* very, very far from the
> ~99.98%-in-NFC figure that Mark has determined empirically for
> web HTML page content as a whole. NormalizationTest.txt contains
> lots of bizarre edge cases and lots of non-NFC data, specifically
> to ensure that implementations of Unicode Normalization catch
> the corner cases. (A sketch of such a run follows this message.)
>
> As for Asmus' call to "pick one of the languages and one of the
> data formats that give the most scope to actually exercise the
> normalization part of the implementation algorithm", one
> suggestion I would have is to focus on Vietnamese.
> Vietnamese now has a significant (and growing) web presence,
> much of it in UTF-8 (cf. http://www.sgtt.com/vn/), and
> Vietnamese is one of the few major languages widely implemented
> that makes significant use of multiple combining marks with
> a single base character. Furthermore, while opinions vary,
> the preferred representation of Vietnamese is often taken to be
> precomposed characters for all of the basic vowels but
> combining marks for the tones -- in that format (illustrated
> after this message), Vietnamese data would be neither in NFC
> nor in NFD. So it
> may be possible to turn up significant data corpora for Vietnamese
> which are not in a Unicode normalization form, although the
> impetus for most web data to be in NFC anyway might mean that
> the Vietnamese websites are already skewed that way, despite
> any a priori preferences for text representation.
>
> --Ken
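
Following up on Ken's suggestion, here is a rough sketch of such a
benchmark run, with Python's unicodedata module standing in for whatever
implementation one actually wants to measure. The five-column layout --
source; NFC; NFD; NFKC; NFKD -- is from the test file's own header.

import time
import unicodedata

def load_test_strings(path):
    """Collect every test string from NormalizationTest.txt (all five columns)."""
    strings = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop comments
            if not line or line.startswith('@'):   # skip @PartN headers
                continue
            for field in line.split(';')[:5]:
                strings.append(''.join(chr(int(cp, 16)) for cp in field.split()))
    return strings

def bench(strings, form='NFC', reps=100):
    """Time `reps` normalization passes over the whole corpus."""
    start = time.perf_counter()
    for _ in range(reps):
        for s in strings:
            unicodedata.normalize(form, s)
    return time.perf_counter() - start

strings = load_test_strings('NormalizationTest.txt')
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, bench(strings, form), 'seconds')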
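
And a concrete instance of the "neither NFC nor NFD" representation Ken
describes, using one Vietnamese vowel chosen purely for illustration:

import unicodedata

nfc   = '\u1EBF'        # fully precomposed: E WITH CIRCUMFLEX AND ACUTE
mixed = '\u00EA\u0301'  # precomposed vowel U+00EA + COMBINING ACUTE (the tone)
nfd   = 'e\u0302\u0301' # fully decomposed: e + COMBINING CIRCUMFLEX + ACUTE

assert unicodedata.normalize('NFC', mixed) == nfc  # it composes further,
assert unicodedata.normalize('NFD', mixed) == nfd  # and decomposes further,
assert mixed not in (nfc, nfd)                     # so it is in neither form.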