Re: NFC FAQ

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Mon Feb 23 2009 - 15:18:24 CST


    That's actually what I had thought as well, until I looked at the data. As
    it turns out, Vietnamese in HTML docs is essentially all NFC (99.9% of
    characters definitely NFC [NFC_QC=Yes]). Bengali, with about one-sixth as
    many characters in HTML files on the web, has the lowest proportion: by
    the NFC Quick-Check property, about 1.5% of its characters are No and
    about 8.75% are Maybe.
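
    For anyone who wants to reproduce that kind of tally, here is a minimal
    sketch (my own illustration, not the script behind the numbers above)
    that classifies the characters of a UTF-8 text file by their NFC_QC
    value, using the data in
    http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt:

        import sys
        from collections import Counter

        def load_nfc_qc(path):
            # Parse the NFC_QC entries; code points not listed default to Yes.
            qc = {}
            for line in open(path, encoding='utf-8'):
                line = line.split('#', 1)[0].strip()
                fields = [f.strip() for f in line.split(';')]
                if len(fields) < 3 or fields[1] != 'NFC_QC':
                    continue
                lo, _, hi = fields[0].partition('..')
                for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                    qc[cp] = fields[2]          # 'N' or 'M'
            return qc

        qc = load_nfc_qc('DerivedNormalizationProps.txt')
        text = open(sys.argv[1], encoding='utf-8').read()
        counts = Counter(qc.get(ord(ch), 'Y') for ch in text)
        for value in ('Y', 'N', 'M'):
            pct = 100.0 * counts[value] / max(len(text), 1)
            print('NFC_QC=%s: %.2f%%' % (value, pct))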

    (The usual caveats apply: sample space, changes over time, algorithmically
    detected language, etc.)

    Mark

    On Mon, Feb 23, 2009 at 11:04, Kenneth Whistler <kenw@sybase.com> wrote:

    >
    > > I think the point that David is making is that your numbers only show
    > > "optimized performance for the overwhelming majority" and show nothing
    > > about "acceptable performance for everything". Since your two sample
    > > texts don't test out the badly performing areas of "everything", using
    > > only the data presented on your page, the reader cannot conclude the
    > > latter.
    >
    > One thing folks concerned about this could do is run benchmarks
    > with various implementations over the well-known data set available
    > in:
    >
    > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
    >
    > That file *deliberately* contains content very, very far from the
    > ~99.98%-in-NFC profile that Mark has determined empirically for web
    > HTML page content as a whole.
    > NormalizationTest.txt contains lots of bizarre edge cases and
    > lots of non-NFC data, specifically to ensure that implementations
    > of Unicode Normalization catch the corner cases.
    >
    > As for Asmus' call to "pick one of the languages and one of the
    > data formats that give the most scope to actually exercise the
    > normalization part of the implementation algorithm", one
    > suggestion I would have would be to focus on Vietnamese.
    > Vietnamese now has a significant (and growing) web presence,
    > much of it in UTF-8 (cf. http://www.sgtt.com/vn/), and
    > Vietnamese is one of the few major languages widely implemented
    > that makes significant use of multiple combining marks with
    > a single base character. Furthermore, while opinions vary,
    > the preferred representation of Vietnamese is often taken as
    > using precomposed characters for all of the basic vowels,
    > but then combining marks for the tones -- in that format,
    > Vietnamese data would be neither in NFC nor in NFD. So it
    > may be possible to turn up significant data corpora for Vietnamese
    > that are not in any Unicode normalization form, although the
    > impetus for most web data to be in NFC anyway might mean that
    > the Vietnamese websites are already skewed that way, despite
    > any a priori preferences for text representation.
    >
    > --Ken
    >
    >
    >
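
    For anyone who wants to try the benchmark Ken describes, here is a
    minimal sketch (using Python's built-in unicodedata module as a stand-in
    for the implementation under test; the invariants are the ones stated in
    the header of NormalizationTest.txt) that checks NFC conformance over
    that file:

        import unicodedata

        def parse(field):
            # Each field is a sequence of space-separated hex code points.
            return ''.join(chr(int(cp, 16)) for cp in field.split())

        def nfc(s):
            return unicodedata.normalize('NFC', s)

        failures = 0
        for line in open('NormalizationTest.txt', encoding='utf-8'):
            line = line.split('#', 1)[0].strip()
            if not line or line.startswith('@'):
                continue                  # skip comments and @Part markers
            c1, c2, c3, c4, c5 = (parse(f) for f in line.split(';')[:5])
            # Invariants: c2 == NFC(c1) == NFC(c2) == NFC(c3)
            #             c4 == NFC(c4) == NFC(c5)
            if not (c2 == nfc(c1) == nfc(c2) == nfc(c3)
                    and c4 == nfc(c4) == nfc(c5)):
                failures += 1
        print('failures:', failures)

    Wrapping the loop in a timer turns the same harness into the kind of
    worst-case benchmark suggested above.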

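    And to make the Vietnamese point concrete, a small illustration
    (character choice mine; requires Python 3.8+ for
    unicodedata.is_normalized) of three canonically equivalent
    representations of the vowel ế:

        import unicodedata

        forms = {
            'precomposed': '\u1EBF',        # single code point (the NFC form)
            'mixed':       '\u00EA\u0301',  # precomposed vowel + combining tone mark
            'decomposed':  'e\u0302\u0301', # base + two combining marks (the NFD form)
        }
        for label, s in forms.items():
            print(label,
                  ['U+%04X' % ord(c) for c in s],
                  'NFC?', unicodedata.is_normalized('NFC', s),
                  'NFD?', unicodedata.is_normalized('NFD', s))

    The "mixed" form, a precomposed base vowel plus a combining tone mark, is
    exactly the representation Ken describes: neither NFC nor NFD.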

