From: Mark Davis (mark.edward.davis@gmail.com)
Date: Mon Feb 23 2009 - 00:20:40 CST
I agree firmly with Addison.
NFC is very important for languages whose text routinely can be in different
equivalent forms. But for determining a strategy for implementation, it is
important to measure over all expected cases. And performance for
tokenizing, even for languages such as those, is not that bad.
The top characters that are not definitely NFC may surprise some people;
they are:
drum roll...
1. BENGALI VOWEL SIGN AA
2. TAMIL VOWEL SIGN AA
3. MALAYALAM VOWEL SIGN AA
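
As a rough illustration of why these three characters cannot be treated as definitely NFC: each one composes with a particular preceding vowel sign, so whether the text is already in NFC depends on context. The sketch below uses Python's standard unicodedata module. The code points and pairings are my reading of the Unicode charts rather than anything stated in this thread, and the PAIRS table and nfc_changes helper are names made up for the example.

```python
import unicodedata

# (preceding vowel sign, AA vowel sign) per script.
# Assumption: code points and pairings are taken from the Unicode charts,
# not from the measurements discussed in this thread.
PAIRS = {
    "BENGALI": ("\u09C7", "\u09BE"),    # VOWEL SIGN E + VOWEL SIGN AA
    "TAMIL": ("\u0BC6", "\u0BBE"),      # VOWEL SIGN E + VOWEL SIGN AA
    "MALAYALAM": ("\u0D46", "\u0D3E"),  # VOWEL SIGN E + VOWEL SIGN AA
}

def nfc_changes(text):
    """Return True if normalizing to NFC would alter the text."""
    return unicodedata.normalize("NFC", text) != text

for script, (e_sign, aa_sign) in PAIRS.items():
    composed = unicodedata.normalize("NFC", e_sign + aa_sign)
    print(
        unicodedata.name(aa_sign),
        "| alone changes:", nfc_changes(aa_sign),
        "| after E sign changes:", nfc_changes(e_sign + aa_sign),
        "| composes to:", unicodedata.name(composed),
    )
```

On its own the AA sign is left untouched by NFC, but after the matching E sign the pair is replaced by a single precomposed vowel sign, which is why a quick check has to look at the neighboring character before declaring the text normalized.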
Mark
On Sun, Feb 22, 2009 at 21:18, Phillips, Addison <addison@amazon.com> wrote:
> The Mac thing is overblown. Macs use NFD in their filesystems, but they
> don't generate any more non-NFC content than any other system in file
> content. So Mark's data is entirely reasonable and within expectations...
> for the Web as a whole.
>
> There are languages for which "0.02%" is not a useful metric (hence the
> whole impetus for a FAQ). But as an overall measure, it's not a surprising
> number. Note that languages for which non-normalized data is likely to
> appear are also likely to be disadvantaged languages with *comparatively*
> small presence on the Internet today.
>
> Addison
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
>
> > -----Original Message-----
> > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> > On Behalf Of Doug Ewell
> > Sent: Sunday, February 22, 2009 6:13 PM
> > To: Unicode Mailing List
> > Subject: Re: NFC FAQ
> >
> > Mark Davis wrote:
> >
> > > an illustrative sample simulating documents would be
> > >
> > > simulating content:
> > >
> > > 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other
> > > Latin, ...) not needing normalization, and
> > >
> > > 200 characters needing normalization,
> >
> > If you did happen to run into some data that started out in NFD -- say,
> > generated on a Mac -- you'd have a lot more than 0.02% of content
> > characters needing normalization.
> >
> > --
> > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> > http://www.ewellic.org
> > http://www1.ietf.org/html.charters/ltru-charter.html
> > http://www.alvestrand.no/mailman/listinfo/ietf-languages
> >
>
>
>
>