From: Phillips, Addison (addison@amazon.com)
Date: Sun Feb 22 2009 - 23:18:16 CST
The Mac thing is overblown. Macs use NFD for names in their filesystems, but they don't generate any more non-NFC file content than any other system. So Mark's data is entirely reasonable and within expectations... for the Web as a whole.
There are languages for which "0.02%" is not a useful metric (hence the whole impetus for a FAQ). But as an overall measure, it's not a surprising number. Note that languages for which non-normalized data is likely to appear are also likely to be disadvantaged languages with *comparatively* small presence on the Internet today.
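To make the numbers concrete, here is a rough Python sketch using only the standard unicodedata module. The helper name count_non_nfc and the sample string are made up for illustration, and the counting rule (a starter plus its trailing combining marks is counted whenever NFC would change it) is a simplifying assumption, not the method behind Mark's figures. It ignores cases such as Hangul jamo sequences.

```python
import unicodedata

def count_non_nfc(text: str) -> int:
    """Approximate count of code points that NFC would change.

    Hypothetical helper: splits text into a starter plus any following
    combining marks, and counts every code point in a segment whose NFC
    form differs from the segment itself.
    """
    count = 0
    segment = ""
    for ch in text:
        # A code point with combining class 0 starts a new segment.
        if segment and unicodedata.combining(ch) == 0:
            if unicodedata.normalize("NFC", segment) != segment:
                count += len(segment)
            segment = ""
        segment += ch
    if segment and unicodedata.normalize("NFC", segment) != segment:
        count += len(segment)
    return count

# Mark's illustrative sample: 200 characters out of 1,000,000 need normalization.
sample_share = 200 / (999_800 + 200)
print(f"{sample_share:.2%}")

# The same word as precomposed (NFC) text and as decomposed (NFD) text.
nfc_name = "caf\u00e9"
nfd_name = unicodedata.normalize("NFD", nfc_name)  # 'e' followed by U+0301

for label, s in (("NFC", nfc_name), ("NFD", nfd_name)):
    print(label, len(s), count_non_nfc(s), f"{count_non_nfc(s) / len(s):.0%}")
```

On a single accented word the NFD form is far above the overall average, which is Doug's point. The overall figure stays small only because so little file content is stored decomposed in the first place.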
Addison
Addison Phillips
Globalization Architect -- Lab126
Internationalization is not a feature.
It is an architecture.
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Doug Ewell
> Sent: Sunday, February 22, 2009 6:13 PM
> To: Unicode Mailing List
> Subject: Re: NFC FAQ
>
> Mark Davis wrote:
>
> > an illustrative sample simulating documents would be
> >
> > simulating content:
> >
> > 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other
> > Latin, ...) not needing normalization, and
> >
> > 200 characters needing normalization,
>
> If you did happen to run into some data that started out in NFD -- say,
> generated on a Mac -- you'd have a lot more than 0.02% of content
> characters needing normalization.
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>