From: Mark Davis (mark@macchiato.com)
Date: Sun Feb 22 2009 - 18:24:44 CST
The implementations I tested do fall back to fast paths where possible. Given
the data:
- ~99.98% of web HTML page *content* characters are definitely NFC.
  - *Content* means what remains after discarding markup and resolving
    entities.
- ~99.9999% of web HTML page *markup* characters are definitely NFC.
  - This is because so much of markup is plain ASCII.
an illustrative sample simulating documents would be
- simulating content:
  - 999,800 characters (82% being ASCII, then Cyrillic, Han, Arabic, other
    Latin, ...) not needing normalization, and
  - 200 characters needing normalization, and
- simulating markup:
  - 999,999 characters (99.5% being ASCII, ...) not needing normalization,
    and
  - 1 character needing normalization.
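As a rough illustration of the sample sizes above, here is a minimal Python sketch that builds two strings with those proportions. The stand-in characters ("a" for text that is already NFC, U+212B ANGSTROM SIGN for a character that NFC changes) and the helper names build_sample and count_non_nfc are assumptions made for the example; the script mix (ASCII, Cyrillic, Han, ...) is not modeled.

```python
import unicodedata

# Hypothetical stand-ins; the message does not say which characters to use.
CLEAN = "a"            # already in NFC
NEEDS_NFC = "\u212b"   # ANGSTROM SIGN; NFC maps it to U+00C5


def build_sample(clean_count, dirty_count):
    """Spread dirty_count non-NFC characters evenly among clean_count NFC ones."""
    if dirty_count == 0:
        return CLEAN * clean_count
    run, extra = divmod(clean_count, dirty_count)
    parts = []
    for i in range(dirty_count):
        # The first `extra` runs get one more clean character each.
        parts.append(CLEAN * (run + (1 if i < extra else 0)))
        parts.append(NEEDS_NFC)
    return "".join(parts)


def count_non_nfc(text):
    """Count characters that NFC changes when normalized on their own.

    This per-character check is enough for the stand-ins used here; it does
    not detect sequences that only change in context.
    """
    return sum(1 for ch in text if unicodedata.normalize("NFC", ch) != ch)


content = build_sample(999_800, 200)   # simulated content
markup = build_sample(999_999, 1)      # simulated markup
```

Timing unicodedata.normalize("NFC", ...) over content and markup then gives one data point each for the "mostly fast-path" case.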
However, since the main issue that the FAQ is aimed at is the normalization
of identifiers (such as XML Names), the two test strings in the FAQ are
probably as good as any.
Mark
On Sun, Feb 22, 2009 at 15:21, Michael D. Adams <mdmkolbe@gmail.com> wrote:
> First, thank you for putting this up. As an (amateur) implementor
> this gives me a better feel for what numbers I need to target.
>
> However, it would be nice if you could pick samples to test that might
> give a better feel for the performance parameters of normalization.
> The "nörmalization" test is good as it shows the performance of the
> fast-path. But the "No\u0308rmalization" test doesn't really give a
> good feel for performance as the last eleven characters may or may not
> have been fast-pathed. Perhaps a few more points varying from
> completely unfast-pathable (e.g.
> "o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308") to
> somewhat fast-pathable might be more helpful.
>
> Michael D. Adams
> mdmkolbe@gmail.com
>
> On Thu, Feb 19, 2009 at 1:08 PM, Mark Davis <mark@macchiato.com> wrote:
> > In response to questions from some people in the W3C, I put together an FAQ
> > on NFC normalization, at http://www.macchiato.com/unicode/nfc-faq
> >
> > I have some figures on performance and footprint in there as examples; if
> > anyone else has figures from other implementations, I'd appreciate them.
> >
> > Mark
> >
>
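The graded test inputs suggested in the quoted message (from completely un-fast-pathable to fully fast-pathable) could be generated along these lines. This is only a sketch: the helper name graded_inputs and the choice of nine units are assumptions, taken from the nine-unit example string in the quote.

```python
import unicodedata

DECOMPOSED = "o\u0308"   # o + COMBINING DIAERESIS; NFC composes it
PRECOMPOSED = "\u00f6"   # already NFC


def graded_inputs(units=9):
    """Return (decomposed_count, text) pairs, from all-NFC to all-decomposed."""
    return [
        (k, DECOMPOSED * k + PRECOMPOSED * (units - k))
        for k in range(units + 1)
    ]


samples = graded_inputs()
for k, text in samples:
    # Every variant normalizes to the same nine precomposed characters.
    assert unicodedata.normalize("NFC", text) == PRECOMPOSED * 9
```

Timing normalization over each entry in samples would show how cost varies with the share of text that cannot take the fast path.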