From: Michael D. Adams (
Date: Mon Feb 23 2009 - 00:06:26 CST
I think this depends on who your audience is and what your goals are.
If the point is to show that the average-case cost of normalization is
relatively low, then statistical samples representative of the actual
data out there are fine (as you have done). However, the
measurements you have posted only show that "when the data can be
fast-pathed the implementations are fast".
However, as an implementer I want to know not just the average case
but a complete performance model. This means measuring on a number of
different sorts of input so that one can start to characterize the
performance given a variety of inputs. Measuring inputs that are
almost entirely fast-path provides no insight into this.
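
As a rough illustration of the kind of sweep I mean, here is a minimal
Python sketch that times NFC over inputs ranging from entirely
precomposed to entirely decomposed. It uses the standard unicodedata
module; the helper names make_sample and time_nfc, the sample size, and
the chosen fractions are only assumptions for illustration, not anything
from the FAQ or from the implementations tested there.

```python
import timeit
import unicodedata


def make_sample(pairs: int, decomposed_fraction: float) -> str:
    """Build `pairs` o-umlaut units; the given fraction is decomposed (o + U+0308)."""
    n_decomposed = round(pairs * decomposed_fraction)
    # Decomposed units need composing; precomposed U+00F6 is already NFC.
    return "o\u0308" * n_decomposed + "\u00f6" * (pairs - n_decomposed)


def time_nfc(text: str, repeats: int = 20) -> float:
    """Best-of-`repeats` wall-clock seconds for one NFC pass over `text`."""
    return min(
        timeit.repeat(
            lambda: unicodedata.normalize("NFC", text),
            number=1,
            repeat=repeats,
        )
    )


if __name__ == "__main__":
    # Hypothetical sweep: from fully fast-pathable to fully unfast-pathable.
    for fraction in (0.0, 0.01, 0.25, 0.5, 1.0):
        sample = make_sample(100_000, fraction)
        print(f"{fraction:>5.0%} decomposed: {time_nfc(sample) * 1e6:8.0f} us")
```

The absolute numbers will depend on the implementation and machine; the
point is only the shape of the curve between the two extremes.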
Even if you really do just want to show that normalization is cheap,
I might still measure a worst-case(*) text for the sake of
scientific honesty (you already have a best-case and an average-case
text). Either the results are still fast, in which case the argument
that normalization is fast is even stronger, or the results are slow
and can be used to underscore the importance of the "99% are NFC"
data. In addition, omitting the worst-case(*) results makes
things look suspicious, so showing the non-fast-path results, even if
they aren't very good, gives the page more credibility.
Michael D. Adams
(*) OK, maybe not the true worst case, where there is a string of thousands
of combining characters to be sorted. Those aren't just rare; they
never happen unless someone is messing with you on purpose (they are
impossible in non-artificial text). (Though on second thought, those
results might still be interesting, to show that NFC time won't blow up
when hackers start sending oddly formed text at you.)
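
For the artificial case in the footnote, a sketch along these lines could
be used to check whether the time per character stays flat as the run of
combining marks grows. It assumes Python's unicodedata module; the base
letter "q", the helper names adversarial and seconds_per_char, and the
sizes are arbitrary choices for illustration. U+0301 has canonical
combining class 230 and U+0323 has class 220, so every pair in the input
is out of canonical order and the whole run has to be reordered.

```python
import time
import unicodedata


def adversarial(pairs: int) -> str:
    """One base letter followed by `pairs` out-of-order combining-mark pairs."""
    # U+0301 (class 230) precedes U+0323 (class 220), so each pair must be swapped.
    return "q" + "\u0301\u0323" * pairs


def seconds_per_char(text: str) -> float:
    """Wall-clock seconds for a single NFC pass, divided by input length."""
    start = time.perf_counter()
    unicodedata.normalize("NFC", text)
    return (time.perf_counter() - start) / len(text)


if __name__ == "__main__":
    # If the per-character cost climbs with the run length, reordering is superlinear.
    for pairs in (100, 1_000, 10_000):
        cost = seconds_per_char(adversarial(pairs))
        print(f"{2 * pairs:>6} marks: {cost * 1e9:8.1f} ns/char")
```

A single timed pass is noisy, so in practice one would repeat each
measurement and take the minimum.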
On Sun, Feb 22, 2009 at 7:24 PM, Mark Davis <> wrote:
> The implementations I tested do revert to fast-paths where possible. Given
> the data:
> ~99.98% of web HTML page content characters are definitely NFC.
> Content means after discarding markup, and doing entity resolution.
> ~99.9999% of web HTML page markup characters are definitely NFC.
> Because so much of markup is plain ASCII.
> an illustrative sample simulating documents would be
> simulating content:
> 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other Latin,
> ...) not needing normalization, and
> 200 characters needing normalization, and
> simulating markup:
> 999,999 characters (99.5% being ASCII, ...) not needing normalization, and
> 1 character needing normalization.
> However, since the main issue that the FAQ is aimed at is the normalization
> of identifiers (like XML Name), the two choices are probably as good as any.
> Mark
> On Sun, Feb 22, 2009 at 15:21, Michael D. Adams <> wrote:
>> First, thank you for putting this up. As an (amateur) implementor
>> this gives me a better feel for what numbers I need to target.
>> However, it would be nice if you could pick samples to test that might
>> give a better feel for the performance parameters of normalization.
>> The "nörmalization" test is good as it shows the performance of the
>> fast-path. But the "No\u0308rmalization" test doesn't really give a
>> good feel for performance as the last eleven characters may or may not
>> have been fast-pathed. Perhaps a few more points varying from
>> completely unfast-pathable (e.g.
>> "o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308") to
>> somewhat fast-pathable might be more helpful.
>> Michael D. Adams
>> On Thu, Feb 19, 2009 at 1:08 PM, Mark Davis <> wrote:
>> > In response to questions from some people in the W3C, I put together an FAQ
>> > on NFC normalization, at
>> >
>> > I have some figures on performance and footprint in there as examples; if
>> > anyone else has figures from other implementations, I'd appreciate them.
>> >
>> > Mark
>> >