From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Feb 23 2009 - 01:25:29 CST
On 2/22/2009 6:13 PM, Doug Ewell wrote:
> Mark Davis wrote:
>
>> an illustrative sample simulating documents would be
>>
>> simulating content:
>>
>> 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other
>> Latin, ...) not needing normalization, and
>>
>> 200 characters needing normalization,
>
> If you did happen to run into some data that started out in NFD --
> say, generated on a Mac -- you'd have a lot more than 0.02% of content
> characters needing normalization.
I think it would be worthwhile to collect what I would call "reasonable
worst case" examples.
For an example to be reasonable, it would have to contain data typical
for a particular language, with character distributions typical of
larger corpora in that language. It would also have to rest on a
reasonable assumption about its origin; something like "NFD data
created on a Mac" would qualify, but for some languages there may be
other data formats that require reordering and could be a worse case.
For "worst case" one would then pick one of the languages and one of the
data formats that give the most scope to actually exercise the
normalization part of the implementation algorithm. NFD and unnormalized
data might stress the implementation differently.
With such sample cases it would be possible to estimate "reasonable
worst case behavior" for various implementation strategies.
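A minimal timing sketch along those lines might look like the
following; again java.text.Normalizer stands in for the implementation
under test, and the sample text and iteration count are placeholders:

    import java.text.Normalizer;

    public class NormalizationBench {
        // Average time per pass, in nanoseconds, to produce NFC.
        static long timeNfc(String input, int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                Normalizer.normalize(input, Normalizer.Form.NFC);
            }
            return (System.nanoTime() - start) / iterations;
        }

        public static void main(String[] args) {
            // Placeholder sample; a real test would use language-typical
            // corpora such as the ones discussed above.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                sb.append("Voilà une phrase déjà composée, plutôt banale. ");
            }
            String nfc = Normalizer.normalize(sb, Normalizer.Form.NFC);
            String nfd = Normalizer.normalize(sb, Normalizer.Form.NFD);

            // Already-normalized input should mostly hit the fast path;
            // NFD input forces actual composition work.
            System.out.println("NFC input: " + timeNfc(nfc, 200) + " ns/pass");
            System.out.println("NFD input: " + timeNfc(nfd, 200) + " ns/pass");
        }
    }

Comparing the two figures gives a first impression of how much the
fast path for already-normalized input is worth for a given strategy.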
French data in NFD might require simple combination for about 10% of
the characters (very rough guess), but probably no reordering. Some
South Asian data in keyboard order might need reordering, but I can't
estimate for what percentage of characters.
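To put a number on a guess like that 10% figure, one could count
combining marks in a decomposed sample. The sketch below uses the
JDK's Character.getType() as a rough proxy; it counts only non-spacing
marks, so it says nothing about data that needs reordering rather than
composition:

    import java.text.Normalizer;

    public class MarkRatio {
        public static void main(String[] args) {
            // Placeholder French text; a real estimate needs a larger corpus.
            String text = "Hôtel près de l'église, déjà fermé à Noël.";
            String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);

            int marks = 0, total = 0;
            for (int i = 0; i < nfd.length(); ) {
                int cp = nfd.codePointAt(i);
                if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                    marks++;
                }
                total++;
                i += Character.charCount(cp);
            }
            System.out.printf("%d of %d code points are combining marks (%.1f%%)%n",
                              marks, total, 100.0 * marks / total);
        }
    }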
The point of such an exercise would be to make sure that
implementations are fast enough when faced with data that, for one
reason or another, happen to resemble one of these "reasonable worst
cases".
A./