From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Feb 23 2009 - 01:25:29 CST
On 2/22/2009 6:13 PM, Doug Ewell wrote:
> Mark Davis wrote:
>
>> an illustrative sample simulating documents would be
>>
>> simulating content:
>>
>> 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other 
>> Latin, ...) not needing normalization, and
>>
>> 200 characters needing normalization,
>
> If you did happen to run into some data that started out in NFD -- 
> say, generated on a Mac -- you'd have a lot more than 0.02% of content 
> characters needing normalization.
I think it would be worthwhile to collect what I would call "reasonable 
worst case" examples.
For an example to be reasonable, it would have to contain data typical for a 
certain language, with character distributions typical of larger corpora in 
that language. It would also have to make a reasonable assumption about the 
data's origin; something like "NFD data created on a Mac" would qualify, but 
for some languages there may be other data formats that require reordering, 
which could be a worse case.
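To make that concrete, here is a minimal sketch (assuming Python's 
unicodedata module; the French sample string is only illustrative, not 
measured corpus data) of how one might simulate "NFD data created on a Mac" 
and count how much of it a normalizer would actually have to touch:

    import unicodedata

    # Take text typical for the language and decompose it, simulating
    # "NFD data created on a Mac". The sample string is only illustrative.
    composed = "Le cœur déçu mais l'âme plutôt naïve"
    nfd_sample = unicodedata.normalize("NFD", composed)

    # Roughly how many code points an NFC normalizer would have to touch:
    # combining marks carry a nonzero canonical combining class.
    marks = sum(1 for ch in nfd_sample if unicodedata.combining(ch))
    print(marks, "of", len(nfd_sample), "code points are combining marks")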
For "worst case" one would then pick one of the languages and one of the 
data formats that give the most scope to actually exercise the 
normalization part of the implementation algorithm. NFD and unnormalized 
data might stress the implementation differently.
With such sample cases it would be possible to estimate "reasonable 
worst case behavior" for various implementation strategies.
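For example (again only a sketch, assuming Python 3.8+ so that 
unicodedata.is_normalized is available; the filler text and repetition 
counts are arbitrary), one could compare an "always normalize" strategy with 
a "quick-check first" strategy on both already-normalized and NFD input:

    import timeit
    import unicodedata

    nfc_text = "résumé plus plain ASCII filler " * 100000      # already in NFC
    nfd_text = unicodedata.normalize("NFD", nfc_text)           # same text, decomposed

    def check_then_normalize(s):
        # Quick-check first, do the full normalization only when needed.
        if unicodedata.is_normalized("NFC", s):
            return s
        return unicodedata.normalize("NFC", s)

    for label, text in (("NFC input", nfc_text), ("NFD input", nfd_text)):
        t_always = timeit.timeit(lambda: unicodedata.normalize("NFC", text), number=10)
        t_check = timeit.timeit(lambda: check_then_normalize(text), number=10)
        print(label, "always-normalize:", t_always, "check-first:", t_check)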
French data in NFD might require simple combination for about 10% of the 
characters (a very rough guess), but probably no reordering. Some South 
Asian data in keyboard order might need reordering, but I can't estimate for 
what percentage of characters.
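To illustrate the reordering case (once more a Python sketch; the mark 
sequence is a made-up stand-in for real keyboard-order data, not actual 
South Asian text), canonical ordering has to swap marks that were typed in 
the "wrong" combining-class order:

    import unicodedata

    # Hypothetical "keyboard order": acute (combining class 230) typed before
    # dot below (combining class 220). Canonical ordering must swap them.
    typed = "q\u0301\u0323"
    reordered = unicodedata.normalize("NFD", typed)

    print([hex(ord(c)) for c in typed])       # ['0x71', '0x301', '0x323']
    print([hex(ord(c)) for c in reordered])   # ['0x71', '0x323', '0x301']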
The point of such an exercise would be to make sure that implementations 
are fast enough when faced with data that, for one reason or another, 
happens to resemble one of these "reasonable worst cases".
A./