From: philip chastney (philip_chastney@yahoo.com)
Date: Mon Nov 24 2008 - 04:17:05 CST
--- On Sun, 23/11/08, Karl Pentzlin <karl-pentzlin@acssoft.de> wrote:
From: Karl Pentzlin <karl-pentzlin@acssoft.de>
Subject: Re: Why people still want to encode precomposed letters
To: philip_chastney@yahoo.com
Cc: "Unicode Mailing List" <unicode@unicode.org>
Date: Sunday, 23 November, 2008, 10:48 PM
On Sunday, 23 November 2008 at 22:01, philip chastney wrote:
pc> A couple of quick questions. First, about how long would the list of
pc> combinations be?
pc> if we take 32-ish Latin characters, 24 Greek and 36-ish Cyrillic
pc> characters, and double that for upper and lower case, we have 144
pc> potential base characters
pc> Combining Diacritical Marks (0300~036F) lists 112 characters
pc> ...
pc> we can refine that figure
pc> Latin characters use about 40 marks, Greek perhaps half-a-dozen
pc> (if we count the cases where 2 marks are used) and Cyrillic about 12
pc> ( 32 × 40 ) + ( 24 × 6 ) + ( 32 × 12 ) = 1808 potential
pc> combinations per case, which gives us a tighter limit of 3,600
pc> combinations
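The per-case arithmetic in the quoted estimate can be checked with a short sketch. The names `ESTIMATE` and `combinations_per_case` are illustrative assumptions; the letter and mark counts are the rough figures from the quoted message, not measured values.

```python
# Rough counts taken from the quoted estimate:
# script -> (base letters per case, combining marks in use).
ESTIMATE = {
    "Latin": (32, 40),
    "Greek": (24, 6),
    "Cyrillic": (32, 12),
}


def combinations_per_case(estimate):
    """Sum letters x marks over all scripts, for a single case."""
    return sum(letters * marks for letters, marks in estimate.values())


per_case = combinations_per_case(ESTIMATE)
both_cases = 2 * per_case  # upper and lower case

print(per_case, both_cases)
```

Doubling the per-case total gives 3616, which the message rounds to 3,600.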
If you take into account that:
- many people (e.g. linguists and writers of North American indigenous
languages) routinely attach 3 diacritical marks to a single base letter,
- there are "double diacritics" which attach to arbitrary pairs of
base letters,
- there may eventually be "triple diacritics" which attach to arbitrary
triplets of base letters,
this number gets somewhat higher.
not at all
the figure of ( 32 × 40 ) for Latin lowercase is an upper limit -- i.e., it overstates the likely requirement
where information is sparse, the technique is to set upper and lower limits and then try to refine them, to see how close together you can get them
in this case, ( 32 × 40 ) twice = 2560 -- that's an upper limit
the number of Latin-based combinations already included in TUS (The Unicode Standard) is 500~600 -- that's a lower limit
note that the lower limit is approximately 20~25% of the upper limit -- i.e., they are within a decimal order of magnitude
in this case, the number of double and triple diacritics found in North American indigenous languages could be 3× the number of composites already included in TUS, without busting that upper limit ( 600 + 3 × 600 = 2400, which is still under 2560 ) -- I think that limit is safe
note that the double diacritics found in Vietnamese are already included in the 500~600 figure
you could incorporate an allowance for double and triple diacritics into that first WAG, but I really don't see the point -- it gives you no useful information
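As a rough illustration of the bounding argument above, the following sketch compares the Latin upper limit with the 500~600 lower limit and checks how much headroom remains. The function names are hypothetical, and the inputs are only the approximate figures given in this message.

```python
LATIN_LETTERS = 32  # approximate base letters per case
LATIN_MARKS = 40    # approximate combining marks used with Latin

# Upper limit: every letter with every mark, in both cases.
upper = 2 * LATIN_LETTERS * LATIN_MARKS

# Lower limit: approximate range of Latin composites already in TUS.
lower_range = (500, 600)


def ratio(lower, upper_limit):
    """Lower limit as a fraction of the upper limit."""
    return lower / upper_limit


def fits_under_upper(lower, extra_multiple, upper_limit):
    """True if existing composites plus extra_multiple times as many
    new combinations still stay within the upper limit."""
    return lower * (1 + extra_multiple) <= upper_limit


for lower in lower_range:
    print(lower, round(ratio(lower, upper), 3), fits_under_upper(lower, 3, upper))
```

With these figures the lower limit is roughly a fifth to a quarter of the upper limit, and adding three times the existing number of composites still fits under it.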
/phil