Re: Why people still want to encode precomposed letters

From: philip chastney ([email protected])
Date: Mon Nov 24 2008 - 04:17:05 CST

Next message: philip chastney: "Re: Why people still want to encode precomposed letters"

Previous message: John Hudson: "Re: Why people still want to encode precomposed letters"
In reply to: Karl Pentzlin: "Re: Why people still want to encode precomposed letters"
Next in thread: John Hudson: "Re: Why people still want to encode precomposed letters"

--- On Sun, 23/11/08, Karl Pentzlin <[email protected]> wrote:

From: Karl Pentzlin <[email protected]>
Subject: Re: Why people still want to encode precomposed letters
To: [email protected]
Cc: "Unicode Mailing List" <[email protected]>
Date: Sunday, 23 November, 2008, 10:48 PM

Am Sonntag, 23. November 2008 um 22:01 schrieb philip chastney:

pc> A couple of quick questions. First, about how long would the list of
pc> combinations be?
pc> if we take 32-ish Latin characters, 24 Greek and 36-ish Cyrillic
pc> characters, and double that for upper and lower case, we have 144
pc> potential base characters
pc> Combining Diacritical Marks (0300~036F) lists 112 characters
pc> ...
pc> we can refine that figure
pc> Latin characters use about 40 marks, Greek perhaps half-a-dozen
pc> (if we count the cases where 2 marks are used) and Cyrillic about 12
pc> ( 32 × 40 ) + ( 24 × 6 ) + ( 32 × 12 ) = 1808 potential
pc> combinations per case, which gives us a tighter limit of 3,600
pc> combinations

If you take into account that:
- a lot of people (e.g. linguists and writers of North American indigenous
languages) routinely attach 3 diacritical marks to a base letter,
- there are "double diacritics" which attach to arbitrary pairs of
base letters,
- there possibly will be "triple diacritics" which attach to arbitrary
triplets of base letters,
this number gets somewhat higher.

not at all

the figure of ( 32 × 40 ) for Latin lowercase is an upper limit -- i.e., it overstates the likely requirement

where information is sparse, the technique is to set upper and lower limits and then refine them, to see how close together you can bring them

in this case, ( 32 × 40 ) twice = 2560 -- that's an upper limit

the number of Latin-based combinations already included in TUS is 500~600 -- that's a lower limit

note that the lower limit is approximately 20~25% of the upper limit -- i.e., they are within a decimal order of magnitude
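
The bounding arithmetic above can be checked with a short sketch. The figures are the ones quoted in this message; the variable names are illustrative only, and the 500~600 range is the estimate given here, not a count taken from the standard.

```python
# Figures as quoted in the message; all names are hypothetical.
LATIN_BASES_PER_CASE = 32   # "32-ish" Latin base letters
LATIN_MARKS = 40            # "about 40" marks used with Latin letters

# Upper limit: every base letter with every mark, for both cases.
upper = LATIN_BASES_PER_CASE * LATIN_MARKS * 2

# Lower limit: estimated Latin-based composites already in TUS.
lower_range = (500, 600)

# Ratio of each end of the lower estimate to the upper limit.
ratios = [lo / upper for lo in lower_range]

print(upper)                                  # 2560
print([round(r * 100, 1) for r in ratios])    # [19.5, 23.4]
```

Both ratios sit above 10%, which is what "within a decimal order of magnitude" amounts to here.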

in this case, the number of double and triple diacritics found in North American indigenous languages could be 3× the number of composites already included in TUS, without busting that upper limit -- I think that limit is safe
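
A minimal sketch of that headroom check, assuming the extra combinations come on top of the composites already encoded (the stricter reading of "3×"). The function name and its parameters are hypothetical.

```python
UPPER_LIMIT = 32 * 40 * 2   # Latin upper limit for both cases, 2560


def fits_under_limit(existing, factor=3, limit=UPPER_LIMIT):
    """Return (total, ok) when `factor` times `existing` is added on top.

    Assumption: the additions are counted in addition to the existing
    composites, so the total is existing * (1 + factor).
    """
    total = existing + factor * existing
    return total, total <= limit


# Check both ends of the 500~600 estimate quoted in the message.
for existing in (500, 600):
    total, ok = fits_under_limit(existing)
    print(existing, total, ok)
```

Even at the high end of the estimate the total stays under 2560, which is consistent with calling the limit safe.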

note that the double diacritics found in Vietnamese are already included in the 500~600 figure

you could incorporate an allowance for double and triple diacritics into that first WAG, but I really don't see the point -- it gives you no useful information

/phil


This archive was generated by hypermail 2.1.5 : Mon Nov 24 2008 - 04:20:25 CST