From: Henrik Theiling (ht@theiling.de)
Date: Fri Apr 11 2008 - 04:38:30 CDT
Hi!
TR#34 states that all character and sequence names (except one pair
involving HANGUL JUNGSEONG O-E) will always be unique when space,
medial dash and the words LETTER, CHARACTER, and DIGIT are ignored.
When writing a character name recognition algorithm, I would like to
let the user be as concise as possible, yet without violating Unicode
rules, and without being in potential conflict with upcoming versions
of Unicode. As I understand it, the rule that LETTER, CHARACTER,
DIGIT, spaces, and medial dashes can be ignored in comparison is
meant to address this very idea.
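
To make the matching rule concrete, here is a minimal sketch in Python of
how such a loose comparison key could be computed. The function name
loose_key and the constant IGNORABLE_WORDS are made up for illustration,
and treating "medial dash" as "a hyphen with a letter or digit on both
sides" is my own assumption, not wording taken from the TR.

```python
import re

# Words that may be dropped when comparing names (the three named above).
IGNORABLE_WORDS = frozenset({"LETTER", "CHARACTER", "DIGIT"})

def loose_key(name, ignorable=IGNORABLE_WORDS):
    """Return a comparison key for a character name (illustrative sketch)."""
    # Upper-case, split on whitespace, and drop whole ignorable words.
    words = [w for w in name.upper().split() if w not in ignorable]
    text = " ".join(words)
    # Remove medial hyphens only: a hyphen with a letter or digit on both
    # sides. A hyphen next to a space (as in "TIBETAN LETTER -A") is kept.
    text = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", text)
    # Finally drop the remaining spaces.
    return text.replace(" ", "")

if __name__ == "__main__":
    print(loose_key("LATIN SMALL LETTER A"))
    print(loose_key("latin small a"))
```

Both calls yield the same key, so the short form matches the full name.
Note that this sketch does not special-case the HANGUL JUNGSEONG O-E
pair mentioned above; the two names in that pair collapse to the same
key and would need separate handling.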
I noticed that for some scripts, e.g. Khmer, character names are still
a mouthful. I also noticed that when I additionally ignore
CONSONANT, VOWEL, and INDEPENDENT, the Unicode names are still unique,
which would make writing (at least) Khmer character names a lot easier.
I was wondering whether it would be feasible to tighten the condition
in TR#34 so that no upcoming Unicode version will have ambiguous names
when CONSONANT, VOWEL, and INDEPENDENT are ignored, too.
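
A check of this kind can be sketched in Python as follows. This is only
an illustration under stated assumptions: the helper names (loose_key,
find_collisions, all_character_names) are hypothetical, the medial-dash
handling is my own simplification, and the names come from Python's
unicodedata module, which reflects whatever Unicode version the
interpreter ships with and does not include named sequences.

```python
from collections import defaultdict
import re
import unicodedata

BASE_WORDS = frozenset({"LETTER", "CHARACTER", "DIGIT"})

def loose_key(name, ignorable):
    """Drop ignorable words, medial hyphens, and spaces (simplified)."""
    words = [w for w in name.upper().split() if w not in ignorable]
    text = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", " ".join(words))
    return text.replace(" ", "")

def find_collisions(names, ignorable):
    """Map each key shared by two or more distinct names to those names."""
    groups = defaultdict(set)
    for name in names:
        groups[loose_key(name, ignorable)].add(name)
    return {key: sorted(group) for key, group in groups.items()
            if len(group) > 1}

def all_character_names():
    """Yield the name of every named code point known to unicodedata."""
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield name

if __name__ == "__main__":
    extra = {"CONSONANT", "VOWEL", "INDEPENDENT"}
    clashes = find_collisions(all_character_names(), BASE_WORDS | extra)
    print(len(clashes), "colliding keys")
    for key, group in sorted(clashes.items()):
        print(key, group)
```

An empty result for a given word set means that ignoring those words is
unambiguous for that particular Unicode version; it says nothing about
later versions, which is exactly the stability question raised here.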
Of course, there may be more ignorable words, so the question is where
to stop. 'VOWEL' appears in 360 names, which is more than 'CHARACTER',
which appears in only 106. But CONSONANT and INDEPENDENT are relatively
rare. Here are these three, plus a few other frequently occurring
words, all of which can currently be ignored without ambiguity:
VOWEL in 360 names
CONSONANT in 66 names
INDEPENDENT in 19 names (rare, but also a mouthful)
SYLLABICS in 630 names
LIGATURE in 508 names
FORM in 798 names
PATTERN in 297 names
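
Counts of this kind can be gathered with a short Python sketch like the
one below. The function name word_name_counts is made up for
illustration, and splitting names on whitespace only is an assumption
about how the words are delimited. The numbers it prints depend on the
Unicode version bundled with the Python interpreter, so they will not
necessarily match the figures listed above.

```python
from collections import Counter
import unicodedata

def word_name_counts(names):
    """Count, for each word, the number of names that contain it."""
    counts = Counter()
    for name in names:
        # Use a set so a word repeated within one name is counted once.
        counts.update(set(name.split()))
    return counts

def all_character_names():
    """Yield the name of every named code point known to unicodedata."""
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield name

if __name__ == "__main__":
    counts = word_name_counts(all_character_names())
    for word in ("VOWEL", "CONSONANT", "INDEPENDENT", "SYLLABICS",
                 "LIGATURE", "FORM", "PATTERN"):
        print(word, "in", counts[word], "names")
```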
For stability reasons, it would be very nice to have a guarantee that
upcoming Unicode versions will keep this unambiguity, because then I
could officially ignore those words so that my users could enjoy more
concise character names.
Bye,
Henrik