From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Apr 11 2008 - 13:35:56 CDT
Henrik said:
> I noticed that for some scripts, e.g. Khmer, character names are still
> a mouthful. I also noticed that when I additionally ignored
> CONSONANT, VOWEL, and INDEPENDENT, the Unicode names are still unique
> and it would improve writing (at least) Khmer character names a lot.
>
> I was wondering whether it would be feasible to tighten the condition
> in TR#34 so that no upcoming Unicode versions had ambiguous names if
> CONSONANT, VOWEL, and INDEPENDENT were ignored, too.
As Mark indicated, this is always something you could formally
propose to the UTC as something for them to consider.
Personally, however, I would not be in favor of this kind of
change.
First, it further complicates the checking that has to be done
when new characters, formal name aliases, and named sequences are
proposed. Granted, this can all be done mechanically, but it
is already something that requires a specialized algorithm
that is not generally available to (or often understood by) character
proposers or those reviewing the proposals.
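To make the "mechanical" check concrete, here is a minimal sketch of it in Python. It is not the algorithm used by the UTC or WG2: the function name `find_collisions` is hypothetical, the set of ignorable words is supplied by the caller rather than taken from TR#34, and it only drops whole words (any handling of spaces and hyphens that the real rule requires is left out).

```python
from collections import defaultdict


def find_collisions(names, ignorable):
    """Group names that become identical once ignorable words are dropped.

    `names` is any iterable of names (character names, aliases, named
    sequences); `ignorable` is a set of whole words to ignore. Both are
    assumptions of this sketch, not definitions taken from the standard.
    """
    groups = defaultdict(list)
    for name in names:
        key = " ".join(w for w in name.split() if w not in ignorable)
        groups[key].append(name)
    # Only keys shared by more than one name are ambiguous.
    return {key: members for key, members in groups.items() if len(members) > 1}


sample = [
    "KHMER LETTER KA",
    "KHMER INDEPENDENT VOWEL QAQ",
    "KHMER VOWEL SIGN AA",
    "KHMER SIGN NIKAHIT",
]
# An empty result means the sample stays unambiguous under this word set.
print(find_collisions(sample, {"LETTER", "VOWEL", "INDEPENDENT"}))
```

Every proposed character name, formal alias, and named sequence would have to go through the same check against the whole existing namespace, which is why each additional ignorable word adds to the reviewing burden.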
Second, any such restriction would have to be written into
ISO/IEC 10646, as well as the Unicode Standard. I can tell
you from experience that it was a considerable problem getting
even the limited constraints now documented to consensus for
documentation in 10646, and getting that through ballots and
publication. National Bodies are (justifiably, I think) concerned and
worried about algorithmic constraints on their ability to
name things, particularly when the constraints get complicated
to the point that they can't remember all the details or
envision being able to check manually for uniqueness.
The requirement that the unique namespace include formal aliases
and named sequences, as well as character names per se, has
already pushed this constraint off the edge, in terms of the
degree of complication that the average standardizer will
tolerate.
> Of course, there may be more ignorable words, so the question is where
> to stop. 'VOWEL' is in 360 names, which is more than 'CHARACTER',
> which is in only 106. But CONSONANT and INDEPENDENT are relatively
> rare. Here are a few other words that occur very frequently that
> can currently be ignored without ambiguity:
>
> VOWEL in 360 names
> CONSONANT in 66 names
> INDEPENDENT in 19 names (seldom, but also a mouthful)
> SYLLABICS in 630 names
> LIGATURE in 508 names
> FORM in 798 names
> PATTERN in 297 names
This illustrates the problem: where *do* you stop? I have run
into similar data from another point of view -- in examining
the Unicode names list for redundancies that allow creation
of specialized algorithms to pack it down into much smaller
storage without making use of generic compression algorithms
like LZW.
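Counts like the ones quoted above can be reproduced roughly as follows. This is a sketch only: the helper names are hypothetical, the names come from whatever Unicode version Python's `unicodedata` module ships with (so the figures will not match the 2008 numbers), and hyphenated words are counted as single words.

```python
import sys
import unicodedata
from collections import Counter


def all_character_names():
    """Yield the name of every code point that has one in this Python build."""
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield name


def word_frequencies(names):
    """Count, for each word, how many names contain it at least once."""
    counts = Counter()
    for name in names:
        counts.update(set(name.split()))
    return counts


if __name__ == "__main__":
    freq = word_frequencies(all_character_names())
    for word in ("VOWEL", "CONSONANT", "INDEPENDENT", "SYLLABICS",
                 "LIGATURE", "FORM", "PATTERN"):
        print(f"{word} in {freq[word]} names")
```

The same frequency table is also the natural starting point for a word-based packing of the names list, since the most frequent words are the ones worth replacing with short codes.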
>
> For stability reasons, it would be very nice if we knew that upcoming
> Unicode versions had the same nice unambiguity, because then I could
> officially ignore those words so my users could enjoy more concise
> character names.
It is unlikely that the UTC or WG2 will depart significantly from
the patterns they already have in naming characters. And that
means that you'd likely be pretty safe in assuming you could
ignore (and/or delete) such redundant terms when doing name
recognition.
But as an example of the pitfalls here, "VOWEL" and "LETTER"
could both be deleted without loss of uniqueness, but
"VOWEL SIGN" cannot be deleted as well. Vide: DEVANAGARI LETTER I
versus DEVANAGARI VOWEL SIGN I. But if you just omit "LETTER" and
"VOWEL" in this case, you end up with shortened names that
aren't actually very apropos for Devanagari: DEVANAGARI I
versus DEVANAGARI SIGN I. More appropriate shortenings would
be to DEVANAGARI INDEPENDENT I versus DEVANAGARI I or
perhaps DEVANAGARI LETTER I versus DEVANAGARI MATRA I or
something else.
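The Devanagari case can be checked directly. The following is a small sketch, with a hypothetical helper `shorten`, that drops whole words from the two names and compares the results for two different word sets.

```python
import unicodedata


def shorten(name, drop):
    """Remove every whole word in `drop` from a character name."""
    return " ".join(w for w in name.split() if w not in drop)


letter_i = unicodedata.name("\u0907")  # DEVANAGARI LETTER I
matra_i = unicodedata.name("\u093F")   # DEVANAGARI VOWEL SIGN I

for drop in ({"VOWEL", "LETTER"}, {"VOWEL", "SIGN", "LETTER"}):
    a, b = shorten(letter_i, drop), shorten(matra_i, drop)
    verdict = "collision" if a == b else "distinct"
    print(sorted(drop), "->", repr(a), "vs", repr(b), ":", verdict)
```

Dropping only VOWEL and LETTER keeps the two names distinct (DEVANAGARI I versus DEVANAGARI SIGN I), while also dropping SIGN collapses both to DEVANAGARI I. Neither outcome says anything about whether the shortened names are good names for users, which is the larger point.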
In general, I don't think that simple algorithmic transforms
on the Unicode names list do a very good job of creating
the most usable names for end users.
--Ken