Re: A few questions about decomposition, equivalence and rendering

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 06 2002 - 14:45:11 EST


Juliusz continued:

> KW> There is no good reason to invent composite combining marks
> KW> involving two accents together. (In fact, there are good reasons
> KW> *not* to do so.) The few that exist, e.g. U+0344, cause
> KW> implementation problems and are discouraged from use.
>
> What are those problems? As long as they have canonical
> decompositions, won't such precomposed characters be discarded at
> normalisation time, hopefully during I/O?
>
> (I'm not arguing in favour of precomposed characters; I'm just saying
> that my gut instinct is that we have to deal with normalisation
> anyway, and hence they don't complicate anything further; I'd be
> curious to hear why you think otherwise.)

Perhaps I overstated the case slightly. It is true enough that if
you are working with normalized data, U+0344 gets normalized away:

% egrep 0344 NormalizationTest-3.2.0d6.txt
0344;0308 0301;0308 0301;0308 0301;0308 0301; # ... COMBINING GREEK DIALYTIKA TONOS

and you just end up with an otherwise typical sequence of combining marks.
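If you would rather check this without grepping the test file, here is a
minimal sketch using Python's standard unicodedata module. The choice of
Python and the helper name normalized_code_points are mine, purely for
illustration, and the module tracks whatever Unicode version your Python
ships with rather than 3.2.0d6. It shows U+0344 going to the same two
combining marks under each of the four normalization forms:

```python
import unicodedata

# U+0344 COMBINING GREEK DIALYTIKA TONOS, one of the discouraged composites.
DIALYTIKA_TONOS = "\u0344"


def normalized_code_points(text, form):
    """Return the code points of `text` after normalization, as hex strings.

    `form` is one of "NFC", "NFD", "NFKC", "NFKD".
    (Helper name is illustrative, not part of any standard API.)
    """
    normalized = unicodedata.normalize(form, text)
    return [f"{ord(ch):04X}" for ch in normalized]


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, " ".join(normalized_code_points(DIALYTIKA_TONOS, form)))
```

Note that even NFC leaves the decomposed pair in place: U+0344 is excluded
from composition, so nothing recomposes it.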

However, the complication is in the statement of the algorithm,
where you end up having to talk about (and include in your tables)
the "Non-Starter Decompositions". See CompositionExclusions.txt, which
has a special section mentioning just these four oddballs:

# ================================================
# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.
# These characters are simply quoted here for reference.
# ================================================

# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

Note also that all four of these characters get "use of this character
is discouraged" notes in the Unicode names list.
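The derivation rule quoted from CompositionExclusions.txt can be sketched
directly. The following is a rough illustration, not the procedure used to
generate the file: it assumes Python's unicodedata module as a stand-in for
UnicodeData.txt (so it reflects the Unicode version bundled with your
Python), and the function name non_starter_decompositions is hypothetical.

```python
import sys
import unicodedata


def non_starter_decompositions():
    """List code points whose canonical decomposition is a sequence of
    characters, the first of which has a non-zero combining class."""
    found = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no decomposition, and compatibility
        # decompositions, which are tagged like "<compat> 0020 0308".
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        # A "sequence" here means more than one character; singleton
        # decompositions fall under a different section of the file.
        if len(parts) < 2:
            continue
        first = chr(int(parts[0], 16))
        if unicodedata.combining(first) != 0:
            found.append(cp)
    return found


if __name__ == "__main__":
    for cp in non_starter_decompositions():
        print(f"{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

The point is that an implementation has to carry this extra test (or the
resulting list) in addition to the ordinary exclusions, which is exactly
the complication in stating the algorithm.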

These characters also create a problematic edge case when the tables
for the Unicode Collation Algorithm are processed to provide proper
weightings.

> >> does anyone [have] a map from mathematical characters to the
> >> Geometric Shapes, Misc. symbols and Dingbats that would be useful
> >> for rendering?
>
> KW> As opposed to the characters themselves? I'm not sure what you
> KW> are getting at here.
>
> The user invokes a search for ``f o g'' (the composite of g with f),
> and she entered U+25CB WHITE CIRCLE. The document does contain the
> required formula, but encoded with U+2218 RING OPERATOR. The user's
> input was arguably incorrect, but I hope you'll agree that the search
> should match.
>
> I'm rendering a document that contains U+2218. The current font
> doesn't contain a glyph associated to this codepoint, but it has a
> perfectly good glyph for U+25CB. The rendering software should
> silently use the latter.
>
> Analogous examples can be made for the ``modifier letters''.
>
> I'll mention that I do understand why these are encoded separately[1],
> and I do understand why and how they will behave differently in a
> number of situations. I am merely noting that there are applications
> (useful-in-practice search, rendering) where they may be identified or
> at least related, and I am wondering whether people have already
> compiled the data necessary to do so.

I don't think so -- at least not officially within the Unicode
Consortium. This is concerned with shape similarities that go
beyond the kind of character folding implicit in the Unicode
Collation Algorithm.

The Unicode names list provides a considerable number of cross-references
for similarly-shaped characters and confusables, but this is, of
course, far short of a detailed listing that could be used as
the basis of a specification for shape-based folding for search
purposes.
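Absent such data, an application would have to supply its own table. As a
sketch of what shape-based folding for search might look like, assuming
Python: the SHAPE_FOLD table below is hand-built and hypothetical,
containing only the one pair from Juliusz's example, and the function names
are likewise made up for illustration. It is not official Unicode data.

```python
# Hypothetical, hand-built shape-folding table: maps a look-alike code point
# to the representative it should be treated as for matching purposes.
# This is NOT official Unicode data; a real table would need many entries.
SHAPE_FOLD = {
    0x25CB: 0x2218,  # WHITE CIRCLE -> RING OPERATOR
}


def shape_fold(text):
    """Replace each look-alike character with its chosen representative."""
    return text.translate(SHAPE_FOLD)


def shape_insensitive_find(haystack, needle):
    """Return the index of `needle` in `haystack`, ignoring shape differences
    covered by SHAPE_FOLD, or -1 if there is no match.

    The table maps single code points to single code points, so indices in
    the folded string line up with indices in the original.
    """
    return shape_fold(haystack).find(shape_fold(needle))


if __name__ == "__main__":
    document = "consider f \u2218 g"   # encoded with RING OPERATOR
    query = "f \u25CB g"               # user typed WHITE CIRCLE
    print(shape_insensitive_find(document, query))
```

A font-fallback table for rendering would have the same general shape, just
consulted when a glyph is missing rather than at match time.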

--Ken


