Just so story: Why isn't o-slash decomposed? (was: Re: Character folding in text editors)
kenwhistler at att.net
Mon Feb 22 15:19:56 CST 2016
On 2/21/2016 9:53 AM, Doug Ewell wrote:
>>> But that still doesn't work for a character like ø, which doesn't
>>> decompose to o + anything
>> Why doesn't it, btw? Same question about ł.
>> I've heard an opinion that UnicodeData.txt only included
>> decompositions when the combining mark's glyphs don't overlap those of
>> the basic character. Is that correct?
> This sounds like a great question for Ken Whistler. ☺
Well, with a softball pitch like that one... ;-)
The basics are described in TUS 8.0, Section 2.12, Equivalent Sequences,
on p. 65, in "Non-decomposition of Certain Diacritics."
As to the inevitable "why?" question: well, the UTC had to draw a line
*somewhere* between clearly independent graphical combining
marks applied to clearly distinct bases, versus completely idiosyncratic
adjustment of base letter shape to create new letters. (For an
example of the latter, think U+025E LATIN SMALL LETTER CLOSED
REVERSED OPEN E, as a "sorta e-like character".)
The decision was made by the original architects of Unicode, back
at the point when the concept of decomposition was getting
formalized (circa 1991), to draw the line thus:
A. Clearly detached marks, plus a few attached marks at the
"periphery" of the base that have predictable positions and
do not distort the base letter shape (e.g., cedilla, ogonek, the
B. Overlaid marks (bars, slashes) and various hooks, curls,
and the Cyrillic descenders. These have fairly unpredictable
positions, so fallback displays tend to look bad, and the
effect on the base letter shape is also unpredictable for the
hooks and curls types of "diacritic" letter formation.
Also in this category were any turned, rotated, reversed,
or flipped letters.
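
The A/B split above can be checked against the character database that ships with Python. This is a minimal sketch, not part of the original message: the helper name `describe` is made up for illustration, and the result depends on the Unicode data bundled with the Python build. It looks up the canonical decomposition of two category A letters (cedilla, ogonek) and two category B letters (slash, stroke).

```python
import unicodedata

def describe(ch):
    """Return (character name, names of the canonical decomposition parts).

    Hypothetical helper for illustration; the list is empty when the
    character has no canonical decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings start with a tag such as "<compat>"; ignore them.
    if fields and fields[0].startswith("<"):
        fields = []
    parts = [unicodedata.name(chr(int(field, 16))) for field in fields]
    return unicodedata.name(ch), parts

# c with cedilla, a with ogonek, o with stroke (o-slash), l with stroke
for ch in "\u00E7\u0105\u00F8\u0142":
    name, parts = describe(ch)
    shown = " + ".join(parts) if parts else "(no decomposition)"
    print(f"U+{ord(ch):04X} {name}: {shown}")
```

The first two letters report a base letter plus a combining mark; the last two report nothing, which is the behavior the original question was about.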
Note that this line is not exactly the same as what the early
drafts (and the eventual Unicode 1.0) encoded for combining
marks, because a few of the most productive Latin overlaid
and attached combining marks were separately encoded.
This tends to be the root of most current confusion about
the topic for people who try to understand the Unicode
Standard long after the initial decisions were all
engraved in stone. Having the overlaid diacritics (and
at least the phonetic hooks) separately encoded enabled
some productive use of them before further surveys resulted in
filling out the atomic encoding of Latin letters with bars and hooks
(see, e.g., Latin Extended-C and the Phonetic Extensions
Supplement for many examples). But actually, having separate
encoding of the overlaid diacritics, hooks, etc., is also useful
for other purposes -- for collation, for example, they provide
natural targets for assigning the secondary weights, which
then can be used for the artificially introduced decompositions
of letters with bars, letters with slashes, letters with hooks, etc.,
either for the DUCET or for tailorings which want to treat
such combinations as having secondary diacritic weights,
rather than as primary weight-distinct atomic letters.
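
The collation point can be illustrated with a toy two-level sort key. This is only a sketch under loud assumptions: the table `ARTIFICIAL_DECOMPOSITIONS` and the function `toy_sort_key` are invented for illustration, the "weights" are just the characters themselves, and nothing here is copied from the DUCET or from any real tailoring. It treats o-slash as o plus U+0338 COMBINING LONG SOLIDUS OVERLAY for sorting purposes only, so that the slash counts as a secondary difference.

```python
import unicodedata

# Hypothetical, hand-written table for illustration only. It is NOT taken
# from UnicodeData.txt or from the DUCET.
ARTIFICIAL_DECOMPOSITIONS = {
    "\u00F8": "o\u0338",  # o-slash -> o + COMBINING LONG SOLIDUS OVERLAY
}

def toy_sort_key(text):
    """Build a (primary, secondary) key: base letters first, marks second."""
    expanded = "".join(ARTIFICIAL_DECOMPOSITIONS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", expanded)
    # Characters with combining class 0 act as primary "weights" here.
    primary = tuple(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Combining marks (nonzero combining class) act as secondary "weights".
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch) != 0)
    return primary, secondary

words = ["\u00F8l", "ox", "ol"]            # "øl", "ox", "ol"
by_code_point = sorted(words)              # o-slash sorts after every "o..."
by_toy_key = sorted(words, key=toy_sort_key)  # o-slash sorts next to plain o
print(by_code_point, by_toy_key)
```

With the toy key, the slashed word lands directly after its unslashed counterpart instead of after all words starting with plain o. Note that normalization itself is unaffected: `unicodedata.normalize("NFC", "o\u0338")` stays a two-character sequence and does not become U+00F8.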