Re: Case-folding dotted i from Philippe Verdy on 2013-02-01 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 2 Feb 2013 01:48:57 +0100

2013/1/31 Joó Ádám <adam_at_jooadam.hu>:
>> Blame the invention of the dot over the i, or the convention of omitting it
>> when adding accents, or the adoption much later of a specifically dotless i
>> into the Turkish alphabet...
>
> Or the invention of a soft accent, for that matter. If the dot would
> be explicitly encoded in all cases, no problem would arise.

It would be wrong. The soft dot did not exist initially and appeared
only as a glyphic feature in some medieval calligraphy (for the cursive
script). Today the presence of this soft-dot is not justified in most
languages, as it carries absolutely no semantic value and CAN safely be
omitted (even if most common non-cursive fonts still display it).

It is also quite common to have this soft-dot decorated or replaced
by something else, like a small heart, but in that case it carries a
supplemental meaning and should be explicitly encoded.

But why isn't there a COMBINING HEART ABOVE? Most often this heart
is drawn manually with strokes and not filled, but a filled variant
would also exist, and if it were encoded then we would have two
combining characters:
- COMBINING WHITE HEART ABOVE
- COMBINING BLACK HEART ABOVE

For usual Latin texts (except in Turkic alphabets), the soft-dot
should never be encoded, as it is a purely typographic feature: the
soft-dotted small i (from ASCII) can equally be drawn with or without
the dot, as long as there's no other combining character above it.
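
A small check of that point, as a sketch only (it uses Python's
standard unicodedata module; the variable names are mine): the
precomposed accented letters decompose to the plain ASCII letter plus
the accent, with no dot encoded anywhere.

```python
import unicodedata

# i + COMBINING ACUTE ACCENT composes to the precomposed i-acute (U+00ED).
i_acute = unicodedata.normalize("NFC", "i\u0301")

# The canonical decompositions list only the base letter and the accent:
# no COMBINING DOT ABOVE (U+0307) appears in either of them.
i_acute_parts = unicodedata.decomposition("\u00ed")   # "0069 0301"
j_caron_parts = unicodedata.decomposition("\u01f0")   # "006A 030C"

print(repr(i_acute), i_acute_parts, j_caron_parts)
```

Whether the dot is drawn under the accent is then left to the font,
not to the encoded text.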

The encoding for Turkic, however, SHOULD NEVER use the soft-dotted i
alone: it should be either the explicit dotless i, or the letter i
with a combining dot above (though this one would have been better
encoded as dotless i + combining dot above, so that Turkic languages
would have avoided all confusion by using only the dotless i). But
there is now a long history of using the soft-dotted i to encode the
hard-dotted i used in Turkic alphabets, so both should be treated as
equivalent, even if they are not strictly canonically equivalent.
This is problematic unless we use collation rules that treat them as
equivalent at all levels except the last, binary level (kept for the
few applications that still want to make the distinction in a
multilingual context or when the language is not determined).
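
To make the collation idea concrete, here is a minimal sketch in
Python. It is not a real collation tailoring: loose_key and sort_key
are hypothetical helper names of mine, and the only thing they do is
fold "i + dot above" and "dotless i + dot above" onto plain i, then
fall back to the raw code points as the last (binary) level.

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"
DOTLESS_I = "\u0131"

def loose_key(text):
    """Hypothetical matching key: i, i + dot above and dotless i + dot
    above all compare equal; a bare dotless i stays distinct."""
    text = unicodedata.normalize("NFD", text)
    text = text.replace(DOTLESS_I + COMBINING_DOT_ABOVE, "i")
    text = text.replace("i" + COMBINING_DOT_ABOVE, "i")
    return text

def sort_key(text):
    # Loose key first; the original code points act as the final,
    # binary tie-breaker for applications that need the distinction.
    return (loose_key(text), text)

words = ["kitap", "ki\u0307tap", "k\u0131\u0307tap", "k\u0131tap"]
print(sorted(words, key=sort_key))
```

The first three spellings share one loose key and differ only at the
tie-breaker; the fourth (bare dotless i) remains a different word.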

Let's keep the ASCII small i as it is: always soft-dotted, with an
optional dot above, which MUST disappear when there's any other
combining character above it or attached above. For all other cases:
where it MUST NEVER show a dot, use the dotless i; and where it
MUST ALWAYS show the dot, preferably use soft-dotted i + dot above
(because this is the current practice, which also matches the Turkic
special casing rules in the UCD), or, mostly equivalently, dotless i +
dot above (knowing that it is a confusable which should be listed as
such in the auxiliary UCD file of confusables, because it is not
canonically equivalent and not even compatibility equivalent).
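
The non-equivalence can be checked with a short sketch (Python's
standard unicodedata module; same_under is a hypothetical helper of
mine). It also shows that the default, language-independent lowercase
of the capital I with dot above is precisely "i + dot above", which is
the current practice referred to above.

```python
import unicodedata

SOFT_DOTTED_PLUS_DOT = "i\u0307"     # i + COMBINING DOT ABOVE
DOTLESS_PLUS_DOT = "\u0131\u0307"    # dotless i + COMBINING DOT ABOVE

def same_under(form, a, b):
    """Hypothetical helper: do a and b normalize to the same string?"""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Neither the canonical nor the compatibility forms unify the two.
results = {
    form: same_under(form, SOFT_DOTTED_PLUS_DOT, DOTLESS_PLUS_DOT)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Python's str.lower() applies the default (non-Turkic) mapping for
# U+0130, which yields i followed by U+0307.
default_lower = "\u0130".lower()

print(results, [hex(ord(c)) for c in default_lower])
```

All four comparisons come out False, so any matching of the two
sequences has to come from confusables data or a collation tailoring,
not from normalization.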

Same consideration for the soft-dotted j.
Received on Fri Feb 01 2013 - 18:51:15 CST
