Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below) from Szelp, A. Sz. on 2013-06-19 (Unicode Mail List Archive)

From: Szelp, A. Sz. <a.sz.szelp_at_gmail.com>
Date: Wed, 19 Jun 2013 15:34:07 +0200

The COMMA BELOW / CEDILLA problem is typical of those that probably
cannot be solved in Unicode in a way that satisfies every possible aspect.[^1]
These problems are an artifact of the historical development of Unicode,
and for a standard, stability seems to be a high priority: usually a higher
priority than canonical equivalences and NFD, especially as NFC is
the usually recommended form.
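
To make the canonical-equivalence point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `nfd_codepoints` is my own, chosen for illustration. It shows that the Latvian letter encoded as K WITH CEDILLA decomposes to COMBINING CEDILLA (U+0327), while S WITH COMMA BELOW decomposes to COMBINING COMMA BELOW (U+0326), whatever a font actually draws.

```python
import unicodedata

def nfd_codepoints(text):
    """Return the code points of the NFD form of `text` (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

k_cedilla = "\u0137"  # LATIN SMALL LETTER K WITH CEDILLA (used for Latvian)
s_cedilla = "\u015F"  # LATIN SMALL LETTER S WITH CEDILLA
s_comma = "\u0219"    # LATIN SMALL LETTER S WITH COMMA BELOW

for ch in (k_cedilla, s_cedilla, s_comma):
    mark = unicodedata.normalize("NFD", ch)[1]
    # Print the character name, its NFD code points, and the mark's name.
    print(unicodedata.name(ch), nfd_codepoints(ch), unicodedata.name(mark))

# NFC recomposes the sequence, so stored NFC text round-trips unchanged.
assert unicodedata.normalize("NFC", "k\u0327") == k_cedilla
```

Because these decompositions are frozen by the stability policies, the mark that the Latvian letters canonically carry cannot be changed after the fact, however it is rendered.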

Fixing these is probably a "to keep in mind" item for a hypothetical
*NeoUniCode* standard of the future, like so many other issues. With modern
font technologies capable of language-dependent glyph variants, and with markup
languages, unifying them from the beginning might be one solution; another
would be to disunify the "both forms acceptable" case from either (with the
drawback of even more confusables). However, these considerations are pretty
academic and hypothetical from a current Unicode point of view.

The case is similar with CARON / COMMA ABOVE RIGHT in Czech/Slovak, which is
probably an even harder case. Here one might consider, for a hypothetical
*NeoUniCode* standard, encoding them as they canonically appear (with CARON
for uppercase and COMMA ABOVE RIGHT for lowercase) and defining
language-dependent casing behaviour, as is already done with LATIN SMALL
LETTER I and SMALL LETTER DOTLESS I / CAPITAL LETTER I and CAPITAL LETTER
I WITH DOT ABOVE for Turkish in the current Unicode standard. (And while at
it, one could consider doing away with separate code points for uppercase
letters altogether and resolving the issue with a mechanism similar to
combining characters or variation selectors).
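
As a rough illustration of what language-dependent casing means for the Turkish i's, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` are locale-independent, so the Turkish pairing has to be applied separately. The helpers `upper_tr` and `lower_tr` are hypothetical names of my own, not part of any library, and they cover only the four i letters.

```python
# Hypothetical helpers for illustration; they handle only the four Turkish i's.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> ı, İ -> i

def upper_tr(text):
    """Uppercase `text` using the Turkish dotted/dotless i pairing."""
    return text.translate(_TR_UPPER).upper()

def lower_tr(text):
    """Lowercase `text` using the Turkish dotted/dotless i pairing."""
    return text.translate(_TR_LOWER).lower()

# Default (language-neutral) casing merges the two i's into one capital:
assert "i".upper() == "I" and "\u0131".upper() == "I"

# Turkish-aware casing keeps the dotted and dotless letters apart:
assert upper_tr("istanbul") == "\u0130STANBUL"
assert lower_tr("ILIK") == "\u0131l\u0131k"
```

A hypothetical CARON / COMMA ABOVE RIGHT scheme would need the same kind of per-language mapping layered on top of the default case mappings.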

/Szabolcs

[^1]: In fact, in languages where both presentations are equally
acceptable, even the (synchronic) identity is hard to determine: is it a
CEDILLA that can take COMMA form as well, or the other way around?

Szelp, André Szabolcs

+43 (650) 79 22 400

On Wed, Jun 19, 2013 at 2:41 PM, Denis Jacquerye <moyogo_at_gmail.com> wrote:

> On Wed, Jun 19, 2013 at 9:12 AM, Michael Everson <everson_at_evertype.com>
> wrote:
> > On 19 Jun 2013, at 07:54, Denis Jacquerye <moyogo_at_gmail.com> wrote:
> > [...]
> >> How would one rationalize using one diacritic U+0327 with M/m and O/o
> but not with L/l and N/n in Marshallese?
> >
> > The same way one would rationalize using precomposed ãẽĩñõũỹ (aeinouy
> with tilde) but a necessarily de-composed g̃ (g with tilde) in Guaraní.
>
> This is wrong: ãẽĩñõũỹ normalize to use U+0303 in NFD, so they
> canonically use the same tilde as g̃.
> The 4 additional non-decomposable Marshallese characters with
> cedilla would not normalize to use the same cedilla as the other
> Marshallese characters with cedilla. They would not canonically use the
> same cedilla.
>
> > [...]
> >> It would require less new characters to be encoded and would make it
> easier to support in fonts (adding 1 instead of 4).
> >
> > No! Because if you added a single new character you'd have to make sure
> you had good glyph placement with LlMmNnOo which is eight glyphs.
>
> Best practice would be to add diacritical mark placement wherever
> necessary, if not on all possible base characters. M/m and O/o
> would still need it either way, and L/l and N/n would need it for other
> combining diacritics either way.
> A modern font already needs to be able to correctly place combining
> diacritics, including cedilla or ogonek.
> Navajo and other languages need other placement of ogonek than that of
> European languages.
> This does not mean it is justified to encode single precomposed Navajo
> ogonek characters.
> The placement of the cedilla is not semantically different, m̧ with
> the cedilla on the left has the same meaning as if the cedilla were
> centered or on the right, even if just one of the two is correct in
> some contexts like in Marshallese.
> This does not mean it is justified to encode m with left cedilla, m
> with centered cedilla or m with right cedilla.
> An additional single combining diacritic would behave the same way.
>
> On Wed, Jun 19, 2013 at 9:49 AM, Michael Everson <everson_at_evertype.com>
> wrote:
> > On 19 Jun 2013, at 09:04, Denis Jacquerye <moyogo_at_gmail.com> wrote:
> >
> >> Furthermore, the cedilla can also have a proper cedilla form as opposed
> to the Latvian or Livonian comma below form in transliteration systems.
> >
> > This has nothing to do with the Marshallese/Latvian conflict, though.
> >
> >> ALA-LC romanizations use cedilla with r as they do under c or s.
> >
> > Does ŗ contrast with r̦ in ALA-LC romanization?
>
> The same way Marshallese has cedilla letters contrasting with comma
> below letters.
> The only correct form is with cedilla and it doesn't use comma below.
>
> >> BGN/PCGN and UNGEGN romanizations use cedilla with d as they do under
> h, s, t or z.
> >> DIN 1460-2 uses the cedilla under d, k, l, n as it does under c, h, s,
> t and z.
> >
> > If those things are a problem, then solving this problem for Marshallese
> simply does nothing about that problem. But it solves the problem for
> Marshallese.
> >
> >> If the 4 Marshallese cedilla characters are encoded as single
> characters, does this mean the d, k, l, r with proper cedilla in those
> romanizations would also have to be encoded as single characters?
> >
> > No; it doesn't have any implications for that data.
> >
> >> Encoding 1 combining diacritic character is more efficient than
> encoding 12 characters.
> >
> > Do you think that encoding one new COMBINING MARSHALLESE CEDILLA will
> not cause problems both with existing COMBINING COMMA BELOW and COMBINING
> CEDILLA?
>
> About the confusability, it is too late. Comma below, cedilla,
> palatalized hook below, ring half ring below and probably others are
> already confusable. Adding another will increase confusability but not
> to a relevant degree.
> Having 4 single characters will not make anything less confusable
> (using U+0327 with M/m and O/o but not with L/l and N/n is confusing).
> Although it is a solution, it does not solve the general problem of
> the cedilla.
> If we don't want additional confusing characters maybe we should have
> CGJ, ZWJ or ZWNJ + combining cedilla (or any other similar sequence)
> to optionally differentiate the types of cedillas in Latvian,
> Livonian, Marshallese and romanizations.
>
> The issue of the cedilla can easily be solved at a higher level: font
> technologies like OpenType can easily display one set of glyphs in Latvian or
> Livonian and different glyphs for Marshallese.
>
> --
> Denis Moyogo Jacquerye
> African Network for Localisation http://www.africanlocalisation.net/
> Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
> DejaVu fonts --- http://www.dejavu-fonts.org/
>
>
>
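
The normalization argument in the quoted message (that precomposed ãẽĩñõũỹ and the decomposed g̃ canonically share U+0303) can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `uses_combining_tilde` is my own, for illustration only.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE

# Precomposed a, e, i, n, o, u, y with tilde, written as escapes.
precomposed = "\u00e3\u1ebd\u0129\u00f1\u00f5\u0169\u1ef9"

def uses_combining_tilde(ch):
    """True if the NFD form of `ch` is a base letter followed by U+0303."""
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) == 2 and nfd[1] == TILDE

# Every precomposed letter decomposes to base + U+0303 under NFD.
assert all(uses_combining_tilde(ch) for ch in precomposed)

# g with tilde has no precomposed code point, so NFC leaves it as two
# code points, carrying the very same U+0303.
g_tilde = "g" + TILDE
assert unicodedata.normalize("NFC", g_tilde) == g_tilde
```

Newly encoded atomic letters with cedilla would have no such decomposition, which is the asymmetry the quoted message objects to.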
Received on Wed Jun 19 2013 - 08:37:40 CDT
