From: Doug Ewell (dewell@adelphia.net)
Date: Mon Dec 25 2006 - 22:10:19 CST
Arne Götje (高盛華) <arne at linux dot org dot tw> wrote:
> 1. instead of the letter 'g', they use the letter 'nġ'. This is a
> separate letter and not a ligature. It gets sorted differently in Amis
> and Paiwan languages and when type processing, it needs to be handled
> as such.
>
> My idea would be to encode this letter as a separate character, as it
> has its own semantic. We can put it probably into one of the existing
> Latin Extensions in Unicode.
U+006E U+0121
or, if both n and ġ are individual letters and can appear together with
a different semantic from the one you describe, and if collating tables
are tailored to take CGJ into account:
U+006E U+034F U+0121
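As a rough illustration only, the two sequences above can be inspected with Python's standard unicodedata module. The names PLAIN, WITH_CGJ, describe and survives_normalization are made up for this sketch. It shows only that the two spellings are distinct code point sequences and that NFC normalization leaves both unchanged; it does not show collation, which would still depend on a tailored table.

```python
import unicodedata

# Hypothetical names for the two sequences quoted in the message.
PLAIN = "\u006E\u0121"            # n + g with dot above
WITH_CGJ = "\u006E\u034F\u0121"   # n + COMBINING GRAPHEME JOINER + g with dot above


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def survives_normalization(text):
    """Return True if NFC normalization leaves the sequence unchanged."""
    return unicodedata.normalize("NFC", text) == text


for label, seq in (("plain", PLAIN), ("with CGJ", WITH_CGJ)):
    print(label, describe(seq), survives_normalization(seq))
```

Because CGJ is kept by normalization, a tailored collation table could in principle treat the two spellings differently, which is the condition stated above.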
See the often-cited examples of "ch" in Spanish and Czech. The fact
that two existing characters combine to make a single "letter" in an
orthography does not justify encoding the combination as a separate
character. Most of the existing examples where this was done in Unicode
were to achieve some 1-to-1 convertibility goal in Unicode 1.0, and do
not represent a precedent for future encoding.
See also the WG2 "Principles and Procedures" document, Annex G (page
31):
http://www.dkuug.dk/JTC1/SC2/WG2/docs/n3002.pdf
> 2. With the character 'nġ': in Amis this character, like all others,
> can get an acute, grave or circumflex accent. While we can use
> combining accent sequences to produce such characters, for the 'nġ'
> the dot on the g needs to be replaced, similar like it does on the 'i'
> in European languages.
>
> I suppose we need to encode a letter 'dotless ng' for this, like we
> have with the 'i'.
I don't remember if there is a generic way to make a combining mark
(such as an acute accent) apply to a group of two base letters (such as
n g), but that is the way to solve this problem, not by encoding another
precomposed combination.
The analogy with dotless-i is not sound; there were numerous legacy
character sets for Turkish that distinguished dotted-i from dotless-i,
and Unicode had to maintain 1-to-1 convertibility with those character
sets. The same situation does not apply to "ng".
> 3. In Amis language the 'i' when it gets its acute, grave or
> circumflex accent, it keeps the i-dot in place and the accent gets
> stacked on top of the i-dot.
> However, fonts handling European scripts will probably take the i-dot
> away and replace it with the accent, rather than stacking the accent
> on top of it.
> Do we need to have a separate encoded 'i' for this different semantic
> purpose? Or is there a better way to solve this issue?
U+0069 U+0307 U+0301
U+0069 U+0307 U+0300
U+0069 U+0307 U+0302
This is what Lithuanian does, IIRC.
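A small Python sketch of the three sequences above, again using only the standard unicodedata module. The names DOT_ABOVE, ACCENTS, dotted_i_with and nfc are invented for this example. It checks that NFC normalization keeps the explicit U+0307 in place, whereas a plain i followed by an acute composes to U+00ED. Whether the dot and the accent are actually drawn stacked still depends on the font and rendering engine.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
ACCENTS = {
    "acute": "\u0301",
    "grave": "\u0300",
    "circumflex": "\u0302",
}


def dotted_i_with(accent):
    """Build i + explicit dot above + the named accent (hypothetical helper)."""
    return "\u0069" + DOT_ABOVE + ACCENTS[accent]


def nfc(text):
    """Normalize to NFC."""
    return unicodedata.normalize("NFC", text)


for name in ACCENTS:
    seq = dotted_i_with(name)
    # The explicit dot keeps the accent from composing with the base i.
    print(name, [f"U+{ord(ch):04X}" for ch in nfc(seq)])

# Without the explicit dot, i + acute composes to a single code point.
print([f"U+{ord(ch):04X}" for ch in nfc("\u0069\u0301")])
```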
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages