Re: Taiwan Aboriginal Languages and Unicode support

From: Arne Götje (高盛華) (arne@linux.org.tw)
Date: Mon Dec 25 2006 - 23:46:37 CST


    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Doug Ewell wrote:
    > Arne Götje (高盛華) <arne at linux dot org dot tw> wrote:
    >
    >> 1. instead of the letter 'g', they use the letter 'nġ'. This is a
    >> separate letter and not a ligature. It gets sorted differently in Amis
    >> and Paiwan languages and when type processing, it needs to be handled
    >> as such.
    >>
    >> My idea would be to encode this letter as a separate character, as it
    >> has its own semantic. We can put it probably into one of the existing
    >> Latin Extensions in Unicode.
    >
    > U+006E U+0121
    >
    > or, if both n and ġ are individual letters and can appear together with
    > a different semantic from the one you describe, and if collating tables
    > are tailored to take CGJ into account:
    >
    > U+006E U+034F U+0121
    >
    > See the often-cited examples of "ch" in Spanish and Czech. The fact
    > that two existing characters combine to make a single "letter" in an
    > orthography does not justify encoding the combination as a separate
    > character. Most of the existing examples where this was done in Unicode
    > were to achieve some 1-to-1 convertibility goal in Unicode 1.0, and do
    > not represent a precedent for future encoding.

    No, this is not the same. The letter 'ġ' does not exist in the alphabet,
    but 'nġ' is a separate letter and has to be treated as such. For example,
    when searching for 'n' in a document, it is *not* appropriate for 'nġ'
    to show up as a match.
    Also, when the 'nġ' letter is typed and then deleted, it has to be
    removed as a whole.
    As for sorting: it is *not* appropriate for 'nġ' to be sorted after
    'n'. See the links I posted earlier.

    So, this is clearly *not* a combination of two existing letters, but a
    letter on its own.
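
A minimal Python sketch of the behaviour asked for above, assuming the letter is represented as the sequence U+006E U+0121 rather than as a new code point. The names `letters`, `find_letter` and `sort_key` are hypothetical, and the `ALPHABET` order is a placeholder for illustration only; the real Amis or Paiwan order has to come from the orthography references.

```python
import unicodedata

# The multi-character letter under discussion: U+006E U+0121.
NG = "n\u0121"

# Placeholder alphabet fragment (an assumption, NOT the real Amis or
# Paiwan order). NG is placed before "n" only to show a tailored order.
ALPHABET = ["a", "i", "k", "l", "m", NG, "n", "o", "s", "t", "u"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def letters(text):
    """Split text into orthographic letters, matching the digraph greedily."""
    text = unicodedata.normalize("NFC", text)  # g + U+0307 becomes U+0121
    out = []
    i = 0
    while i < len(text):
        if text.startswith(NG, i):
            out.append(NG)
            i += len(NG)
        else:
            out.append(text[i])
            i += 1
    return out


def find_letter(text, letter):
    """Return the letter positions where `letter` occurs as a whole letter."""
    return [pos for pos, l in enumerate(letters(text)) if l == letter]


def sort_key(word):
    """Collation key: rank of each letter; unknown letters sort last."""
    return [RANK.get(l, len(ALPHABET)) for l in letters(word)]
```

With this, `find_letter("na" + NG + "a", "n")` reports only the first position, and `sorted(words, key=sort_key)` orders words by letter rather than by code point. A production system would more likely express the same thing as a collation tailoring than as hand-written keys.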

    >
    > See also the WG2 "Principles and Procedures" document, Annex G (page 31):
    > http://www.dkuug.dk/JTC1/SC2/WG2/docs/n3002.pdf
    >
    >> 2. With the character 'nġ': in Amis this character, like all others,
    >> can get an acute, grave or circumflex accent. While we can use
    >> combining accent sequences to produce such characters, for the 'nġ'
    >> the dot on the g needs to be replaced, similar to what happens with
    >> the 'i' in European languages.
    >>
    >> I suppose we need to encode a letter 'dotless ng' for this, like we
    >> have with the 'i'.
    >
    > I don't remember if there is a generic way to make a combining mark
    > (such as an acute accent) apply to a group of two base letters (such as
    > n g), but that is the way to solve this problem, not by encoding another
    > precomposed combination.

    Again: they are *not* two base letters but one letter, 'nġ', where the
    dot gets replaced with the accent. It is the same issue as with the 'i'
    in European languages.

    > The analogy with dotless-i is not sound; there were numerous legacy
    > character sets for Turkish that distinguished dotted-i from dotless-i,
    > and Unicode had to maintain 1-to-1 convertibility with those character
    > sets. The same situation does not apply to "ng".
    >
    >> 3. In Amis language the 'i' when it gets its acute, grave or
    >> circumflex accent, it keeps the i-dot in place and the accent gets
    >> stacked on top of the i-dot.
    >> However, fonts handling European scripts will probably take the i-dot
    >> away and replace it with the accent, rather than stacking the accent
    >> on top of it.
    >> Do we need to have a separate encoded 'i' for this different semantic
    >> purpose? Or is there a better way to solve this issue?
    >
    > U+0069 U+0307 U+0301
    > U+0069 U+0307 U+0300
    > U+0069 U+0307 U+0302
    >
    > This is what Lithuanian does, IIRC.
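
The three sequences quoted above can be inspected with Python's standard `unicodedata` module. This is only a sketch; `SEQUENCES` and `describe` are hypothetical names introduced here. It also checks that NFC normalization leaves the explicit U+0307 in place instead of composing the i with the accent.

```python
import unicodedata

# i + COMBINING DOT ABOVE + accent, as quoted in the message.
SEQUENCES = {
    "acute": "\u0069\u0307\u0301",
    "grave": "\u0069\u0307\u0300",
    "circumflex": "\u0069\u0307\u0302",
}


def describe(seq):
    """List the code point and Unicode name of every character in seq."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


for label, seq in SEQUENCES.items():
    # The dot sits between the base letter and the accent, so NFC cannot
    # compose i + accent into a precomposed letter; the dot is preserved.
    unchanged = unicodedata.normalize("NFC", seq) == seq
    print(label, describe(seq), "unchanged by NFC:", unchanged)
```

Whether the accent is actually drawn stacked on top of the dot still depends on the font and the rendering engine.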

    If it has to be done this way, then I propose that all software be
    changed so that when a base glyph has one or more combining
    accents, the whole sequence is treated as *one* character. That is,
    when a combining accent is deleted, the base character and all other
    combining accents belonging to the same sequence get deleted too.
    Otherwise text processing is a PITA. :(
    (The cursor cannot be positioned correctly with the mouse, and it is
    easy to miss a combining accent when deleting another one.)
    And make this rule a compatibility rule! If software does not follow
    this rule, it is not compatible with Unicode! Otherwise I don't know how
    to convince software developers of the importance of this issue. This
    would also be necessary for sorting algorithms: either the accents get
    ignored when sorting (as in Amis), or they are sorted as separate
    character entities (as in Paiwan).
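
The editing and sorting behaviour described above can be sketched in Python as follows. This is a simplification: it groups a base character with every following character that has a non-zero combining class, which is only a rough approximation of the grapheme cluster rules in UAX #29. The names `clusters`, `backspace`, `accent_blind_key` and `accent_aware_key` are hypothetical.

```python
import unicodedata

# The three accents named in the thread: grave, acute, circumflex.
TONE_ACCENTS = {"\u0300", "\u0301", "\u0302"}


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out


def backspace(text):
    """Delete the last base character together with all of its accents."""
    return "".join(clusters(text)[:-1])


def accent_blind_key(word):
    """Sort key that ignores the tone accents but keeps other marks."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_ACCENTS)
    return unicodedata.normalize("NFC", kept)


def accent_aware_key(word):
    """Sort key in which each accent counts as a separate unit."""
    return unicodedata.normalize("NFD", word)
```

Only the three tone accents are stripped in `accent_blind_key`, so the dot of 'ġ' survives; stripping every combining mark would turn 'nġ' into 'ng'. Which key matches Amis or Paiwan practice is an assumption based on the description above, not a verified collation rule.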

    Cheers
    Arne
    - --
    Arne Götje (高盛華) <arne@linux.org.tw>
    PGP/GnuPG key: 1024D/685D1E8C
    Fingerprint: 2056 F6B7 DEA8 B478 311F 1C34 6E9F D06E 685D 1E8C
    Key available at wwwkeys.pgp.net. Encrypted e-mail preferred.

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)
    Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

    iD8DBQFFkLc9bp/QbmhdHowRAn0GAKCnuNQ5QQ/9tlUVqZGai+///4L4mgCff7+M
    9VWybzSuYAsZo04ErUZpkNQ=
    =nCA8
    -----END PGP SIGNATURE-----


