Re: Unicode 5.0 decompositions of Balinese vowel signs with tedung

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 14 2006 - 17:36:20 CST


    As an alternative, maybe U+1B0E could be given a (<font>?) compatibility decomposition (but this would be strange: a compatibility decomposition normally implies that there is another standard with which Unicode wants to maintain backward compatibility).

    Anyway, the current diffs and the main UCD properties file in the online repository still do not indicate these errors.

    Note that the PDF draft charts are from Unicode 5.0 BETA1 (delta 1). I don't know (I did not look deeply for that) if this causes issues with the current BETA2 delta.

    I'm sure that these will be fixed as soon as the UTC has discussed this issue and agreed on the necessary changes.

    ---
    I've just changed the status I had for the "missing" scripts that are still in the FPDAM 3 ballot at ISO/IEC JTC1/SC2/WG2, because only a few of them were finally considered ready and approved for inclusion in Unicode 5.0. I'll try to look more deeply into the current status at ISO next time (I had grouped all of them as if they were part of Unicode 5.0). Sorry for the inconvenience.
    ---
    The most difficult to "validate" is the new Cuneiform script. It's hard to see if the properties are all correct. But I am also surprised to see that there's no combining diacritic, and that all these characters are considered plain (non-decomposable) letters.
    ---
    Regarding N'ko (U+07C0..U+07FF), I can't see any obvious mismatch for now between the normative properties (gc=Mn, cc=230 (above) or 220 (below), bidi=NSM) and the representative glyphs from the draft delta1 PDF charts. The combining diacritics at 07EB..07F3 are given the appropriate properties: the draft charts contain the dotted circle indication. But I wonder why most of these diacritics don't have canonical equivalence with the general diacritics also used in other languages (for example in Latin/Pinyin transcriptions of Chinese) to denote tone marks: macron, tilde, dot above, circumflex, right-pointing arrowhead, reversed tilde, left-pointing arrowhead, dot below, diaeresis.
    I know that it may be convenient for fonts to handle these diacritics in the same group, without running into trouble with other Latin fonts that have the diacritics but not the N'ko letters, so giving them a separate encoding seems reasonable (notably because they are tone marks and not letter modifiers as in Latin, except U+07F2, which marks nasalization, and U+07F3, whose name does not suggest anything other than just two dots above).
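The property check described above for U+07EB..U+07F3 can be reproduced with a minimal sketch like the following. It assumes Python's standard `unicodedata` module, which reports whatever Unicode version is bundled with the interpreter, not the 5.0 beta files discussed here; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(cp):
    """Return the properties discussed above for one code point."""
    ch = chr(cp)
    return {
        "code": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "<unassigned>"),
        "gc": unicodedata.category(ch),            # General_Category
        "ccc": unicodedata.combining(ch),          # Canonical_Combining_Class
        "bidi": unicodedata.bidirectional(ch),     # Bidi_Class
        "decomposition": unicodedata.decomposition(ch),
    }

# The N'ko combining marks mentioned in the message.
rows = [describe(cp) for cp in range(0x07EB, 0x07F4)]
for row in rows:
    print(row["code"], row["gc"], row["ccc"], row["bidi"],
          repr(row["decomposition"]), row["name"], sep="\t")
```

An empty decomposition column means the bundled database defines no canonical or compatibility mapping for that mark.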
    Regarding the N'ko low and high tone apostrophes (U+07F4..U+07F5), given that they don't combine, I wonder how they differ from the normal curly apostrophes encoded in the General Punctuation block. To me they look like new confusables, and I wonder why they don't have canonical equivalences (or at least compatibility decompositions). For IDN, however, given that apostrophes are considered punctuation and not tone marks, they are ignorable, so this may cause trouble in N'ko if such tone marks (important for the corresponding African languages, where tone is fundamental) are forbidden in domain names. With the assigned properties (gc=Lm, bidi=R) they look like other right-to-left letters.
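To compare the two N'ko tone apostrophes with the curly quotes from General Punctuation, a small sketch along these lines may help. It again relies on Python's `unicodedata` module and therefore on the Unicode version shipped with the interpreter; `properties` is an illustrative helper, not part of any library.

```python
import unicodedata

def properties(cp):
    """General category, bidi class and decomposition of one code point."""
    ch = chr(cp)
    return {
        "gc": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),
    }

# N'ko tone apostrophes versus the single curly quotation marks.
for cp in (0x07F4, 0x07F5, 0x2018, 0x2019):
    ch = chr(cp)
    # If NFKC leaves the character unchanged, no compatibility mapping
    # folds it into another character.
    stable = unicodedata.normalize("NFKC", ch) == ch
    print(f"U+{cp:04X}", properties(cp), "NFKC-stable:", stable)
```

Different general categories and bidi classes with no decomposition on either side would mean that normalization alone cannot fold these look-alikes together, which is why the confusables question arises.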
    Maybe in the future we'll see international Latin domain names using those N'ko apostrophe letters as a palliative for the general apostrophes that are currently excluded from IDN applications. I wonder, though, whether this will work reliably with Latin/Greek/Cyrillic, given that these letters have strong right-to-left directionality. If BiDi is implemented correctly, however, these apostrophes will appear next to other Latin/Greek/Cyrillic letters with strong directionality, so this won't be an issue for reordering, and those new code points will be good alternatives for displaying apostrophes in the address bar of browsers after a domain name redirect, especially for Hebrew and Arabic domain names, which currently have no good way to use apostrophes, given that the existing ones are mirrored punctuation and excluded from names. They may (for example) be used in Hebrew and Arabic to denote abbreviated names, such as trademarks that mix Semitic letters and regular digits with an apostrophe mark between these parts, instead of an ugly hyphen or no sign at all in the domain name.
    Are there guidelines sent to IDN-compatible registry maintainers regarding these two new N'ko letters? Have the new N'ko letters been considered for addition to the Unicode informative list of confusables (notably the letter at U+07D7 and the punctuation at U+07F6, which look like European digits)? (This question is not immediately critical, as the list can be updated later, even after the Unicode 5.0 release, given that IDN registries do not automatically include all new Unicode letters in their acceptable subset before investigating each new character.)
    Philippe.
    ----- Original Message ----- 
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > You have excluded this one in your list:
    > 
    > * U+1B0E = <U+1B0D ; U+1B35>
    >  BALINESE LETTER LA LENGA TEDUNG (vocalic ll) =
    >  BALINESE LETTER LA LENGA (vocalic l) +
    >  BALINESE VOWEL SIGN TEDUNG (aa)
    > 
    > My opinion is that this is part of the set, even if the tedung takes a contextual ligatured form.
    > One could still want the non-ligatured form by explicitly coding <U+1B0E,ZWNJ,U+1B35>. It would still be read as a LA LENGA with TEDUNG, even if the TEDUNG is not ligatured.
    > 
    > The charts show the ligatured form of this tedung and so suggest that this is the preferred form, but that does not change the fact that this is a ligature and no different from a LA LENGA and a normal TEDUNG joined on the right.
    > 
    > And the Balinese name of U+1B0E is clear: it gives the interpretation for native Balinese readers, and they may be confused by the fact that, without this canonical decomposition, the "LETTER LA LENGA TEDUNG" (vocalic long l) will be considered different from "LETTER LA LENGA" (vocalic l) followed by a "VOWEL SIGN TEDUNG" (long vowel mark). Are there reasons to keep these two sequences distinct?
    > 
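Whether two sequences are treated as canonically equivalent can be tested by comparing their NFD forms. The sketch below applies that test to U+1B0E and <U+1B0D, U+1B35>. It assumes Python's `unicodedata` module, so the answer reflects the Unicode version bundled with the interpreter rather than the beta data under review; `canonically_equivalent` is a hypothetical helper name.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u1B0E"       # BALINESE LETTER LA LENGA TEDUNG
sequence = "\u1B0D\u1B35"    # LA LENGA followed by VOWEL SIGN TEDUNG

print("Unicode data version:", unicodedata.unidata_version)
print("canonically equivalent:", canonically_equivalent(precomposed, sequence))
```

If the bundled database carries no canonical decomposition for U+1B0E, the comparison returns False and the two spellings remain distinct for searching, sorting and identifier matching.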
    > ----- Original Message ----- 
    > From: "Peter Constable" <petercon@microsoft.com>
    >> Philippe has found a bug: the minutes of Mtg 103 make clear (103-C10) that the properties for Balinese characters were to be as specified in L2/05-090, and those have canonical decompositions for these multi-part vowels.
    >> I've checked the UnicodeData.txt properties for Balinese, and the decomposition mappings are the only ones with errors. Here are the corrected entries for the affected characters:
    >> 
    >> 1B06;BALINESE LETTER AKARA TEDUNG;Lo;0;L;1B05 1B35;;;;N;;aa;;;
    >> 1B08;BALINESE LETTER IKARA TEDUNG;Lo;0;L;1B07 1B35;;;;N;;ii;;;
    >> 1B0A;BALINESE LETTER UKARA TEDUNG;Lo;0;L;1B09 1B35;;;;N;;uu;;;
    >> 1B0C;BALINESE LETTER RA REPA TEDUNG;Lo;0;L;1B0B 1B35;;;;N;;vocalic rr;;;
    >> 1B12;BALINESE LETTER OKARA TEDUNG;Lo;0;L;1B11 1B35;;;;N;;au;;;
    >> 1B3B;BALINESE VOWEL SIGN RA REPA TEDUNG;Mc;0;L;1B3A 1B35;;;;N;;vocalic rr;;;
    >> 1B3D;BALINESE VOWEL SIGN LA LENGA TEDUNG;Mc;0;L;1B3C 1B35;;;;N;;vocalic ll;;;
    >> 1B40;BALINESE VOWEL SIGN TALING TEDUNG;Mc;0;L;1B3E 1B35;;;;N;;o;;;
    >> 1B41;BALINESE VOWEL SIGN TALING REPA TEDUNG;Mc;0;L;1B3F 1B35;;;;N;;au;;;
    >> 1B43;BALINESE VOWEL SIGN PEPET TEDUNG;Mc;0;L;1B42 1B35;;;;N;;;;;
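The entries above follow the semicolon-separated, 15-field layout of UnicodeData.txt, with the decomposition mapping in the sixth field. A minimal sketch for pulling that mapping out of such a line, and comparing it with what a local Python reports, could look like this. The function name `parse_entry` and the returned dictionary keys are illustrative choices, and the comparison depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def parse_entry(line):
    """Parse one UnicodeData.txt-style line into a small dict."""
    # Drop any email quote prefix ("> " or ">> ") and trailing whitespace.
    fields = line.lstrip("> ").strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    tokens = fields[5].split()
    # A leading <tag> marks a compatibility decomposition; none means canonical.
    tag = tokens.pop(0) if tokens and tokens[0].startswith("<") else None
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "gc": fields[2],
        "ccc": int(fields[3]),
        "bidi": fields[4],
        "tag": tag,
        "decomposition": [int(t, 16) for t in tokens],
    }

entry = parse_entry(
    ">> 1B06;BALINESE LETTER AKARA TEDUNG;Lo;0;L;1B05 1B35;;;;N;;aa;;;"
)
# Decomposition as reported by the locally bundled Unicode database.
local = unicodedata.decomposition(chr(entry["code"]))
print(f"U+{entry['code']:04X}", entry["decomposition"], "local:", repr(local))
```

Running the same loop over all ten lines gives a quick way to spot entries whose listed mapping differs from the local database.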
    

