Re: About CGJ (was: Yerushala(y)im - or Biblical Hebrew)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jul 23 2003 - 23:19:09 EDT


    On Thursday, July 24, 2003 1:24 AM, John Hudson <tiro@tiro.com> wrote:

    > At 03:49 PM 7/23/2003, Peter Kirk wrote:
    >
    > > (Yerushala(y)im with CGJ) with different versions of Uniscribe (on
    > > Windows 2000). In each case CGJ is rendered as a square box in each
    > > of several fonts. This behaviour indicates that actually Uniscribe
    > > treats CGJ as a regular paintable character, but it is not
    > > implemented in the specific fonts. So, it seems that if the font
    > > designer makes the very simple changes which John Hudson mentioned,
    > > "ligating" CGJ with the preceding character, the CGJ solution to
    > > the Hebrew problem can be implemented very simply, with no changes
    > > to rendering software and simple changes to fonts.
    > >
    > > So where is the serious problem with this solution? I don't see
    > > one. Nor do the President and the Technical Director of the Unicode
    > > Consortium. Perhaps the only problem was a misunderstanding of the
    > > properties of CGJ, which I hope has now been resolved.
    >
    > That would be nice indeed. I'm going to test this, but will need to
    > add CGJ to my font first. I'll report back in a few days.
    >
    > As Peter Constable noted, though, we need to be sure that the use of
    > CGJ in this context is clearly defined and, most importantly, is not
    > going to conflict with other possible uses. Uniscribe may, in fact,
    > handle the character in a way that works now, but if so we need to
    > confirm that this is intentional and is not going to change.

    There's an interesting case with the precomposed combining character
    U+0344 (COMBINING GREEK DIALYTIKA TONOS). Its canonical
    decomposition is <U+0308, U+0301>, and it is excluded from
    canonical recomposition (so in practice it behaves like a
    *compatibility character* that never appears in any normalized form).
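
    The decomposition and the exclusion from recomposition can be checked
    with a small sketch like the one below. It assumes Python's standard
    unicodedata module, which reflects whatever Unicode version ships with
    the interpreter; the variable names are mine, for illustration only.

```python
import unicodedata

DIALYTIKA_TONOS = "\u0344"  # COMBINING GREEK DIALYTIKA TONOS


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Canonical decomposition mapping as recorded in the character database.
decomposition = unicodedata.decomposition(DIALYTIKA_TONOS)

# Both normalization forms replace U+0344 by the two-mark sequence.
nfd = unicodedata.normalize("NFD", DIALYTIKA_TONOS)
nfc = unicodedata.normalize("NFC", DIALYTIKA_TONOS)

# On a Greek base letter, NFC composes the base with the marks instead.
iota_nfc = unicodedata.normalize("NFC", "\u03b9" + DIALYTIKA_TONOS)

print("decomposition:", decomposition)
print("NFD:", code_points(nfd))
print("NFC:", code_points(nfc))
print("NFC on iota:", code_points(iota_nfc))
```

    Neither form keeps U+0344 itself, which is the behaviour described
    above.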

    However, its canonical decomposition into <COMBINING DIAERESIS,
    COMBINING ACUTE ACCENT>, which both have combining class
    230 (Above), has an impact on renderers: the two marks are supposed
    to stack one above the other, so the ACUTE ACCENT (oxia, tonos) should
    appear *above* the DIAERESIS (dialytika). But usage in Greek (similar
    cases occur with Vietnamese Latin letters carrying two diacritics above)
    shows that they do not stack; instead the two diacritics are really
    combined (the tonos accent is written between the two dots of
    the dialytika).
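
    As a quick check of the combining classes mentioned above, here is a
    minimal sketch, again assuming Python's unicodedata module (the
    dictionary name is hypothetical). It only reports the class values; it
    says nothing about how a given renderer actually places the marks.

```python
import unicodedata

# The two marks from the decomposition of U+0344, plus the macron.
MARKS = ["\u0308", "\u0301", "\u0304"]

# Map each mark's character name to its canonical combining class.
combining_classes = {
    unicodedata.name(mark): unicodedata.combining(mark) for mark in MARKS
}

for name, ccc in combining_classes.items():
    print(f"{name}: {ccc}")
```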

    So this is already a case where diacritics can (and should) ligate by
    default, and where a CGJ might be used to remove (?) this ligature of
    accents and use the vertical stack instead. If this is wrong, then
    how do you combine a macron with a diaeresis? Normally they are
    shown one above the other, and using CGJ might make them appear
    side by side. If CGJ is always to be ignored by renderers, I don't
    understand its role in encoding sequences where the position of
    diacritics is important (for example <acute, CGJ, grave> and <grave,
    CGJ, acute>, which look similar to an "open circumflex" and an "open
    caron", but not to an "open greater-than sign above" and an
    "open less-than sign above").

    More generally, the relative placements of multiple diacritics with the
    same combining class is currently not defined precisely, and I wonder
    if this could cause problems with some languages.
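
    One thing that can be verified is that normalization leaves the
    relative order of same-class marks untouched, so the two orders remain
    distinct encoded sequences. The sketch below assumes Python's
    unicodedata module; the variable names are illustrative only, and the
    visual placement is still up to the font and renderer.

```python
import unicodedata

BASE = "a"
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, class 230
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, class 230

# Canonical reordering only sorts marks whose classes differ, so two
# marks of class 230 keep the order in which they were typed.
acute_first = unicodedata.normalize("NFD", BASE + ACUTE + DIAERESIS)
diaeresis_first = unicodedata.normalize("NFD", BASE + DIAERESIS + ACUTE)

# The two orders are therefore not canonically equivalent.
orders_differ = acute_first != diaeresis_first
print("orders differ after NFD:", orders_differ)
```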

    An interesting case is COMBINING DOUBLE ACUTE ACCENT (U+030B),
    which is not canonically decomposed into a pair of acute accents...
    as if it were needed to remove the assumption that multiple combining
    diacritics above should stack up, since this character makes them appear
    side by side (and even ligated a bit by horizontal kerning in most fonts).
    I wonder what the effect would be of using <ACUTE, CGJ, ACUTE>
    compared with <ACUTE, ACUTE>.
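
    The claim that the double acute is not decomposed into two acutes can
    be checked with a short sketch, assuming Python's unicodedata module
    (variable names are mine). It shows only the normalization behaviour,
    not what any renderer draws.

```python
import unicodedata

ACUTE = "\u0301"         # COMBINING ACUTE ACCENT
DOUBLE_ACUTE = "\u030b"  # COMBINING DOUBLE ACUTE ACCENT

# An empty string means the character has no decomposition mapping.
double_acute_decomposition = unicodedata.decomposition(DOUBLE_ACUTE)

# <o, ACUTE, ACUTE> and <o, DOUBLE ACUTE> normalize to different strings.
two_acutes_nfc = unicodedata.normalize("NFC", "o" + ACUTE + ACUTE)
double_acute_nfc = unicodedata.normalize("NFC", "o" + DOUBLE_ACUTE)

print("decomposition of U+030B:", repr(double_acute_decomposition))
print("same after NFC:", two_acutes_nfc == double_acute_nfc)
```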

    If correct placement of diacritics must be specified, could we use the
    ideographic description characters to create those combining
    sequences with a more descriptive composition rule? I know it seems
    tricky but the current handling of Greek and Vietnamese requires some
    compromises in the way some combining characters are precomposed
    before being placed on a base letter according to its combining class.

    So what is the current use of the CGJ character, and why was it
    introduced? If the character is to be ignored in searches (so that sequences
    with and without it are treated as equal, readers being unable to tell them
    apart), I don't see the point of using a CGJ. I think it was introduced explicitly
    to override the default placement of combining characters according to their
    standard combining class, and so to make a distinction visible to the reader,
    in which case it should not be handled as fully ignorable. What, then, is this distinction?
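
    For reference, the basic properties of CGJ can be listed with a sketch
    like the following, assuming Python's unicodedata module (names are
    illustrative). It only shows what the character database records and
    that normalization keeps the character in place; it does not settle
    what a renderer or a search should do with it.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

cgj_name = unicodedata.name(CGJ)
cgj_category = unicodedata.category(CGJ)  # general category
cgj_class = unicodedata.combining(CGJ)    # canonical combining class

# <a, acute, CGJ, grave>: check that no normalization form drops the CGJ.
with_cgj = "a\u0301" + CGJ + "\u0300"
cgj_survives = all(
    CGJ in unicodedata.normalize(form, with_cgj)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(cgj_name, cgj_category, cgj_class, cgj_survives)
```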

    -- 
    Philippe.
    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

