Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 23 2003 - 18:07:12 EDT

    Peter Kirk cited Paul Nelson:

    > On 23/07/2003 03:20, Paul Nelson (TYPOGRAPHY) wrote:
    >
    > >Please look at the definition of CGJ and other such characters.
    > >Understand the differences between CGJ and ZWJ/ZWNJ.
    > >
    > >This discussion is very disturbing to me because after reading through
    > >the L2 document register it is unclear what is the difference between
    > >CGJ and ZWJ use.

    Things will get easier shortly when the full (final!) text of Unicode
    4.0 is posted online. The relevant discussion is in Section 15.2
    Layout Controls. Some excerpts:

    ===================================================================

    U+200D ZERO WIDTH JOINER is intended to produce a more connected
    rendering of adjacent characters than would otherwise be the case,
    if possible. ...

    U+200C ZERO WIDTH NON-JOINER is intended to break both cursive
    connections and ligatures in rendering. ...

                                           -- TUS 4.0, p. 390
                                           
    U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent
    characters are to be treated as a unit for the purposes of
    language-sensitive collation and searching. In language-sensitive
    collation and searching, the combining grapheme joiner should be
    ignored unless it specifically occurs within a tailored collation
    element mapping. ...

    For rendering, the combining grapheme joiner is invisible.
    However, some older implementations may treat a sequence of grapheme
    clusters linked by combining grapheme joiners as a single unit
    for the application of enclosing combining marks. ...

    The combining grapheme joiner must not be confused with the
    zero width joiner or the word joiner, which have very different
    functions. In particular, inserting a combining grapheme joiner
    between two characters should have no effect on their ligation or
    cursive joining behavior. ...

                                          -- TUS 4.0, p. 392
                                          
    ====================================================================

    > >The fact that you desire a control character to not be treated as such
    > >greatly concerns me.

    As Mark Davis pointed out, CGJ is *not* a control character, if
    by control character is meant gc=Cc (the ISO control characters)
    or gc=Cf (the Unicode format control characters). Its general
    category is Mn (with cc=0), which makes it formally a *combining mark*,
    not a control character.
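
    These property values can be checked directly. The following is a
    minimal sketch, assuming Python's standard unicodedata module; the
    helper name describe() is purely illustrative:

```python
import unicodedata

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

# CGJ reports gc=Mn (a combining mark) with cc=0,
# while ZWJ and ZWNJ report gc=Cf (format controls).
for ch in (CGJ, ZWJ, ZWNJ):
    print(describe(ch))
```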

    > >This really feels like people are trying to figure
    > >out any way to twist existing constructs to avoid fixing the
    > >normalization weights. I am alarmed by the implications of putting
    > >control characters in place to somehow subvert the normalization.

    There is no "subversion" of normalization involved here. Normalization
    continues to work just as it always has, with no changes. There is
    also no cause for alarm.

    I have been talking about CGJ because someone initially had
    suggested some kind of control character to adjust normalization
    or modify combining classes (which *would* be alarming and perverse),
    and then we cast around to figure out what would happen if
    any of the existing format control characters (such as ZWJ or ZWNJ)
    was inserted into these Hebrew vowel sequences.

    As it turns out, CGJ is just the ticket, because:

      A. It is not a format control character, but a combining mark.
      
      B. It is defined *not* to influence the format of neighboring
         characters.
         
      C. It is, itself, invisible.
      
      D. It is already in the standard (since Unicode 3.2).
      
      E. It is defined, by default, to be ignored in searches --
         since it becomes significant in collation/searching only
         when tailored in combinations with other characters.
         
      F. Its combining class is zero.
         
      G. And most importantly, when inserted between two Hebrew
         points in a sequence, it has precisely the required
         effects for normalized Hebrew text, enabling the preservation
         of point ordering distinctions in normalized contexts.
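
    Point G can be illustrated with a short sketch, assuming Python's
    standard unicodedata module and taking <lamed, patah, hiriq> as the
    test sequence. The names nfc, plain and with_cgj are illustrative
    only. Patah has combining class 17 and hiriq has 14, so canonical
    reordering swaps them unless a character with combining class 0
    sits between them:

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0

def nfc(s):
    """Normalize a string to Normalization Form C."""
    return unicodedata.normalize("NFC", s)

plain = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, normalization sorts the two points by combining class,
# so the original <patah, hiriq> order is lost.
print(nfc(plain) == LAMED + HIRIQ + PATAH)

# With CGJ between the points, neither point is adjacent to another
# nonzero-class mark, so nothing is reordered and the order survives.
print(nfc(with_cgj) == with_cgj)
```

    Normalization itself is untouched here: the ordinary canonical
    ordering rules are what keep the two points apart.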
         
     
    > As for the details of CGJ, please tell me where I can find a detailed
    > definition, and where it is specifically stated that a *rendering
    > engine* is obliged to process this *internally* as a control character -
    > and what precisely it is supposed to do with it if it does.

    There is no such obligation on a rendering engine.

    And if the implementers of rendering engines will simply "paint"
    instances of U+034F so that they become available to the font
    side of the rendering equation, then it should be relatively
    simple, as for the Biblical Hebrew point sequence cases, to
    get the <lamed, patah, CGJ, hiriq> sequences to display properly.

    > I am now
    > wondering if anyone understands what this character is supposed to be or
    > do. If this is not clearly defined anywhere, perhaps UTC needs to write
    > a clear definition. At least Ken Whistler seems to think that it is
    > appropriate for this use.

    Yes, I do -- as does Mark Davis.

    > Meanwhile, if despite this CGJ is not in fact
    > appropriate for this function, maybe we should propose a new character
    > which does have the appropriate properties.

    CGJ *does* have the appropriate properties. So proposing a new
    character would simply postpone resolution of the problem for
    Biblical Hebrew.

    --Ken


