Re: Combining Overstruck diacritics

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 29 2007 - 16:49:18 CDT


    Arne asked, and Jukka responded:

    > > Is it appropriate to use
    > > <i><U+0336>
    > > <I><U+0336>
    > > <l><U+0336>
    > > <L><U+0336>
    > > <u><U+0336>
    > > <U><U+0336>
    > > in an alphabet?
    >
    > If you are designing a new alphabet, it is up to you to choose the
    > characters. Different choices have different implications. In particular,
    > dynamic composition is still problematic (if supported at all) in many
    > programs.
    >
    > > or should the precomposed ones (U+0268, U+0197, U+019A,
    > > U+023D, U+0289, U+0244) be used instead?
    >
    > They are _not_ precomposed characters, and there is no defined
    > relationship (within Unicode) between them and the sequences you
    > mentioned.

    This is correct. And in this case, particularly if you
    are attempting to standardize an existing orthography or to create
    a new one, I would strongly recommend using the six "preformed" (not
    precomposed) letters for barred i's, barred l's, and barred u's,
    rather than attempting to use the overstruck diacritics. Those
    are far more likely to be supported well, and will also have the
    advantage of being interoperable with similar specialized
    orthographies using the same letters.

    Among other things, generic search tools looking for barred i's,
    barred l's, etc., are more likely to give you exact matches if
    you use those exact letters, than if you depend on them making
    equivalences for base letters + overstruck diacritic sequences.
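
    The matching problem can be checked directly. The following is a
    minimal sketch in Python using the standard unicodedata module; the
    language choice and the helper name `equivalent` are assumptions made
    for illustration, not anything proposed in the thread. It shows that
    the six letters carry no decomposition mapping, so no normalization
    form will equate them with a base letter + overlay sequence.

```python
import unicodedata

# The six "preformed" barred letters named in the question.
PREFORMED = ["\u0268", "\u0197", "\u019A", "\u023D", "\u0289", "\u0244"]

for ch in PREFORMED:
    # decomposition() returns an empty string when no mapping is defined.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={unicodedata.decomposition(ch)!r}")


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Neither overlay sequence is equivalent to U+0268 under any of the
# four standard normalization forms.
for mark in ("\u0335", "\u0336"):
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(f"i + U+{ord(mark):04X} vs U+0268 under {form}:",
              equivalent("i" + mark, "\u0268", form))
```

    A search tool that wants such sequences to match the barred letters
    therefore has to supply its own folding rules; normalization alone
    will not do it.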

    One side note: although it may not be obvious from the text
    of the standard, the intent has been for U+0335 COMBINING SHORT STROKE
    OVERLAY to represent the "bar" of letters like barred i's,
    barred l's, barred u's, barred o's, and such, cited in the
    abstract. If you expect matching between existing barred letters
    (e.g. U+0180 LATIN SMALL LETTER B WITH STROKE, U+0268 LATIN SMALL
    LETTER I WITH STROKE, etc.) and sequences of base letters
    with an overstruck diacritic, then U+0335 is the diacritic
    of choice. This, despite the fact that for letters (such as u)
    with two (or more) vertical segments to be crossed, the bar is
    typically extended to make it longer than the shortish stroke
    of the bars for i's, l's, etc.

    U+0336 COMBINING LONG STROKE OVERLAY was intended more as the
    poor man's strikethrough diacritic, comparable to U+0305 COMBINING
    OVERLINE (instead of styled overlining) and U+0332 COMBINING
    LOW LINE (instead of styled underlining). Those are all
    wide diacritics that -- in principle at least -- connect to
    neighboring instances of the same diacritics if used in sequences.

    They contrast -- again by design, although not always in practice
    in all fonts -- with the short stroke diacritics:

    U+0304 COMBINING MACRON
    U+0335 COMBINING SHORT STROKE OVERLAY
    U+0331 COMBINING MACRON BELOW

    which do *not* connect from letter to letter.
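
    To illustrate the "poor man's strikethrough" usage, here is a small
    Python sketch. The function name `strike` and the rule of skipping
    existing combining marks are assumptions made for this example;
    whether the strokes actually join up on screen depends entirely on
    the font and renderer.

```python
import unicodedata

# Wide marks, intended to connect across neighboring characters.
WIDE = ["\u0305", "\u0336", "\u0332"]
# Short marks, intended to stay within one letter.
SHORT = ["\u0304", "\u0335", "\u0331"]

for mark in WIDE + SHORT:
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}")


def strike(text: str, mark: str = "\u0336") -> str:
    """Hypothetical helper: add an overlay mark after each non-mark character."""
    out = []
    for ch in text:
        out.append(ch)
        # General categories Mn, Mc, Me are marks; do not stack the
        # overlay on top of a mark that is already there.
        if not unicodedata.category(ch).startswith("M"):
            out.append(mark)
    return "".join(out)


print(strike("deleted"))
```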

    >
    > Unicode does not analyze and decompose letters with a stroke as containing
    > a diacritic mark. Instead, they are coded as separate characters. (I've
    > never seen an explanation for this, but it's certainly too late to change
    > such issues, and the decision is understandable if you consider how the
    > "stroke" in letters varies in shape.)

    That is exactly the issue. The combining diacritics have been
    encoded primarily by shape, but there is a gray area where the
    diacritics crossing letter forms interact in complex ways with
    individual letter shapes -- to the point where it becomes very
    difficult to pick out the exact shapes of diacritics as individual
    parts. The ambiguity about the location and length of bars
    when used as diacritics for letters is a case in point. Also
    look at the varied placements *and* widths of the overstruck
    diacritic tildes: see U+1D6C .. U+1D76 for recently encoded
    examples. If you use a diacritic tilde overlay on an m, you have
    to make it substantially different from one used on a b or d,
    for example.

    There is no absolute line you can draw between diacritics that are
    clearly separately encodable, such as acute and grave, and ones which
    are not. Note that in addition to bars, tildes, and other
    overlays, Latin letters have had a large number of other
    shape extensions and distortions used as diacritic modifications:
    hooks (of all shapes and positions), curls, swashes, ligations,
    Cyrillic-style angular bars, etc., but also inversions, reversals,
    and turnings (including 90 degree turnings as well as 180 degree
    turnings).

    The Unicode Standard had to draw the line someplace. A combining
    acute diacritic had to be encoded. A combining letter reversal
    diacritic could not be encoded. And the line was drawn between
    diacritics that typically touch an existing letter form but
    don't otherwise modify its shape (cedilla, ogonek, Vietnamese
    hook) and diacritics that completely overlay a letter form and
    which themselves take unpredictable shapes (bars, tildes, etc.).

    Some of the latter kinds of diacritics are separately encoded
    as combining marks, in part because it is useful to be able
    to talk *about* them in text or to use them in ad hoc
    combinations on occasion, but none of them are actually
    used in canonical decomposition mappings for precomposed
    letters.
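
    That claim can be checked against the character database that ships
    with Python. The sketch below is an illustration only: the function
    name `canonical_users` is hypothetical, and the result reflects
    whatever Unicode version the running Python was built with. It scans
    every code point for canonical decomposition mappings that mention a
    given mark, using U+0331 as a contrasting case.

```python
import sys
import unicodedata


def canonical_users(marks):
    """Map each mark (as uppercase hex) to the code points whose
    canonical decomposition mapping contains it."""
    found = {m: [] for m in marks}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no mapping, and compatibility mappings,
        # which start with a tag such as "<compat>".
        if not decomp or decomp.startswith("<"):
            continue
        for part in decomp.split():
            if part in found:
                found[part].append(cp)
    return found


# Tilde overlay, short stroke overlay, long stroke overlay, macron below.
users = canonical_users(["0334", "0335", "0336", "0331"])
for mark, cps in users.items():
    print(f"U+{mark}: used in {len(cps)} canonical decomposition(s)")
```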

    >
    > > Same applies to the LINE BELOW (U+0331 or U+0332?)

    Not U+0332. See above.

    >
    > No, that's a different issue, because there are precomposed characters with
    > those characters as components.
    >
    > > Should <d><D><l><L><r><R><t><T> with line below be used as combined
    > > diacritics, or as precomposed codepoints?

    I'd advise use of the precomposed code points, since they exist.
     
    > It depends. You need to consider the different factors. Unicode just tells
    > you that there is canonical equivalence and that there are various
    > normalization forms. On the practical side, depending on implementations
    > and not on the Unicode standard, the precomposed form (when available)
    > is better supported by software and results in better rendering. But there
    > are many factors that might make the decomposed form more feasible.

    One of the advantages of using the precomposed forms for these letters
    is that they will be in NFC automatically, which means no change
    for data in this orthography when going in and out of systems
    or protocols which normalize data to NFC.
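
    A short Python sketch of that behavior, offered as an illustration
    rather than as part of the recommendation: the eight base letters
    from the question are combined with U+0331 and passed through NFC.
    U+0332 is included to show that it takes no part in any composition.

```python
import unicodedata

MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW
LOW_LINE = "\u0332"      # COMBINING LOW LINE

for base in "dDlLrRtT":
    seq = base + MACRON_BELOW
    nfc = unicodedata.normalize("NFC", seq)
    # Each sequence composes to a single precomposed letter, and that
    # letter decomposes back to the same sequence under NFD.
    print(f"{base} + U+0331 -> U+{ord(nfc):04X} {unicodedata.name(nfc)}")
    assert unicodedata.normalize("NFD", nfc) == seq

# Data already stored as the precomposed letter is unchanged by NFC.
precomposed = "\u1E0F"
print(unicodedata.normalize("NFC", precomposed) == precomposed)

# d + U+0332 has no precomposed counterpart and stays a two-character
# sequence under NFC.
low_line_seq = unicodedata.normalize("NFC", "d" + LOW_LINE)
print(len(low_line_seq))
```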

    Even if you need to make use of combining marks, as in your d and
    t with acute, it might make sense to run an evaluation of your
    orthographic recommendations, to see if they result in an orthography
    which is stable under Normalization Form C. (Of course you can't
    guarantee that for all text, because people could insert arbitrary
    combinations of other characters into the text -- but I'm talking
    about the orthography itself, by design.) Being in NFC from the
    start may give the orthography better behavior in some processing
    contexts.
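
    Such an evaluation might be sketched as follows in Python. The
    inventory shown is purely hypothetical (it is not the orthography
    under discussion), and the function name `not_nfc_stable` is invented
    for this example. Each proposed spelling of a letter is compared with
    its own NFC form, and any that differ are reported.

```python
import unicodedata


def not_nfc_stable(inventory):
    """Return the spellings that NFC normalization would change."""
    return [s for s in inventory
            if unicodedata.normalize("NFC", s) != s]


# Hypothetical letter inventory, one string per orthographic letter.
inventory = [
    "\u1E0F",   # d with line below, precomposed
    "d\u0331",  # d + COMBINING MACRON BELOW (composes under NFC)
    "d\u0301",  # d + COMBINING ACUTE ACCENT (no precomposed form)
    "t\u0301",  # t + COMBINING ACUTE ACCENT (no precomposed form)
    "\u0268",   # i with stroke, preformed
]

for s in not_nfc_stable(inventory):
    codepoints = " ".join(f"U+{ord(c):04X}" for c in s)
    print("changes under NFC:", codepoints)
```

    In this made-up inventory only the d + U+0331 spelling is flagged;
    the sequences with acute are already in NFC because no precomposed
    letters exist for them.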

    --Ken

    >
    > > I'm asking, because I need to use <d><D><t><T> with <U+0301> anyways to
    > > get the desired glyph...


