Re: Combining Overstruck diacritics

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 29 2007 - 16:49:18 CDT

Next message: Kenneth Whistler: "RE: Geographical language data"

Previous message: Martin J. Heijdra: "RE: Geographical language data"
Maybe in reply to: Arne Götje (高盛華): "Combining Overstruck diacritics"
Next in thread: John Hudson: "Re: Combining Overstruck diacritics"
Reply: John Hudson: "Re: Combining Overstruck diacritics"

Arne asked, and Jukka responded:

> > Is it appropriate to use
> > <i><U+0336>
> > <I><U+0336>
> > <l><U+0336>
> > <L><U+0336>
> > <u><U+0336>
> > <U><U+0336>
> > in an alphabet
>
> If you are designing a new alphabet, it is up to you to choose the
> characters. Different choices have different implications. In particular,
> dynamic composition is still problematic (if supported at all) in many
> programs.
>
> > or should the precomposed ones (U+0268, U+0197, U+019A,
> > U+023D, U+0289, U+0244) be used instead?
>
> They are _not_ precomposed characters, and there is no defined
> relationship (within Unicode) between them and the sequences you
> mentioned.

This is correct. In this case, especially if you are standardizing
an existing orthography or creating a new one, I would strongly
recommend using the six "preformed" (not precomposed) letters for
barred i's, barred l's, and barred u's, rather than attempting to
use the overstruck diacritics. Those
are far more likely to be supported well, and will also have the
advantage of being interoperable with similar specialized
orthographies using the same letters.

Among other things, generic search tools looking for barred i's,
barred l's, etc., are more likely to give you exact matches if
you use those exact letters, than if you depend on them making
equivalences for base letters + overstruck diacritic sequences.
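
As a rough illustration of the matching point, the sketch below uses
Python's standard unicodedata module to check whether a preformed barred
letter and a base + overlay sequence ever become identical under
normalization. The helper name matches_after_normalization and the sample
word are made up for this example; a real search tool may apply its own
folding rules on top of (or instead of) normalization.

```python
import unicodedata

BARRED_I = "\u0268"    # LATIN SMALL LETTER I WITH STROKE (preformed)
SEQ_SHORT = "i\u0335"  # i + COMBINING SHORT STROKE OVERLAY
SEQ_LONG = "i\u0336"   # i + COMBINING LONG STROKE OVERLAY


def matches_after_normalization(a: str, b: str) -> bool:
    """Return True if a and b are identical under any normalization form."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


# Hypothetical sample word spelled with the preformed letter.
text = "p\u0268t"

# A plain substring search finds the preformed letter only.
print(BARRED_I in text)
print(SEQ_SHORT in text)

# Normalization does not bridge the gap for barred letters...
print(matches_after_normalization(BARRED_I, SEQ_SHORT))
print(matches_after_normalization(BARRED_I, SEQ_LONG))

# ...whereas it does for a genuinely precomposed letter such as
# U+1E0F LATIN SMALL LETTER D WITH LINE BELOW.
print(matches_after_normalization("\u1e0f", "d\u0331"))
```

Any equivalence between U+0268 and such sequences would therefore have to
come from the search tool itself, not from Unicode normalization.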

One side note: although it may not be obvious from the text
of the standard, the intent has been for U+0335 COMBINING SHORT STROKE
OVERLAY to represent the "bar" of letters like barred i's,
barred l's, barred u's, barred o's, and such, cited in the
abstract. If you expect matching between existing barred letters
(e.g. U+0180 LATIN SMALL LETTER B WITH STROKE, U+0268 LATIN SMALL
LETTER I WITH STROKE, etc.) and sequences of base letters
with an overstruck diacritic, then U+0335 is the diacritic
of choice. This, despite the fact that for letters (such as u)
with two (or more) vertical segments to be crossed, the bar is
typically extended to make it longer than the shortish stroke
of the bars for i's, l's, etc.

U+0336 COMBINING LONG STROKE OVERLAY was intended more as the
poor man's strikethrough diacritic, comparable to U+0305 COMBINING
OVERLINE (instead of styled overlining) and U+0332 COMBINING
LOW LINE (instead of styled underlining). Those are all
wide diacritics that -- in principle at least -- connect to
neighboring instances of the same diacritics if used in sequences.

They contrast -- again by design, although not always in practice
in all fonts -- with the short stroke diacritics:

U+0304 COMBINING MACRON
U+0335 COMBINING SHORT STROKE OVERLAY
U+0331 COMBINING MACRON BELOW

which do *not* connect from letter to letter.
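
For anyone who wants to inspect the six marks side by side, here is a
small Python sketch using the standard unicodedata module. The grouping
into WIDE and SHORT lists and the helpers describe and strike are just
names chosen for this example. Note that the character properties only
give names and combining classes; whether the wide marks actually join
up from letter to letter is left to the font and renderer.

```python
import unicodedata

# Wide marks, intended to connect across neighboring letters.
WIDE = ["\u0305", "\u0336", "\u0332"]
# Short marks, intended to stay on a single letter.
SHORT = ["\u0304", "\u0335", "\u0331"]


def describe(mark: str) -> str:
    """Format a mark as code point, character name, and combining class."""
    return "U+%04X %s (ccc=%d)" % (
        ord(mark),
        unicodedata.name(mark),
        unicodedata.combining(mark),
    )


def strike(text: str) -> str:
    """Poor man's strikethrough: follow every character with U+0336."""
    return "".join(ch + "\u0336" for ch in text)


for label, marks in (("wide", WIDE), ("short", SHORT)):
    for mark in marks:
        print(label, describe(mark))

# How the result looks depends entirely on the font in use.
print(strike("struck out"))
```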

>
> Unicode does not analyze and decompose letters with a stroke as containing
> a diacritic mark. Instead, they are coded as separate characters. (I've
> never seen an explanation for this, but it's certainly too late to change
> such issues, and the decision is understandable if you consider how the
> "stroke" in letters varies in shape.)

That is exactly the issue. The combining diacritics have been
encoded primarily by shape, but there is a gray area where the
diacritics crossing letter forms interact in complex ways with
individual letter shapes -- to the point where it becomes very
difficult to pick out the exact shapes of diacritics as individual
parts. The ambiguity about the location and length of bars
when used as diacritics for letters is a case in point. Also
look at the varied placements *and* widths of the overstruck
diacritic tildes: see U+1D6C .. U+1D76 for recently encoded
examples. If you use a diacritic tilde overlay on an m, you have
to make it substantially different from one used on a b or d,
for example.

There is no absolute line you can draw between clearly separably
encodable diacritics such as acute and grave and ones which
are not encoded. Note that in addition to bars, tildes, and other
overlays, Latin letters have had a large number of other
shape extensions and distortions used as diacritic modifications:
hooks (of all shapes and positions), curls, swashes, ligations,
Cyrillic-style angular bars, etc., but also inversions, reversals,
and turnings (including 90 degree turnings as well as 180 degree
turnings).

The Unicode Standard had to draw the line someplace. A combining
acute diacritic had to be encoded. A combining letter reversal
diacritic could not be encoded. And the line was drawn between
diacritics that typically touch an existing letter form but
don't otherwise modify its shape (cedilla, ogonek, Vietnamese
hook) and diacritics that completely overlay a letter form and
which themselves take unpredictable shapes (bars, tildes, etc.).

Some of the latter kinds of diacritics are separately encoded
as combining marks, in part because it is useful to be able
to talk *about* them in text or to use them in ad hoc
combinations on occasion, but none of them are actually
used in canonical decomposition mappings for precomposed
letters.
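
The dividing line described above can be observed in the character data.
The following Python sketch (standard unicodedata module; the helper name
canonical_decomposition and the choice of sample letters are mine) prints
the canonical decomposition mapping for a few letters with attached marks
and for a few letters with overlays. It only samples a handful of
characters and is not an exhaustive survey.

```python
import unicodedata


def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping of ch, or '' if none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; skip those.
    return "" if mapping.startswith("<") else mapping


ATTACHED = [
    "\u00e7",  # c with cedilla
    "\u0105",  # a with ogonek
    "\u01a1",  # o with horn (Vietnamese)
]
OVERLAID = [
    "\u0268",  # i with stroke
    "\u0180",  # b with stroke
    "\u00f8",  # o with stroke
    "\u1d6c",  # b with middle tilde
]

for ch in ATTACHED + OVERLAID:
    mapping = canonical_decomposition(ch) or "(none)"
    print("U+%04X %s -> %s" % (ord(ch), unicodedata.name(ch), mapping))
```

The attached-mark letters report a base letter plus a combining mark,
while the overlaid letters report no mapping at all.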

>
> > Same applies to the LINE BELOW (U+0331 or U+0332?)

Not U+0332. See above.

>
> No, that's a different issue, because there are precomposed characters with
> those characters as components.
>
> > Should <d><D><l><L><r><R><t><T> with line below be used as combined
> > diacritics, or as precomposed codepoints?

I'd advise use of the precomposed code points, since they exist.

> It depends. You need to consider the different factors. Unicode just tells
> that there is canonical equivalence and there are various
> normalization forms. On the practical side, depending on implementations
> and not on the Unicode standard, the precomposed form (when available)
> is better supported by software and results in better rendering. But there
> are many factors that might make decomposed form more feasible.

One of the advantages of using the precomposed forms for these letters
is that they will be in NFC automatically, which means no change
for data in this orthography when going in and out of systems
or protocols which normalize data to NFC.
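
To make the NFC point concrete, here is a minimal Python sketch (standard
unicodedata module; the helper name compose is invented for the example)
showing what NFC does to the eight base letters in question when followed
by U+0331, and, for contrast, by U+0332.

```python
import unicodedata

LINE_BELOW = "\u0331"  # COMBINING MACRON BELOW
LOW_LINE = "\u0332"    # COMBINING LOW LINE


def compose(base: str, mark: str) -> str:
    """Normalize the sequence base + mark to NFC."""
    return unicodedata.normalize("NFC", base + mark)


for base in "dDlLrRtT":
    with_macron = compose(base, LINE_BELOW)
    with_low_line = compose(base, LOW_LINE)
    print(
        base,
        "+ U+0331 ->", " ".join("U+%04X" % ord(c) for c in with_macron),
        "| + U+0332 ->", " ".join("U+%04X" % ord(c) for c in with_low_line),
    )
```

Sequences with U+0331 collapse to a single precomposed code point, so
text entered that way changes on its first trip through an NFC-normalizing
system, whereas text entered with the precomposed letters does not.
Sequences with U+0332 are left as two code points.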

Even if you need to make use of combining marks, as in your d and
t with acute, it might make sense to run an evaluation of your
orthographic recommendations, to see if they result in an orthography
which is stable under Normalization Form C. (Of course you can't
guarantee that for all text, because people could insert arbitrary
combinations of other characters into the text -- but I'm talking
about the orthography itself, by design.) Being in NFC from the
start may give the orthography better behavior in some processing
contexts.
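
One way such an evaluation might be run is sketched below in Python
(standard unicodedata module). The function name unstable_under_nfc and
the sample inventory are hypothetical; the real input would be the
spellings your orthography actually recommends.

```python
import unicodedata


def unstable_under_nfc(letters):
    """Return the recommended spellings that NFC would rewrite."""
    return [s for s in letters if unicodedata.normalize("NFC", s) != s]


# Hypothetical inventory of recommended spellings.
inventory = [
    "\u0268",        # preformed i with stroke
    "\u1e0f",        # precomposed d with line below
    "d\u0301",       # d + combining acute (no precomposed form exists)
    "t\u0301",       # t + combining acute (no precomposed form exists)
    "d\u0331",       # d + combining macron below: NFC composes this
    "d\u0301\u0331", # marks in non-canonical order: NFC reorders them
]

for spelling in unstable_under_nfc(inventory):
    normalized = unicodedata.normalize("NFC", spelling)
    print(
        " ".join("U+%04X" % ord(c) for c in spelling),
        "->",
        " ".join("U+%04X" % ord(c) for c in normalized),
    )
```

An empty result would suggest the recommended spellings are already in
NFC by design; anything printed is a spelling worth reconsidering.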

--Ken

>
> > I'm asking, because I need to use <d><D><t><T> with <U+0301> anyways to
> > get the desired glyph...


This archive was generated by hypermail 2.1.5 : Tue May 29 2007 - 16:51:39 CDT