Re: Digraphs

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 16 2000 - 22:25:26 EST

Next message: Edward Cherlin: "Identifying and supporting encodings (was Re: 8859-1,...)"
Previous message: Kenneth Whistler: "Re: UCS-4, UCS-2, UTF-16, UTF-8"
Maybe in reply to: Christopher John Fynn: "Digraphs"
Next in thread: Marco.Cimarosti@icl.com: "RE: Digraphs"

Chris Fynn asked:

>
> How is it recommended to code Latin script digraphs that are used to
> represent a single letter?

By the appropriate sequence of characters representing the parts.

>
> For example, in Roman transliteration of Indic languages the digraph "kh" or
> "Kh" occurs with a combining low line below (centred between the k and the
> h).
>
> see:
> http://ourworld.compuserve.com/homepages/stone_catend/trdis-4.htm
>
>
> Should this be entered as
> <k> <zero width joiner> <h> <combining low line>?

No.

The purpose of the ZERO WIDTH JOINER is not to create arbitrary
text elements out of some sequence of characters. It is to create
the appropriate context for the visual rendition of cursively joined
forms of glyphs (or, under the latest UTC decision, cues for the choice
of ligated forms from fonts).

Digraphs, and more generally, multigraphs, are examples of text elements
that vary from orthography to orthography. The same sequence of characters
may have an orthographic and phonologic status as a unit in one writing
system, and be simply a sequence of characters in another.

However, it would be inadvisable to try to mark this up in the plain text,
since that is as likely to cause misinterpretations of data as it is
to clarify anything. It should be up to higher-level protocols, on a language-by-
language (or orthography-by-orthography) basis, to interpret digraphs as
units or not.
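
As a rough illustration of what such a higher-level protocol might look like, the following Python sketch segments the same plain text differently depending on an orthography label. The table `MULTIGRAPHS`, its keys, and the function `segment` are hypothetical names invented for this example; the digraph list is illustrative and not an authoritative inventory for any language.

```python
# Hypothetical per-orthography multigraph tables (illustrative only).
MULTIGRAPHS = {
    "indic-translit": ["kh", "gh", "ch", "jh"],
    "plain": [],
}

def segment(text, orthography):
    """Split text into text elements, treating listed multigraphs as units."""
    # Try longer multigraphs first so the longest match wins.
    units = sorted(MULTIGRAPHS[orthography], key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for unit in units:
            if text.startswith(unit, i):
                out.append(unit)
                i += len(unit)
                break
        else:
            # No multigraph starts here: emit a single character.
            out.append(text[i])
            i += 1
    return out

print(segment("khaga", "indic-translit"))  # ['kh', 'a', 'g', 'a']
print(segment("khaga", "plain"))           # ['k', 'h', 'a', 'g', 'a']
```

The stored text is identical in both calls; only the interpretation layered on top of it changes. The sketch is case-sensitive and ignores combining marks, which a real implementation would need to handle.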

After all, consider even English -- you don't have to get as exotic as
Latin transliteration of Devanagari extensions for Arabic to run into
the problem. In "this width", the two "th"'s are, of course, digraphs --
in fact two digraphs with different status, since they represent two
distinct phonemes in English. However, it would be inappropriate to
try to toss joiners or some other mechanism into the plain text representation
of that phrase to indicate the status of the digraphs.

Marking up text for morphological or phonological analysis is another issue --
then you can mark whatever you please. But that would be an example of what
the Unicode Standard means by a higher-level protocol. And for that, I
could choose to mark digraphs with brackets, or an equal sign, or substitute
them out with a phonetic symbol, or... whatever.
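
One way to picture this kind of markup is as annotations kept apart from the text. The Python sketch below assumes a made-up annotation format of (start, end, label) tuples and a hypothetical helper `bracketed`; neither comes from the Unicode Standard. The plain text "this width" is left untouched, and the bracketed form is derived only when it is wanted.

```python
text = "this width"

# Hypothetical standoff annotations: (start, end, phoneme label).
annotations = [
    (0, 2, "ð"),   # "th" in "this"
    (8, 10, "θ"),  # "th" in "width"
]

def bracketed(text, spans):
    """Return a copy of text with each annotated span wrapped in brackets."""
    out = text
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, _label in sorted(spans, reverse=True):
        out = out[:start] + "[" + out[start:end] + "]" + out[end:]
    return out

for start, end, label in annotations:
    print(text[start:end], "->", label)

print(bracketed(text, annotations))  # [th]is wid[th]
print(text)                          # this width (unchanged)
```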

>
> If this is so, what about unaccented pairs like "kh", "gh", "ch", "jh", etc.
> which in transliteration of Indic languages similarly represent single
> letters?

It is exactly the same issue. The accents on the forms are not what is in
question.

John Cowan suggested, for this particular Indic transliteration example:

> Looks like <k> <combining low line> <h> <combining low line> to me.

And I concur, since this seems to involve a specialized use of the underscore
to indicate pharyngeal place of articulation (as Robert Wheelock explained).
Note also the h-underscore in the same list of standardized transliterations.
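
For concreteness, that sequence can be written out and inspected in Python with the standard `unicodedata` module. This is only a sketch of the encoding suggested above; the variable name `kh` is arbitrary.

```python
import unicodedata

# <k> <combining low line> <h> <combining low line>
kh = "k\u0332h\u0332"

for ch in kh:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Four code points, and no ZERO WIDTH JOINER (U+200D) anywhere.
print(len(kh))
print("\u200d" in kh)
```

Each low line attaches to the base letter before it, so no joining character is involved in the representation.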

In other instances not involving pharyngeal articulation, but where a
visual indication of connectedness of two letters is intended in text,
the use of U+203F UNDERTIE or U+2040 CHARACTER TIE might be appropriate.

More extensive use of ties, including ranges across multiple characters
or phrases, should be dealt with by higher-level protocols, as for phrasing
in music.

Otherwise, for digraphs (or multigraphs) per se, we are just dealing with
the appropriate sequence of characters. Another example, from Northwest
American Indian languages, involving both a diacritic and a digraph, would
be:

U+0071 U+0313 U+02B7 (for an ejective labialized uvular stop)

This is a basic phoneme in the languages which have this sound, and as such
constitutes a "letter" of their alphabets as well -- but there is no particular
need to introduce some joining markup in the plain text in order to represent
the unit. The sequence noted above will work just fine for the text content.
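
A short Python check of that sequence, again using the standard `unicodedata` module, shows what is actually stored. The variable name `stop` is arbitrary, and the NFC comparison is added here only to illustrate that normalization does not alter the sequence.

```python
import unicodedata

# U+0071 U+0313 U+02B7
stop = "\u0071\u0313\u02B7"

for ch in stop:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Three code points, with no joiner or other markup between them.
print(len(stop))

# NFC leaves the sequence as it is: there is no precomposed form
# for q with a combining comma above.
print(unicodedata.normalize("NFC", stop) == stop)
```

Whether software treats these three code points as one letter for sorting or searching is again a matter for language-specific processing on top of the text.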

>
> As far as simple rendering goes, there is no need to link these pairs (other than
> to prevent line wrap or hyphenation) - but for processing of transliterated
> material they often should be treated as single entities.
>
> Is ZWJ appropriate, or is this another case for Michael's ZWL (even though they
> are not properly speaking ligatures)?

No. It is not appropriate, and does not make the case for either of these.

--Ken

>
> - Chris
>
>
>

