From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Mar 12 2008 - 05:01:49 CST
Karl Pentzlin wrote:
> Following the description in p.542 of TUS 5.0, the CGJ (i.e.
> U+034F COMBINING GRAPHEME JOINER) separates graphemes, e.g.
> in Slovak, it prevents a "ch" to be interpreted as a grapheme.
> Thus, the CGJ splits or separates, but does not "join" in any case.
CGJ joins combining characters that would otherwise be part of separate
combining sequences, because its combining class is zero. This
zero-combining class is the interesting feature of CGJ because it allows the
canonical reordering to preserve the relative order of combining accents. It
is effectively used as a separator, but only for the purpose of delimiting
reorderable sequences during normalization.
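
The reordering point can be checked with a minimal sketch, assuming Python's standard unicodedata module and an illustrative pair of Latin accents (the accents and helper name are my choice for illustration, not taken from TUS). Without CGJ, canonical reordering swaps the two marks; with CGJ between them, the encoded order survives:

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def nfd(text):
    """Canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Illustrative sequences: acute encoded before dot below.
plain = "a" + ACUTE + DOT_BELOW          # marks are free to reorder
guarded = "a" + ACUTE + CGJ + DOT_BELOW  # CGJ delimits the two marks

print("combining class of CGJ:", unicodedata.combining(CGJ))
print("plain   ->", code_points(nfd(plain)))
print("guarded ->", code_points(nfd(guarded)))
```

The class-220 mark moves in front of the class-230 mark only in the first sequence, because the class-zero CGJ stops the reordering in the second.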
However it still has its own identity, and thus a base character followed by
any number of combining characters or CGJ is not equivalent to the base
character alone. So in Slovak or any other language, C + CGJ would be a
default grapheme cluster, separated from the H that is encoded after it.
CGJ is not used there to "separate" the two sequences. In fact Slovak in
your example considers that a C followed by an H is a single letter, but it
does not "say" anything about C+CGJ, which is a grapheme cluster quite
distinct from C; it is only for that reason that it prevents the
*semantic* interpretation of the sequence as a "CH" digraph.
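
As a rough illustration of "its own identity", the sketch below (Python's unicodedata again; the digraph test is a deliberately naive, hypothetical helper) shows that none of the four normalization forms drops the CGJ, so <C,CGJ,H> never compares equal to <C,H>, and a plain search for the letter pair no longer matches:

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

plain = "ch"
marked = "c" + CGJ + "h"


def has_adjacent_ch(word):
    """Hypothetical, naive digraph test: 'c' immediately followed by 'h'."""
    return "ch" in word.lower()


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, marked)
    # CGJ has no decomposition, so every form leaves the sequence unchanged.
    print(form, normalized == marked, normalized == plain)

print("plain  has adjacent ch:", has_adjacent_ch(plain))
print("marked has adjacent ch:", has_adjacent_ch(marked))
```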
But even in this case, it does not prevent the possible formation of a
ligature, or kerning, or any contextual forms in highly decorated font
styles, or cursive linking. For this reason, I do think that preventing the
interpretation of a digraph should really not use CGJ as a distinctive
encoding of the first letter of a candidate digraph; I'd rather use a
separate disjoiner between C and H, in order to preserve the semantics of the
first C.
Notably, your CGJ does not separate words, and it does not prevent
hyphenation (unlike digraphs where hyphenation would be preferably avoided
in the middle):
I would encode <C,SHY,H> for example if hyphenation is suggested (for
example when C and H are part of distinct syllables, something that could
happen in many languages permitting compound words and/or agglutination or
prefixes/suffixes), or <C,WJ,H> if this is a basic separation between the
two preserved grapheme clusters <C> and <H> that does not introduce a word
break.
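
To make the three candidate encodings concrete, here is a small sketch (Python's unicodedata; the labels and the describe helper are only for illustration) that lists the code points involved. It shows that CGJ is a nonspacing mark (Mn), so it attaches to the preceding C, while SHY and WJ are format characters (Cf) that sit between the two letters:

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
SHY = "\u00ad"  # SOFT HYPHEN
WJ = "\u2060"   # WORD JOINER


def describe(sequence):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in sequence
    ]


# Illustrative labels for the three encodings discussed above.
candidates = {
    "<C,CGJ,H>": "c" + CGJ + "h",
    "<C,SHY,H>": "c" + SHY + "h",
    "<C,WJ,H>": "c" + WJ + "h",
}

for label, sequence in candidates.items():
    print(label)
    for row in describe(sequence):
        print("   ", *row)
```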
Be warned when handling texts in languages treating pairs of letters as
digraphs as if they were a single letter; there are almost always many
exceptions. It would be preferable to use an explicit digraph joiner to mark
the letter pairs, but this is almost never encoded due to the frequency of
occurrence of such digraphs in the languages where they are defined or viewed
as if they were a single letter.
But then tweaking the other exceptions by transforming the first letter of
candidate digraphs and appending a CGJ to it looks like a severe tweak: it
breaks the semantics if you do that on the final letter of a component
agglutinated/compounded with a next element whose initial letter may create
an undesired digraph opportunity.
Can you give examples in Slovak where CGJ is really needed between C and H
to avoid the interpretation as a digraph? I've seen many more examples when
it was not CGJ but SHY (and not just in Slovak). It looks like this
"interpretation problem" only happens in languages that sort digraphs
differently in their tailored collation. In most cases, collation ordering is
not specified or needed, and the encoding is left transparent, in order to
preserve the orthography and semantics of encoded morphemes (including within
compound words, or with prefixes, suffixes, infixes).
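
The collation case can be sketched as follows. This is a toy model, not the Unicode Collation Algorithm: it assumes a simplified, unaccented alphabet in which the digraph "ch" sorts after "h", and it assumes that any of CGJ, SHY or WJ between C and H blocks the digraph and is otherwise ignored. All names in it are hypothetical:

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
SHY = "\u00ad"  # SOFT HYPHEN
WJ = "\u2060"   # WORD JOINER
IGNORABLE = {CGJ, SHY, WJ}

# Assumed, simplified alphabet: "ch" is one letter placed after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def collation_elements(word):
    """Split a word into letters, treating adjacent 'c'+'h' as one letter."""
    elements = []
    text = word.lower()
    i = 0
    while i < len(text):
        if text[i] in IGNORABLE:
            i += 1                      # marker is skipped, but it has
            continue                    # already kept 'c' and 'h' apart
        if text.startswith("ch", i):
            elements.append("ch")
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements


def sort_key(word):
    """Map letters to ranks; unknown characters sort after the alphabet."""
    return [RANK.get(e, len(ALPHABET)) for e in collation_elements(word)]


words = ["chata", "cena", "hora", "ihla", "c" + CGJ + "hata"]
for word in sorted(words, key=sort_key):
    print(collation_elements(word))
```

In this model the unmarked "chata" sorts after "hora", while the marked variant sorts among the words beginning with "c", which is the only place where the choice of marker becomes visible.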
Due to the increasing use of borrowed words, many languages have abandoned
the distinction of digraphs like CH and removed them from their "alphabet",
and now recognize morphemes only lexically: if this creates a real
ambiguity, an explicit hyphen may be written to make the distinction from
the interpretation as a single digraph.