RE: Digraph "ch" et al. [was: Questions about proposed character s]

From: Becker, Joseph (Joseph.Becker@pahv.xerox.com)
Date: Sun May 30 1999 - 14:12:06 EDT


> My native language, Slovak, uses the digraph "ch", yet I cannot find it
> anywhere in the Unicode standard. Ch, as used in Slovak (and, I believe, in
> Czech), is not just two characters typed after each other. It is a separate
> character.

Digraph "ch" is definitely a unit in Slovak, Czech, and a dozen other
languages (see below). Such digraph (and n-graph) combinations are not
coded in Unicode, with a very few compatibility exceptions. The reason is
simply that coding them would make text processing in these and other
languages more difficult. Check out section 2.1 of the Unicode book.
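
As an illustration of the "very few compatibility exceptions" mentioned above, here is a minimal Python sketch. It assumes the exceptions meant include the precomposed digraph characters U+0133, U+01C6, U+01C9 and U+01CC (that choice is an assumption of this example, not something stated in the message), and it uses the standard unicodedata module to show that compatibility normalization turns each of them back into an ordinary sequence of letters.

```python
import unicodedata

# Assumed examples of digraphs that do have a single code point,
# kept in Unicode for compatibility with older character sets.
COMPAT_DIGRAPHS = ["\u0133", "\u01C6", "\u01C9", "\u01CC"]


def expand(text):
    """Apply compatibility normalization (NFKC) to text."""
    return unicodedata.normalize("NFKC", text)


for ch in COMPAT_DIGRAPHS:
    expanded = expand(ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"1 code point -> {expanded!r} ({len(expanded)} code points)"
    )
```

Plain "ch", by contrast, has no code point of its own: it is always the two characters 'c' and 'h', and any unit-like treatment has to come from the process handling the text.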

I'll append below a listing I once compiled of Latin digraphs/n-graphs. I'm
sure it is full of inaccuracies, but it should give the idea. This list
omits combinations with diacritical marks and with apostrophe treated as a
letter. There is a list of similar length for Cyrillic.

Note that "ch" and many other combinations are digraphs *in English*
(E ng l i sh); though they are never counted as *letters* of English, they
are treated as units in many processes. I'll append some interesting old
comments by Glenn Adams on the subject.

Joe

--------------------------------

aa olddanish, oldnorwegian
ai yoruba
au yoruba
bj wendish
ch spanish, portuguese, malay, indonesian, polish, oldhausa, slovak, czech,
   wendish, catalan, breton, welsh, javanese, bugotu, swahili, zulu, navajo,
   choctaw, nahuatl, quechua, guarani, aymara, ido, interlingua
cs hungarian
cz polish, oldhungarian
dd welsh
dh albanian, irish
dj indonesian, slovene, javanese
dl zulu, navajo
dz polish, ewe, oldlatvian, navajo
ff welsh
gb yoruba, ewe
gc zulu
gh navajo
gj albanian
gn bugotu
gq zulu
gu catalan
gx zulu
gy hungarian
hh zulu
hl suto, zulu, chuana, choctaw
hw navajo
ie oldlatvian
ij dutch
jh guarani
kh malay, zulu
kp ewe
ks slovene
kw navajo
lh portuguese
lj serbocroatian, slovene, wendish
ll spanish, albanian, catalan, welsh, quechua
ly hungarian
mj wendish
nc zulu
nh portuguese
ng malay, indonesian, tagalog, visayan, welsh, javanese, maori, bugotu
ngg bugotu
nj indonesian, serbocroatian, albanian, slovene
nx zulu
ny malay, hungarian, catalan, zulu
oi yoruba
ph welsh, zulu, interlingua
pj wendish
qh zulu
qu catalan, interlingua
rh welsh
rj slovene, wendish
rr spanish, albanian
rz polish
sh malay, oldhausa, albanian, zulu, navajo, choctaw, ido
sj indonesian
sz polish, hungarian
szcz polish
th albanian, welsh, bugotu, interlingua
tj indonesian, slovene
tl suto, chuana, nahuatl
tr malagash
ts wendish, ewe, malagash
tsh zulu, navajo
ty hungarian
tz nahuatl
uw southkurdish
wh maori
wj wendish
xh albanian, zulu, navajo
zh albanian
zs hungarian

--------------------------------

Date: 8 Dec 92 18:10:36 PST (Tuesday)
Subject: Spanish letters "ch" and "ll"
From: <Glenn Adams>
To: <Wayne Pollock>
cc: <unicode>

> Date: Sun, 29 Nov 92 19:46:15 EST
> From: <Wayne Pollock>

> I just finished reading Denis Garneau's report, referenced in the Unicode
> standard, on searching and sorting to produce expected results depending on
> the culture of the user (i.e., the sort order of the same sequence of
> letters is different if you are American or French). And I learned
> something new: that in Spanish there are multi-character letters, namely
> "ch" and "ll". These are apparently not ligatures but true letters in the
> language, and 'ch' would sort differently than the letter 'c' followed by
> an 'h'.

> A quick perusal of the Unicode standard ASCII, Latin, and Extended Latin
> code blocks reveals these Spanish letters are missing. Using two character
> codes (the 'c' and 'h', or two 'l's) instead of a single code for each of
> these letters doesn't seem to fit with what (little) I know of Unicode
> design; I thought all true letters from any script would merit their own
> codes?

I just read your question on this topic. If it hasn't already been pointed
out, I might mention that Unicode doesn't necessarily encode the atomic
units of writing systems; rather, it encodes symbols which can be used to
form such units. It also is not the case that 'ch' and 'll' are "letters"
of Spanish writing systems, even though they operate as atomic units for
some collating sequences -- it is not even universally true that Spanish is
collated in this fashion.
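
To make the collation point concrete, here is a minimal Python sketch of one way to sort with 'ch' and 'll' acting as atomic units. The alphabet table, the function name traditional_key and the sample words are all assumptions made for illustration; accents and case beyond simple lowercasing are ignored, and this is not how any particular library implements Spanish collation.

```python
# Simplified traditional ordering: "ch" follows "c", "ll" follows "l".
TRADITIONAL = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {unit: i for i, unit in enumerate(TRADITIONAL)}
DIGRAPHS = {"ch", "ll"}


def traditional_key(word):
    """Return a list of ranks, one per collation unit in word."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            # The two code points form one unit for this operation.
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown symbols sort after the whole alphabet.
            key.append(RANK.get(word[i], len(TRADITIONAL)))
            i += 1
    return key


words = ["cuna", "chico", "dama", "llama", "luz", "cosa"]
print(sorted(words))                       # code point order
print(sorted(words, key=traditional_key))  # "ch" and "ll" as units
```

The stored text is identical in both calls; only the key function decides whether 'c' 'h' counts as one unit or two, which is why a separate character code is not needed for this purpose.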

The term used by Unicode to describe these units -- 'ch' and 'll' -- is
"text element." The manner in which these elements are interpreted may
depend upon a number of factors, e.g., the operation being performed,
parameters of the operation, the language and orthography represented by the
data, and even the data itself. For example, the English words "cathouse"
and "cathode" treat the sequence 't' 'h' as two units in the first instance
and one unit in the second instance for the hyphenation operation.

Many other writing systems variously treat multiple symbols or forms as
representing one entity at some level of abstraction; for example,
Vietnamese often sorts 'ch', 'gi', 'gh', 'kh', 'ng', 'ngh', 'nh', 'th', and
'tr' as single units (depending on the dictionary).
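
A rough Python sketch of how such units might be picked out of a plain character sequence follows. It uses a greedy longest-match pass over a per-language table; the function name segment and the table names are hypothetical, the Vietnamese set is simply the list quoted above, and tone marks and other diacritics are ignored.

```python
# Units taken from the list above; a real table would depend on the dictionary.
VIETNAMESE_UNITS = {"ch", "gi", "gh", "kh", "ng", "ngh", "nh", "th", "tr"}


def segment(word, units):
    """Split word into text elements, preferring the longest known unit."""
    longest = max((len(u) for u in units), default=1)
    elements = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so "ngh" wins over "ng".
        for size in range(longest, 1, -1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in units:
                elements.append(chunk)
                i += size
                break
        else:
            # No multi-character unit starts here: emit a single symbol.
            elements.append(word[i])
            i += 1
    return elements


print(segment("nghe", VIETNAMESE_UNITS))
print(segment("nga", VIETNAMESE_UNITS))
print(segment("khanh", VIETNAMESE_UNITS))
```

A purely mechanical pass like this cannot tell "cathode" from "cathouse": with the unit set {"th"} it groups 't' 'h' in both words, so a process that needs the distinction also has to consult the data itself, for example through an exception list.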

I hope this helps some. The problem of deciding what is a "letter" is not
always simple. However, in general, combinations of basic symbols such as
those mentioned here are not considered to be "letters" in any analysis.
They are usually called 'digraphs' or 'bigraphemes'; one can also have
trigraphs and n-graphs on the same principle.

By the way, what do you think about 'qu' and 'ch' in English writing
systems? They are true digraphs in English since they always have an atomic
phonological interpretation (although not always the same one, e.g., chin,
chivalry, chiropractor, yacht -- notice that the first three of these occur
in exactly the same context: 'chi-'). Should they have a single character
encoding? If not, then why not?

Glenn Adams

--------------------------------


