> My native language, Slovak, uses the digraph "ch", yet I cannot find it
> anywhere in the Unicode standard. Ch, as used in Slovak (and, I believe,
> in Czech), is not just two characters typed after each other. It is a
> separate character.
The digraph "ch" is definitely a unit in Slovak, Czech, and a dozen other
languages (see below).  Such digraph (and n-graph) combinations are not
coded in Unicode, with a very few compatibility exceptions.  The reason is
simply that coding them would make text processing in these and other
languages more difficult.  Check out section 2.1 of the Unicode book.
I'll append below a listing I once compiled of Latin digraphs/n-graphs.  I'm
sure it is full of inaccuracies, but it should give the idea.  This list
omits combinations with diacritical marks and with apostrophe treated as a
letter.  There is a list of similar length for Cyrillic.
Note that "ch" and many other combinations are digraphs *in English*
(E ng l i sh); though they are never counted as *letters* of English, they
are treated as units in many processes.  I'll append some interesting old
comments by Glenn Adams on the subject.
Joe
--------------------------------
aa  olddanish, oldnorwegian
ai  yoruba
au  yoruba
bj  wendish
ch  spanish, portuguese, malay, indonesian, polish, oldhausa, slovak, czech,
wendish, catalan, breton, welsh, javanese, bugotu, swahili, zulu, navajo,
choctaw, nahuatl, quechua, guarani, aymara, ido, interlingua
cs  hungarian
cz  polish, oldhungarian
dd  welsh
dh  albanian, irish
dj  indonesian, slovene, javanese
dl  zulu, navajo
dz  polish, ewe, oldlatvian, navajo
ff  welsh
gb  yoruba, ewe
gc  zulu
gh  navajo
gj  albanian
gn  bugotu
gq  zulu
gu  catalan
gx  zulu
gy  hungarian
hh  zulu
hl  suto, zulu, chuana, choctaw
hw  navajo
ie  oldlatvian
ij  dutch
jh  guarani
kh  malay, zulu
kp  ewe
ks  slovene
kw  navajo
lh  portuguese
lj  serbocroatian, slovene, wendish
ll  spanish, albanian, catalan, welsh, quechua
ly  hungarian
mj  wendish
nc  zulu
nh  portuguese
ng  malay, indonesian, tagalog, visayan, welsh, javanese, maori, bugotu
ngg bugotu
nj  indonesian, serbocroatian, albanian, slovene
nx  zulu
ny  malay, hungarian, catalan, zulu
oi  yoruba
ph  welsh, zulu, interlingua
pj  wendish
qh  zulu
qu  catalan, interlingua
rh  welsh
rj  slovene, wendish
rr  spanish, albanian
rz  polish
sh  malay, oldhausa, albanian, zulu, navajo, choctaw, ido
sj  indonesian
sz  polish, hungarian
szcz    polish
th  albanian, welsh, bugotu, interlingua
tj  indonesian, slovene
tl  suto, chuana, nahuatl
tr  malagash
ts  wendish, ewe, malagash
tsh zulu, navajo
ty  hungarian
tz  nahuatl
uw  southkurdish
wh  maori
wj  wendish
xh  albanian, zulu, navajo
zh  albanian
zs  hungarian
--------------------------------
Date:  8 Dec 92 18:10:36 PST (Tuesday)
Subject: Spanish letters "ch" and "ll"
From: <Glenn Adams>
To: <Wayne Pollock>
cc: <unicode>
> Date: Sun, 29 Nov 92 19:46:15 EST
> From: <Wayne Pollock>
> I just finished reading Denis Garneau's report, referenced in the Unicode
> standard, on searching and sorting to produce expected results depending
> on the culture of the user (i.e., the sort order of the same sequence of
> letters is different if you are American or French).  And I learned
> something new: that in Spanish there are multi-character letters, namely
> "ch" and "ll". These are apparently not ligatures but true letters in
> the language, and 'ch' would sort differently than the letter 'c' followed
> by an 'h'.
> A quick perusal of the Unicode standard ASCII, Latin, and Extended Latin
> code blocks reveals these Spanish letters are missing.  Using two character
> codes (the 'c' and 'h', or two 'l's) instead of a single code for each of
> these letters doesn't seem to fit with what (little) I know of Unicode
> design; I thought all true letters from any script would merit their own
> codes?
I just read your question on this topic.  If it hasn't already been pointed
out, I might mention that Unicode doesn't necessarily encode the atomic
units of writing systems; rather, it encodes symbols which can be used to
form such units.  It also is not the case that 'ch' and 'll' are "letters"
of Spanish writing systems, even though they operate as atomic units for
some collating sequences -- it is not even universally true that Spanish is
collated in this fashion.
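A minimal Python sketch of the kind of collation in question, in which 'ch'
and 'll' are consumed as single units when building a sort key.  The
alphabet list and the names TRADITIONAL_ES, RANK, and traditional_key are
assumptions made for illustration; they do not come from the Unicode
standard or from Garneau's report, and accented vowels are ignored.

```python
# Traditional Spanish alphabet order: "ch" follows "c", "ll" follows "l".
# Illustrative assumption only; accented vowels are not handled.
TRADITIONAL_ES = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
                  "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
                  "v", "w", "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(TRADITIONAL_ES)}


def traditional_key(word):
    """Return a list of ranks, treating 'ch' and 'll' as single units."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Symbols outside the alphabet sort after all known units.
            key.append(RANK.get(word[i], len(TRADITIONAL_ES)))
            i += 1
    return key


words = ["cuna", "chico", "dama", "loma", "llama", "cama"]
by_code_point = sorted(words)                      # plain character order
by_tradition = sorted(words, key=traditional_key)  # digraphs as units
```

In plain character order "chico" falls between "cama" and "cuna"; with the
digraph-aware key it follows every other word beginning with 'c'.  The
stored text is the same two characters in both cases; only the collation
rule differs.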
The term used by Unicode to describe these units -- 'ch' and 'll' -- is
"text element."  The manner in which these elements are interpreted may
depend upon a number of factors, e.g., the operation being performed,
parameters of the operation, the language and orthography represented by the
data, and even the data itself.  For example, the English words "cathouse"
and "cathode" treat the sequence 't' 'h' as two units in the first instance
and one unit in the second instance for the hyphenation operation.
Many other writing systems variously treat multiple symbols or forms as
representing one entity at some level of abstraction; for example,
Vietnamese often sorts 'ch', 'gi', 'gh', 'kh', 'ng', 'ngh', 'nh', 'th', and
'tr' as single units (depending on the dictionary).
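A rough sketch of how such units might be picked out of a character
sequence, assuming a simple greedy longest-match rule.  The function name
segment and the constant VIETNAMESE_UNITS are hypothetical; the set just
repeats the combinations listed above, and real dictionaries differ in
which of them they recognize.

```python
def segment(word, units):
    """Split word into text elements, preferring the longest listed unit."""
    longest = max((len(u) for u in units), default=1)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so 'ngh' wins over 'ng'.
        for size in range(longest, 1, -1):
            piece = word[i:i + size]
            if piece in units:
                out.append(piece)
                i += len(piece)
                break
        else:
            # No multi-character unit starts here: emit a single symbol.
            out.append(word[i])
            i += 1
    return out


# Hypothetical unit set, copied from the combinations named in the text.
VIETNAMESE_UNITS = {"ch", "gi", "gh", "kh", "ng", "ngh", "nh", "th", "tr"}

as_units = segment("nghe", VIETNAMESE_UNITS)
as_symbols = segment("nghe", set())
```

The same stored characters yield different elements depending on the unit
set passed in.  A mechanical matcher like this one cannot handle the
"cathouse"/"cathode" case, where the data itself decides; that needs a
dictionary or morphological knowledge.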
I hope this helps some.  The problem of deciding what is a "letter" is not
always simple.  However, in general, combinations of basic symbols such as
those mentioned here are not considered to be "letters" in any analysis.
They are usually called 'digraphs' or 'bigraphemes'; one can also have
trigraphs and n-graphs on the same principle.
By the way, what do you think about 'qu' and 'ch' in English writing
systems?  They are true digraphs in English since they always have an atomic
phonological interpretation (although not always the same one, e.g., chin,
chivalry, chiropractor, yacht -- notice that the first three of these occur
in exactly the same context: 'chi-').  Should they have a single character
encoding?  If not, why not?
Glenn Adams
--------------------------------
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:46 EDT