2011/7/1 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> I wonder if anyone has some statistics on the use of CGJ. Its revised
> intended use was to disrupt collating sequences, but you may be right
> about its most frequent use being to disrupt canonical reordering. A
> few years ago I concluded it wasn't yet safe to type the Welsh place
> name Llan͏gollen with CGJ.
Interestingly, I can't get this name to render correctly in my
Chrome version on Windows 7; it just displays the occurrence of CGJ as
a non-spacing dotted box, overwriting the surrounding characters "n"
and "g" so that the place name is completely unreadable.
I just wonder why Chrome needs to display this control in such a
disruptive way (I have not checked with other browsers).
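To check what is actually in the text, independently of how a browser
draws it, one can list the code points. This is only a small diagnostic
sketch using Python's standard unicodedata module; the helper name
describe() is made up for illustration.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def describe(text):
    """Return (code point, name, general category, combining class) per character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
            unicodedata.combining(ch),
        )
        for ch in text
    ]


if __name__ == "__main__":
    for row in describe("Llan" + CGJ + "gollen"):
        print(*row)
```

CGJ is reported as a non-spacing mark (Mn) with combining class 0, and
it has no decomposition, so normalization leaves it in place.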
Why do you need CGJ between "n" and "g"?
- Is that to make sure that they won't collate as a single element
"ng" but separately? How is it different here from the collation of
"language" where the situation would be similar?
- Or do you intend to do the reverse, i.e. effectively collate "ng" in
"Llangollen" as a single element?
Sorry I don't know Welsh, all I know is that "ng" is a digram of its
alphabet, which also includes "n" and "g" as separate letters... Other
digrams are "dd" contrasting with isolated "d", "ff" contrasting with
isolated "f", "ll" contrasting with isolated "l", "ph" contrasting
with isolated "p" and "h", "rh" contrasting with isolated "r" and "h",
and finally "th" contrasting with isolated "t" and "h".
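To make the first reading of the question concrete, here is a rough
sketch of how a Welsh-style collator might split a word into letters,
with CGJ preventing a digram from being matched. It is not a real
collation implementation: the function name collation_elements() is
hypothetical, the digram list is just the one given above, and the
greedy left-to-right matching is an assumption.

```python
CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Digrams taken from the list in this message (not a complete tailoring).
DIGRAMS = ("dd", "ff", "ll", "ng", "ph", "rh", "th")


def collation_elements(word):
    """Split a word into letters, greedily matching digrams.

    A CGJ between two letters keeps them from matching as a digram;
    the CGJ itself produces no element of its own.
    """
    lower = word.lower()
    elements = []
    i = 0
    while i < len(lower):
        if lower[i] == CGJ:
            i += 1  # ignored for sorting, but it has already blocked the match
            continue
        pair = lower[i:i + 2]
        if pair in DIGRAMS:
            elements.append(pair)
            i += 2
        else:
            elements.append(lower[i])
            i += 1
    return elements


if __name__ == "__main__":
    print(collation_elements("Llangollen"))
    print(collation_elements("Llan" + CGJ + "gollen"))
```

Without CGJ the sketch yields "ng" as one element; with CGJ it yields
"n" and "g" separately. An English word like "language" would need the
same treatment under such a tailoring, which is the point of the
question.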
Those Welsh digrams are not exceptional, you'll find them in many
other Latin-based languages, except that they are not considered as
single letters in their alphabets. Welsh is very close to Breton, but
the latter lists far fewer digraphs/trigraphs (such as "ch" and
"c’h").
French and English, for example, use a lot of digrams as well, but due
to the huge number of lexical imports from various etymologies, these
languages have not attempted to fix a rule in their alphabets for
digraphs, and so they just list the letters separately.
The digram analysis requires contextual analysis of phonology and
morphology, including dictionary lookups to fix the correct
hyphenation. Such contextual lexical lookup is probably needed as well
in Welsh, which certainly borrows lots of English words today.
If your intent is to indicate to a word hyphenator some "don't break
here" condition (in the middle of an exceptional digram), or "break
allowed here" (in the middle of what the language alphabet generally
considers as an unbreakable digram), there are probably better
controls (other kinds of joiners/disjoiners) than CGJ to specify that.
[There exist some C1 controls inherited from ISO 8859-1 and EBCDIC,
except that these C1 controls have very poor support and various
incompatible system-specific usage, or would not be allowed in
transport layers, or could be considered invalid by some document
technical parsers. Another well supported control is the SOFT HYPHEN
which explicitly encodes a "break allowed here", and that you could
insert just before the "ng" digram in "Llangollen" if it is
effectively a digram in this context.]
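As an illustration of the SOFT HYPHEN suggestion, the following sketch
inserts U+00AD at chosen offsets and shows that stripping it recovers
the plain spelling. The helper name with_soft_hyphens() and the choice
of offset 3 (just before "ng") are assumptions made for the example,
not a statement about correct Welsh hyphenation.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN: an invisible "break allowed here" hint


def with_soft_hyphens(word, positions):
    """Return word with a SOFT HYPHEN inserted before each given offset."""
    pieces = []
    prev = 0
    for pos in sorted(positions):
        pieces.append(word[prev:pos])
        prev = pos
    pieces.append(word[prev:])
    return SHY.join(pieces)


if __name__ == "__main__":
    marked = with_soft_hyphens("Llangollen", [3])
    print([f"U+{ord(ch):04X}" for ch in marked])
    # Removing the hints gives back the original spelling.
    print(marked.replace(SHY, ""))
```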
Received on Sat Jul 02 2011 - 09:05:13 CDT