From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 27 2003 - 22:47:34 EDT
Philippe Verdy continued:
> From: "Mark Davis" <mark.davis@jtcsv.com>
> > From: "Anto'nio Martins-Tuva'lkin" <antonio@tuvalkin.web.pt>
> > > On 2003.05.25, 00:00, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> > > > even if the Dutch language considers it as a single letter, in a
> > > > way similar to the Spanish "ch"
> > >
> > > I see one major difference: When you apply extra wide inter-char
> > > distance, you (should) get, f.i.:
> > > K o r t r ij k and not K o r t r i j k
> > > but E l c h e and not E l ch e
> > > This is common practice in both Spanish and Dutch typography, ISTK.
> > > I was told in this forum that the surest way to keep this working in
> > > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> > > languages.
> >
> > Well, I don't know who told you, but WORD JOINER only affects
> > linebreak behavior, not intercharacter spacing.
>
> I think he meant <ZWJ> (the zero-width joiner) used as markup to
> create a ligated variant of a pair of characters in some languages
> that offer two very distinct forms (I think about Brahmic scripts
> such as Devanagari)...
No, not ZWJ, either.

U+2060 WORD JOINER (WJ) impacts line breaking behavior -- not the
applicable concept here.

U+200D ZERO WIDTH JOINER (ZWJ) impacts cursive connection and/or
ligation -- not the applicable concept here.

U+034F COMBINING GRAPHEME JOINER (CGJ) is the relevant character.
From Unicode 4.0:
"U+034F COMBINING GRAPHEME JOINER is used to indicate that
adjacent characters are to be treated as a unit for the
purposes of language-sensitive collation and searching."
That function was deliberately limited by the UTC to the status
of such digraphs for searching and sorting, as that was the only
well-defined requirement for the character.
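
For anyone who wants to check which of the three joiners they are actually holding, here is a small Python sketch that prints the name and general category of each, using only the standard unicodedata module. The abbreviation table and the describe() helper are my own illustrative names, not anything defined by the standard.

```python
import unicodedata

# The three "joiner" characters discussed above, keyed by their usual
# abbreviations (the dictionary itself is just an illustration).
JOINERS = {
    "WJ": "\u2060",   # line breaking
    "ZWJ": "\u200D",  # cursive connection and/or ligation
    "CGJ": "\u034F",  # the one relevant to digraphs
}


def describe(ch):
    """Return the code point, character name, and general category."""
    return "U+%04X %s (gc=%s)" % (
        ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
    )


for abbreviation, ch in JOINERS.items():
    print(abbreviation, "=", describe(ch))
```

Note that CGJ is a combining mark, while WJ and ZWJ are format characters, so the three are not interchangeable even at the property level.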
However, as this thread has hinted, there could, in principle,
be multilingual contexts where there would be other legitimate
reasons for treating a digraphic ij (as for Dutch) as distinct
from a non-digraphic ij sequence (as for Spanish). That is the
same kind of argument that led to the encoding of U+034F for
collation.
One can imagine an implementation of automatic letterspacing,
such that a sequence marked explicitly as a digraph would not
expand, but that one not so marked would expand. But such
distinctions would only need to be made in the rather dubious
conditions of: A) Multilingual text that is also B) marked
explicitly for language and that also C) requires different
rules for letterspacing language-by-language. Under such
circumstances, you could indicate the differences for <ij>
either by making use of the U+0133 ij digraph character for
one and <i,j> for the other, or you could indicate the
differences by <i,CGJ,j> versus <i,j>. The first approach
would likely work more easily with existing software, but
results in a problematical representation of Dutch data.
The second is a more generic Unicode approach, but would
likely be ignored by most software.
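
To make the two marking approaches concrete, here is a minimal Python sketch of a letterspacer that honors explicit digraph marking. It is only an illustration under my own assumptions: letterspace() is a hypothetical helper, the gap is a plain space rather than real tracking, and the CGJ is dropped from the output once it has done its job. No existing software is claimed to behave this way.

```python
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER
IJ_DIGRAPH = "\u0133"  # LATIN SMALL LIGATURE IJ (the "ij digraph character")


def letterspace(text, gap=" "):
    """Insert `gap` between units, keeping <x,CGJ,y> together as one unit.

    Hypothetical helper: a character that follows a CGJ is glued to the
    preceding unit instead of starting a new one. U+0133 needs no special
    handling, since it is already a single character.
    """
    units = []
    glue_next = False
    for ch in text:
        if ch == CGJ:
            # Only glue when there is something before the CGJ to glue to.
            glue_next = bool(units)
            continue
        if glue_next:
            units[-1] += ch
            glue_next = False
        else:
            units.append(ch)
    return gap.join(units)


print(letterspace("Kortri" + CGJ + "jk"))     # marked as a digraph
print(letterspace("Kortrijk"))                # unmarked sequence
print(letterspace("Kortr" + IJ_DIGRAPH + "k"))  # U+0133 approach
```

The marked form stays together as "ij" while the unmarked <i,j> expands like any other pair, which is the distinction described above.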
In any case, the much more likely situation would be software
that did letterspacing for fine typography based just on
Dutch rules. It would not *need* any markup of <i,j>
sequences, since it would be looking for and special-casing
the sequences, anyway.
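
A rough Python sketch of that markup-free case, assuming text already known to be Dutch: the pattern and the letterspace_dutch() name are hypothetical, and the rule is deliberately naive. It treats every "ij" or "IJ" sequence as a digraph, so it would also glue an i and a j that merely happen to be adjacent, and it ignores mixed-case "Ij".

```python
import re

# Match the digraph first, then fall back to any single character.
# (Hypothetical rule set: only "ij" and "IJ" are special-cased.)
_DUTCH_UNITS = re.compile(r"ij|IJ|.", re.DOTALL)


def letterspace_dutch(text, gap=" "):
    """Letterspace Dutch text, keeping ij/IJ sequences unexpanded."""
    return gap.join(_DUTCH_UNITS.findall(text))


print(letterspace_dutch("Kortrijk"))
print(letterspace_dutch("IJsselmeer"))
```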
--Ken