Re: Spacing diacritics in Greek Extended

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 28 2001 - 21:12:49 EST


Nick Nicholas said:

> As you know, in the short term any texts out there in Unicode
> polytonic Greek use precomposed characters, as people are not waiting for
> the intelligent font engines of the future. To put texts in Unicode, they
> convert them from existing codings. In all of these existing codings, be
> they 8-bit or ASCII-based (Beta Code), a capital letter with diacritics
> (titlecase) is rendered as two glyphs: the diacritics, as a spacing glyph,
> and then the capital.
>
> Since people have no familiarity with single-glyph
> capitals-with-diacritics, they are doing the same with their precomposed
> Unicode glyphs, using the spacing diacritics at the bottom of Greek
> Extended. See for example
> http://www.fordham.edu/halsall/basis/thomais-uni.html : the diacritics in
> section 5, at least, are separate glyphs.
>
> Unicode allows these spacing diacritic glyphs, but the Standard says that
> "unless information is present to the contrary", they should be
> interpreted as SPACE + non-spacing equivalent diacritic (Unicode 3.0,
> p.169-170). Would it be expedient to change this to having it postmodify the
> next character, as a legitimate legacy concern (which is why the
> precomposeds are there in the first place?)

No, if what you mean is a mechanical change of interpretation of such
a sequence, so that the Unicode Standard would specify that:

1F0A (for example) = <1FCD, 0391> = <0391, 0313, 0300>

The intermediate node of that equivalence, <1FCD, 0391>, would be
totally out of whack for Unicode, formally, since it decomposes instead to:

<0020, 0313, 0300, 0391>

i.e., not the same as the recursive decomposition of 1F0A.

What the text on pp. 169-170 says, in full, is:

"Decomposition of [Greek Diacritic] Spacing Forms. When decomposing
the spacing forms, the spacing status of the implied usage must be
taken into account. Unless information is present to the contrary,
these spacing forms would be decomposed to U+0020 SPACE followed by
the nonspacing form equivalents shown in Table 7-2."

The exegesis of that passage is as follows.

If you are simply decomposing text by a general algorithm, as for
a Unicode Normalization Form (UAX #15), then you *must* use the
normative decomposition mappings, as specified by that algorithm.
I.e., <1FCD, 0391> normalized to NFKD is <0020, 0313, 0300, 0391>
and nothing else.
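
For anyone who wants to check this mechanically, here is a minimal
sketch using Python's standard unicodedata module. The codepoints()
helper is just an illustrative name, not part of any standard API, and
the results depend on the Unicode data shipped with your Python build.

```python
import unicodedata

def codepoints(s):
    """Return the code points of s as 4-digit uppercase hex strings."""
    return [f"{ord(c):04X}" for c in s]

# U+1F0A GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA
precomposed = "\u1F0A"
# U+1FCD GREEK PSILI AND VARIA (spacing) followed by U+0391 capital alpha
legacy = "\u1FCD\u0391"

# Canonical decomposition of the precomposed capital.
nfd_precomposed = codepoints(unicodedata.normalize("NFD", precomposed))
# Compatibility decomposition of the spacing-diacritic sequence.
nfkd_legacy = codepoints(unicodedata.normalize("NFKD", legacy))

print(nfd_precomposed)  # ['0391', '0313', '0300']
print(nfkd_legacy)      # ['0020', '0313', '0300', '0391']
```

The two results differ, which is the point: no normalization form
makes the two sequences equivalent.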

However, if you have "information present to the contrary", as would
be the case if you were doing intelligent conversion of polytonic
Greek, then it is perfectly o.k. to take a Unicode representation
of a compatibility sequence, i.e. <1FCD, 0391>, perhaps obtained by
a one-to-one mapping against an 8-bit implementation, and turn that
into the preferred Unicode representation of polytonic Greek,
i.e., <0391, 0313, 0300>. This is a knowing transformation of the
data from one form to another form by a process aware of these
equivalences. But that is comparable, for example, to doing
a transliteration from one form to another form, rather than being
a built-in normative equivalence defined by the Unicode Standard
itself.
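
A rough sketch of what such a knowing transformation might look like,
again in Python. Everything here is an assumption for illustration: the
function names are made up, only the spacing forms that carry a
breathing are handled, and "followed by a Greek capital" is tested with
a simple range check on U+0391..U+03A9. A real converter would need to
know more about its source encoding than this does.

```python
import unicodedata

# Spacing forms from Greek Extended that carry a breathing (assumed subset):
# psili, psili+varia, psili+oxia, psili+perispomeni,
# dasia, dasia+varia, dasia+oxia, dasia+perispomeni.
SPACING_BREATHINGS = "\u1FBF\u1FCD\u1FCE\u1FCF\u1FFE\u1FDD\u1FDE\u1FDF"

def combining_marks(spacing):
    """Derive the nonspacing equivalents by dropping the SPACE from NFKD."""
    decomposed = unicodedata.normalize("NFKD", spacing)
    if not decomposed.startswith(" "):
        raise ValueError(f"U+{ord(spacing):04X} is not SPACE + marks under NFKD")
    return decomposed[1:]

MARKS = {ch: combining_marks(ch) for ch in SPACING_BREATHINGS}

def attach_spacing_diacritics(text):
    """Rewrite <spacing diacritic, Greek capital> as <capital, combining marks>.

    Spacing diacritics not followed by a capital are left untouched.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in MARKS and nxt and "\u0391" <= nxt <= "\u03A9":
            out.append(nxt + MARKS[ch])  # letter first, then its marks
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

converted = attach_spacing_diacritics("\u1FCD\u0391")
print([f"{ord(c):04X}" for c in converted])  # ['0391', '0313', '0300']
```

The output is the decomposed sequence; running it through
unicodedata.normalize("NFC", ...) gives U+1F0A if the precomposed form
is wanted instead.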

>
> Fortunately the main online resource for converting into Unicode
> polytonic Greek (Sean Redmond's,
> http://www.jiffycomp.com/smr/unicode/convert.php3) is well-behaved in
> this regard.

Good. I expect for these kinds of issues smart implementers ought
to be able to "do the right thing". ;-)

--Ken


