From: John Hudson (tiro@tiro.com)
Date: Sun Jul 06 2003 - 20:22:52 EDT
At 16:15 06/07/2003, Peter Kirk wrote:
>I have a couple of points to make now on this issue. First, it might
>help to get an idea of the scale of the problem. In the WTS encoded text
>of the BHS Hebrew Bible, which comes to 5.25 MB in UTF-8, so a million
>or so vowel points, there are just 637 instances of two vowel points on
>one consonant. Of these, 636 are the word Yerushala(y)im, in four
>slightly different forms including two with the directional he suffix.
>The one additional instance is in the word mittaxat in Exodus 20:4,
>which has a double vowel for a rather different reason - alternative
>pronunciations of the word.
Thanks for the thoughtful analysis, Peter. Eli Evans and I have been
documenting all of the unique mark sequences in the Michigan-Claremont text
and WTS morphology database that are potentially incorrectly re-ordered in
Unicode normalisation (I say potentially, because the fixed position
combining classes may, by chance, not reorder some combinations of vowels).
In addition to the <patah, hiriq> and <qamats, hiriq> double vowel
sequences for Yerushala(y)im, the example you cite from Exodus 20:4
involves two vowels with an interposed cantillation mark -- <qamats,
etnahta, patah> -- which needs to be renderable both with and without the
cantillation. The WTS morphology database also includes a <tsadi, sheva,
hiriq> sequence (in 2 Ch 13:14, last word) that is not attested in either
BHS or BHL; Peter Constable enquired about this, since it seemed that it
might be an error, but the WTS editors assured him that it was intentional.
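To make the "potentially re-ordered" point concrete, here is a minimal sketch using Python's standard unicodedata module. It prints the combining classes of the marks in the sequences mentioned above and reports whether normalisation changes their order. The SAMPLES table and the is_reordered helper are names made up for this illustration, and the base consonant paired with each mark sequence is only an assumption for the demo (apart from the tsadi, which is named above).

```python
import unicodedata

# Hebrew marks named in the message (code points from the Unicode Hebrew block).
SHEVA, HIRIQ, PATAH, QAMATS = "\u05B0", "\u05B4", "\u05B7", "\u05B8"
ETNAHTA = "\u0591"
# Base consonants; which base carries each sample is illustrative only.
LAMED, TAV, TSADI = "\u05DC", "\u05EA", "\u05E6"

# Hypothetical sample table: label -> base consonant followed by its marks.
SAMPLES = {
    "patah, hiriq": LAMED + PATAH + HIRIQ,
    "qamats, hiriq": LAMED + QAMATS + HIRIQ,
    "qamats, etnahta, patah": TAV + QAMATS + ETNAHTA + PATAH,
    "sheva, hiriq": TSADI + SHEVA + HIRIQ,
}

def is_reordered(text, form="NFC"):
    """Return True if normalisation changes the stored order of the marks."""
    return unicodedata.normalize(form, text) != text

for label, text in SAMPLES.items():
    # text[0] is the base consonant; the rest are combining marks.
    classes = [unicodedata.combining(ch) for ch in text[1:]]
    print(f"<{label}> classes={classes} reordered={is_reordered(text)}")
```

With the fixed-position classes as currently assigned, the first three samples come out re-ordered, while <sheva, hiriq> happens to be stored in ascending class order already and survives untouched, which is the "by chance" case.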
One thing we have not checked yet is whether there are any attested
examples of cantillation marks that normally appear to the left of vowels
occurring to the right. This seems unlikely, but nothing would surprise me
about Biblical manuscripts, and such mark ordering would be affected by
normalisation so should be checked and, hopefully, confirmed not to be an
issue.
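A check of this kind could be sketched roughly as follows: walk through a text, collect each run of combining marks, and tally the distinct runs whose stored order differs from a stable sort by combining class (which is what canonical ordering does to such a run). The function name reordered_mark_runs is hypothetical, and this is not the procedure actually used on the Michigan-Claremont or WTS data, just one way it might be done.

```python
import unicodedata
from collections import Counter

def reordered_mark_runs(text):
    """Tally distinct runs of combining marks that canonical ordering would change.

    A run is a maximal sequence of characters with a non-zero combining
    class; canonical ordering is a stable sort of such a run by class.
    """
    counts = Counter()
    run = []
    # The trailing NUL (class 0) is a sentinel that flushes the final run.
    for ch in text + "\0":
        if unicodedata.combining(ch) != 0:
            run.append(ch)
            continue
        if len(run) > 1 and sorted(run, key=unicodedata.combining) != run:
            counts["".join(run)] += 1
        run = []
    return counts

if __name__ == "__main__":
    # Two <patah, hiriq> runs and one <sheva, hiriq> run on illustrative bases.
    sample = "\u05DC\u05B7\u05B4 \u05E6\u05B0\u05B4 \u05DC\u05B7\u05B4"
    for marks, n in reordered_mark_runs(sample).items():
        names = ", ".join(unicodedata.name(m) for m in marks)
        print(f"{n} x <{names}>")
```

Running something like this over the full text would give a list of affected sequences with counts, which could then be inspected by hand for any cantillation marks in unexpected positions.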
While I agree that the number of textual instances (in the known Ben Asher
texts, at least) that are affected by the combining class problem is very
small, and that re-encoding Hebrew vowels may be overkill as a solution,
I'm not crazy about the proposed CGJ solution, because I'm not convinced
that I'm going to see CGJ support any time soon. Given the small number of
attested sequences that would be adversely affected by normalisation
re-ordering, I'm beginning to favour the idea of encoding these sequences
as individual characters. We'd probably only need three or four, plus a
right meteg, to solve the problem, and rendering would work fine with
existing font and layout engine technologies.
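For comparison, the CGJ approach as I understand it relies on U+034F COMBINING GRAPHEME JOINER having combining class 0, so that placing it between two vowels stops normalisation from swapping them. A minimal sketch of that behaviour, assuming a lamed as the base and using only Python's standard unicodedata module:

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER
LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

plain = LAMED + PATAH + HIRIQ          # vowels adjacent
joined = LAMED + PATAH + CGJ + HIRIQ   # CGJ interposed between the vowels

# CGJ has combining class 0, so it ends the run of marks that gets sorted.
print("CGJ class:", unicodedata.combining(CGJ))
print("plain preserved: ", unicodedata.normalize("NFC", plain) == plain)
print("joined preserved:", unicodedata.normalize("NFC", joined) == joined)
```

This only shows that the encoded order survives normalisation; it says nothing about whether fonts and layout engines will handle the extra character, which is the support question raised above.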
Of course, I still hold out the faint hope that bodies like W3C and the
IETF will say it is okay for Unicode to correct the existing combining
classes and actually fix the problem at source.
John Hudson
Tiro Typeworks www.tiro.com
Vancouver, BC tiro@tiro.com
The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit