RE: Yerushala(y)im - or Biblical Hebrew

From: John Hudson (tiro@tiro.com)
Date: Sun Jul 06 2003 - 20:22:52 EDT

    At 16:15 06/07/2003, Peter Kirk wrote:

    >I have a couple of points to make now on this issue. First, it might
    >help to get an idea of the scale of the problem. In the WTS encoded text
    >of the BHS Hebrew Bible, which comes to 5.25 MB in UTF-8, so a million
    >or so vowel points, there are just 637 instances of two vowel points on
    >one consonant. Of these, 636 are the word Yerushala(y)im, in four
    >slightly different forms including two with the directional he suffix.
    >The one additional instance is in the word mittaxat in Exodus 20:4,
    >which has a double vowel for a rather different reason - alternative
    >pronunciations of the word.

    Thanks for the thoughtful analysis, Peter. Eli Evans and I have been
    documenting all of the unique mark sequences in the Michigan-Claremont text
    and WTS morphology database that are potentially incorrectly re-ordered in
    Unicode normalisation (I say potentially, because the fixed position
    combining classes may, by chance, not reorder some combinations of vowels).
    In addition to the <patah, hiriq> and <qamats, hiriq> double vowel
    sequences for Yerushala(y)im, the example you cite from Exodus 20:4
    involves two vowels with an interposed cantillation mark -- <qamats,
    etnahta, patah> -- which needs to be renderable both with and without the
    cantillation. The WTS morphology database also includes a <tsadi, sheva,
    hiriq> sequence (in 2 Ch 13:14, last word) that is not attested in either
    BHS or BHL; Peter Constable enquired about this, since it seemed that it
    might be an error, but the WTS editors assured him that it was intentional.
    One thing we have not checked yet is whether there are any attested
    examples of cantillation marks that normally appear to the left of vowels
    occurring to the right. This seems unlikely, but nothing would surprise me
    about Biblical manuscripts, and such mark ordering would be affected by
    normalisation so should be checked and, hopefully, confirmed not to be an
    issue.
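
    As a rough way to check which of the sequences above are actually
    disturbed, here is a minimal Python sketch using the standard
    unicodedata module. The helper names and the choice of base letters
    are mine, for illustration only; the combining classes come from
    whatever Unicode data ships with the interpreter.

```python
import unicodedata

# Code points for the marks named above; base letters are illustrative only.
LAMED, TSADI = "\u05DC", "\u05E6"
SHEVA, HIRIQ, PATAH, QAMATS = "\u05B0", "\u05B4", "\u05B7", "\u05B8"
ETNAHTA = "\u0591"


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


def is_reordered(text):
    """True if NFC normalisation changes the code point sequence."""
    return unicodedata.normalize("NFC", text) != text


samples = {
    "patah, hiriq": LAMED + PATAH + HIRIQ,
    "qamats, hiriq": LAMED + QAMATS + HIRIQ,
    "qamats, etnahta, patah": LAMED + QAMATS + ETNAHTA + PATAH,
    "sheva, hiriq": TSADI + SHEVA + HIRIQ,
}

for name, text in samples.items():
    print(name, combining_classes(text), "reordered:", is_reordered(text))
```

    The first three sequences should come out as reordered, because a
    mark with a lower combining class follows one with a higher class.
    The <sheva, hiriq> case should survive, since its classes already
    happen to be in ascending order.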

    While I agree that the number of textual instances (in the known Ben Asher
    texts, at least) that are affected by the combining class problem is very
    small, and that re-encoding Hebrew vowels may be overkill as a solution,
    I'm not crazy about the proposed CGJ solution, because I'm not convinced
    that I'm going to see CGJ support any time soon. Given the small number of
    attested sequences that would be adversely affected by normalisation
    re-ordering, I'm beginning to favour the idea of encoding these sequences
    as individual characters. We'd probably only need three or four, plus a
    right meteg, to solve the problem, and rendering would work fine with
    existing font and layout engine technologies.
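
    For comparison, the CGJ approach as I understand it would look
    something like the sketch below: U+034F has combining class 0, so
    placing it between two marks should stop normalisation from swapping
    them. The protect() helper is hypothetical, and whether fonts and
    layout engines then render the result correctly is a separate
    question that this does not test.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0
LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"


def protect(base, marks):
    """Hypothetical helper: put CGJ between marks to block reordering."""
    return base + CGJ.join(marks)


plain = LAMED + PATAH + HIRIQ
guarded = protect(LAMED, [PATAH, HIRIQ])

# The unguarded sequence is reordered; the guarded one should be stable.
print(unicodedata.normalize("NFC", plain) == plain)
print(unicodedata.normalize("NFC", guarded) == guarded)
```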

    Of course, I still hold out the faint hope that bodies like W3C and the
    IETF will say it is okay for Unicode to correct the existing combining
    classes and actually fix the problem at source.

    John Hudson

    Tiro Typeworks www.tiro.com
    Vancouver, BC tiro@tiro.com

    The sight of James Cox from the BBC's World at One,
    interviewing Robin Oakley, CNN's man in Europe,
    surrounded by a scrum of furiously scribbling print
    journalists will stand for some time as the apogee of
    media cannibalism.
                             - Emma Brockes, at the EU summit


