From: Jony Rosenne (rosennej@qsm.co.il)
Date: Wed Jul 02 2003 - 10:42:11 EDT
I cannot agree with some of these statements. My comments are inserted.
Jony
> -----Original Message-----
> From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
> Sent: Wednesday, July 02, 2003 2:43 PM
> To: Jony Rosenne
> Cc: unicode@unicode.org
> Subject: Re: Yerushala(y)im - or Biblical Hebrew
>
>
> On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne
> <rosennej@qsm.co.il> wrote:
>
> > I would like to summarize my understanding:
> >
> > 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It is
> > invalid in Hebrew to have two vowels for one letter. It may or may
> > not be a valid Unicode sequence, but there are many examples of
> > valid Unicode sequences that are invalid.
>
> Only invalid for Modern Hebrew.
No - it is true for Biblical Hebrew and any other form as well. The
extra vowel belongs to another letter, which is known to exist but
isn't printed.
> In addition, we are not
> discussing the *validity* of the Unicode/ISO 10646
> encoding: any Unicode string is valid even if it is not
> normalized, provided that it uses assigned code points and
> respects a few constraints, such as approved variant
> sequences and the correct use of surrogate code units
> (isolated surrogate code points are forbidden).
I tried to say that although it may be valid Unicode, it is not valid
Hebrew.
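The distinction can be made concrete with a short Python sketch (the code
points and combining classes below are from the Unicode character database):
Lamed + Patah + Hiriq is a well-formed string, but it is not in Normalization
Form C, and normalizing it silently reorders the two vowel points.

```python
import unicodedata as ud

# Lamed + Patah + Hiriq: a well-formed Unicode string that is not in NFC,
# because Hiriq (ccc=14) sorts before Patah (ccc=17) under canonical ordering.
s = "\u05DC\u05B7\u05B4"

print(ud.is_normalized("NFC", s))                      # False: valid, just not normalized
print(ud.normalize("NFC", s) == "\u05DC\u05B4\u05B7")  # True: NFC swaps the two points
```

So "valid Unicode" and "normalized Unicode" are separate questions, and
neither one says anything about whether the sequence is valid Hebrew.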
>
> The issue is created by the Unicode normalization of text,
> which is NOT required for Unicode encoding validity, but only
> for text processing (notably with legacy HTML and SGML, or
> the newer XML, XHTML and related standards based on XML).
>
> You have not understood the issue with *Traditional Hebrew*,
> where there are actually two or more vowels for one base
> letter, notably in Biblical texts but certainly also in many
> other manuscripts of the same epochs, and probably later and
> still today, for as long as these texts, so important to
> human culture, have been (and will be) studied by scholars,
> researchers and interested people, whether they were (are, or
> will be) historians, sociologists, economists, linguists,
> translators, theologians, religious adherents, or workers in
> the many other scientific disciplines studied for millennia
> (including mathematics, astronomy, medicine...).
See above.
>
> What has been demonstrated here is that the current combining
> classes defined on Hebrew characters were not needed for
> Modern Hebrew (which could have been written perfectly well
> with all vowels given CC=0), but were instead encoded with
> "randomly assigned" combining classes on the vowels (for
> which the generic 220 and 230 classes were not usable).
Unicode Hebrew points and cantillation marks were defined with Biblical
Hebrew in mind.
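The classes in question are easy to inspect with Python's unicodedata:
each of the points from Sheva (U+05B0) through Holam (U+05B9) carries
its own distinct canonical combining class, running 10 through 19 in
code point order.

```python
import unicodedata as ud

# Hebrew points Sheva (U+05B0) through Holam (U+05B9): each carries a
# distinct canonical combining class, 10..19, following code point order.
for cp in range(0x05B0, 0x05BA):
    print(f"U+{cp:04X} {ud.name(chr(cp))}: ccc={ud.combining(chr(cp))}")
```

It is these mutually distinct non-zero classes that make canonical
ordering reshuffle adjacent vowel points in the first place.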
>
> The initial encoding may have been done by studying only
> fragments of the traditional texts, which exposed some
> combinations of vowels, without really searching through such
> important traditional texts as the Hebrew Bible (and
> certainly also some old versions of the Torah, old Hebrew
> translations of the Qur'an, or famous Roman Latin, Greek,
> Phoenician, or Syriac manuscripts, in a Middle-East region
> that has seen many foreign invasions and been at the
> crossroads of the most famous cultures and trade routes). For
> all the vowels for which no preference order could be
> demonstrated (in the studied fragments of text), the
> combining classes were mostly defined in an order matching
> the code point order of the legacy 8-bit encodings, on the
> assumption that occurrences of those vowels together would be
> rare and would not cause problems.
There are no such cases, barring misunderstandings.
>
> When further old scripts are added to Unicode, I do think
> that Unicode should not draw conclusions from a small set of
> text fragments: further research may demonstrate that a
> definition of non-zero combining classes introduces too many
> problems to allow encoding new texts, for which an existing
> normalization would incorrectly swap combining characters and
> change the semantics of the encoded text. These old texts
> should be handled on the assumption that the typist who
> entered and encoded them transcribed them correctly, and an
> NF* normalization should not change this decision
> automatically, as doing so would frustrate all the effort the
> transcriber made to produce an accurate transcript of the
> encoded text.
>
> I think that if there are good reasons to define combining
> classes for the normalization of some categories of text, we
> should be prepared to sacrifice the unification of characters
> whenever it causes a problem, or Unicode and ISO 10646 should
> agree to define/assign a generic code point with class "Mn",
> CC=0, whose only role would be to bypass the currently
> assigned non-zero CC value of combining characters, even if,
> temporarily, this causes some problems for text rendering
> engines (which could be corrected later to treat this
> character as ignorable for all rendering purposes, including
> searches for possible ligatures).
>
> I suggest that such a code point be allocated in the U+03XX
> block for generic combining characters, so that it can be
> used in any script, including the existing ones. This
> character would be named "Combining Variant Selector" (CVS);
> it would preserve the semantics of the diacritic it is
> prefixed to, and it would not override the current semantics
> of the "Combining Grapheme Joiner" (CGJ), which may have
> specific uses for creating ligatures between diacritics and
> should continue to be canonically ordered, so that if
> diacritic <A> has CC=a and diacritic <B> has CC=b, and a < b,
> the sequence <A, CGJ, B> would be valid, but not <B, CGJ, A>
> unless the combining class of A is overridden with
> <B, CGJ, CVS, A>.
>
> This definition preserves the current semantics of the CGJ
> (without extending it too far beyond what was intended when
> it was defined), and it makes it possible to define combining
> classes for the most usual cases of an encoded script without
> compromising the future, should rarer texts be discovered for
> which the initial unification work makes normalization
> violate the semantics of the old text.
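The effect such a CC=0 bypass character would have can be sketched with
an existing CC=0 mark: CGJ (U+034F) carries combining class 0 in the
published Unicode data, so placing it between Patah and Hiriq splits the
reorderable run and the typed vowel order survives normalization. This
Python sketch illustrates only the mechanism, not any sanctioned use of
CGJ for this purpose.

```python
import unicodedata as ud

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Without a separator, NFC reorders Patah (ccc=17) after Hiriq (ccc=14):
print(ud.normalize("NFC", LAMED + PATAH + HIRIQ)
      == LAMED + HIRIQ + PATAH)                        # True

# A ccc=0 character between the two points breaks the reorderable run,
# so the original order is normalization-stable:
print(ud.normalize("NFC", LAMED + PATAH + CGJ + HIRIQ)
      == LAMED + PATAH + CGJ + HIRIQ)                  # True
```

Any combining mark with class 0 interrupts canonical reordering in this
way, which is exactly the property the proposed CVS relies on.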
This archive was generated by hypermail 2.1.5 : Wed Jul 02 2003 - 10:39:04 EDT