From: Peter Constable (petercon@microsoft.com)
Date: Fri May 28 2004 - 16:58:09 CDT
> From: Peter Kirk [mailto:peterkirk@qaya.org]
> Sent: Friday, May 28, 2004 1:40 PM
> Well, I understood the semantic content of a text to be the meaning of
> the words
Unicode encodes characters, not languages, not morphemes, not senses of
words. The character semantics of "Sally" and of "Sally" transliterated
into Hebrew are not the same.
> >Her "Phoenician words" in this case are probably something like her
> >name, or a transliteration of English words.
> Is it really in the scope of Unicode to encode such trivialities? I
> have a key ring with my name "written" in an Egyptian hieroglyphic
> pseudo-alphabet. Will such abuse of Egyptian hieroglyphs have to be
> taken into account in the possible Unicode proposal for this script?
Why is that an abuse of hieroglyphs any more than Hebrew text
transliterated or transcribed in Latin characters, or Arabic text
transcribed in Hangul characters? Unicode is uninterested in what the
content of the text is; it encodes characters, not text. It is up to
users and implementers to decide what texts those characters can
represent.
So, absolutely, it is in the scope of Unicode.
> Children invent all kinds of alphabets in which to write their names;
> will all of these have to be encoded in Unicode?
The scenario did not involve children inventing an alphabet; it involved
students making a history presentation that touched on, among other
things, the Phoenician script.
> Well, if anyone has another scenario to propose, let's see it.
Fine.
Scenario (undesirable):
The editor of a UCLA journal on ancient Indo-European linguistics
receives submissions from numerous sources for publication in the
journal. Certain formatting requirements are specified for submissions
wrt the kinds of document elements used, paragraph formatting and
overall page layout. As is often the case in similar situations,
however, no constraints are placed on fonts used. Submissions are
accepted in various file formats, including Word DOC, RTF, and certain
XML or SGML languages. Once approved for publication, submissions will
be converted to one common file format and will be typeset using one
collection of fonts.
Submissions regularly contain characters in a variety of scripts /
writing systems: Latin, Cyrillic, Old Italic, IPA, various Latin
transliteration schemes, etc. Very often, the submitted text is
formatted using fonts that the editor does not herself have. In some
cases, the submission is formatted but not consistently marked up; in
other cases, the text is marked up to identify document elements but not
formatted at all. Markup does not always identify the language of text:
in some cases the language may be unknown, or the text is an analytical
reconstruction and not in any actual, known language, and authors cannot
be assumed to know how to mark up language with their applications.
With some regularity, a submission makes reference to Phoenician
characters or includes examples in Phoenician-script text. Also, on rare
occasion, submissions will cite Hebrew-language words, which are
intended to be presented with square Hebrew glyphs. The Phoenician
characters have exactly the same encoded representation as the square
Hebrew text. As a result, fallback or default formatting will cause all
such text to appear with square Hebrew glyphs, and therefore before the
editor can provide a draft to her panel of reviewers, she must go
through a laborious process to carefully read each submission to ensure
that what she provides to reviewers has the intended presentation as
either Phoenician glyphs or square Hebrew glyphs. This adds to her
workload in reviewing all submissions, and especially so for any
submissions that contain either Hebrew or Phoenician. On some occasions,
this leads to costly delays in publication. On some occasions, incorrect
glyphs are not spotted in proofs until after publication, requiring
additional work to add corrigenda to subsequent editions, and detracting
from the perceived quality of the journal as a whole.
Alternate scenario (desirable):
The editor receives submissions as described above. Because Phoenician
script and Hebrew script are encoded distinctly, there is never any
concern as to how text provided to reviewers will appear. She saves many
hours of work both in preparing submissions for reviewers and in final
typesetting. Embarrassing errors and the need to publish corrigenda are
significantly reduced.
Now tell me that's an unrealistic or trivial scenario.
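To make the difference between the two scenarios concrete, here is a minimal sketch of the check the editor's tools could run if the two scripts are encoded distinctly. It assumes Phoenician has its own code points (later versions of the standard place it at U+10900..U+1091F, and Python's unicodedata module knows those names); the helper names script_of and scripts_in are made up for illustration. With a unified encoding, no such check is possible, because the code points carry no information about which glyphs were intended.

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label based on the character's Unicode name.

    Hypothetical helper: it only distinguishes the two scripts in the scenario.
    """
    name = unicodedata.name(ch, "")
    if name.startswith("PHOENICIAN"):
        return "Phoenician"
    if name.startswith("HEBREW"):
        return "Hebrew"
    return "Other"

def scripts_in(text):
    """Return the set of scripts (Phoenician and/or Hebrew) found in text."""
    return {script_of(ch) for ch in text} - {"Other"}

# Assumption: distinct encoding, with Phoenician at U+10900 onward.
phoenician_sample = "\U00010900\U00010901"  # PHOENICIAN LETTER ALF, BET
hebrew_sample = "\u05D0\u05D1"              # HEBREW LETTER ALEF, BET

print(scripts_in(phoenician_sample))
print(scripts_in(hebrew_sample))
```

A typesetting pipeline could use a check like this to pick a default font per run of text, so that unformatted or inconsistently marked-up submissions still display with the intended glyphs.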
> Well, I have used Shoebox and Toolbox. I have also used your company's
> products, which at least allow me to add a script name field to my
> database but don't allow me to tailor collations. But I was thinking in
> terms of tailored collation weights for the Unicode collation algorithm.
> These are much more complex than setting up a new language configuration
> for Shoebox or Toolbox.
I suspect few Semitic paleographers are using MS database products.
Also, from what I have seen, it is not at all uncommon for researchers
in academia to have access to technology-support staff, including
programmers. Not necessarily in every case, but every time I've
interacted with someone associated with a university on such issues,
they have had access to some kind of support of this type. (That's one
of the things their funding requests are for.) Moreover, unless I'm
mistaken, the collation weights in this case would *not* be difficult to
deal with, and in addition there have already been offers to do that
work.
Moreover, the Semitic paleographers have indicated that their preference
is to encode all of their text using the square Hebrew characters, so
the character-folding issue is at best an occasional concern that many
will never actually have to deal with.
I'm still completely unconvinced that the need for character folding is
a significant impediment.
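As a rough illustration of how small the folding problem is, the following sketch maps each of the 22 Phoenician letters onto the corresponding non-final square Hebrew letter and uses that as a sort and comparison key. It assumes Phoenician is encoded at U+10900..U+10915 in traditional letter order, as in later versions of the standard. This is a simple pre-processing fold, not a tailoring of the Unicode collation algorithm itself, and the names FOLD and fold are made up for illustration.

```python
# Assumption: Phoenician letters occupy U+10900..U+10915 in traditional order
# (ALF, BET, GAML, ... TAU).
PHOENICIAN_FIRST = 0x10900

# The 22 non-final Hebrew letters, ALEF..TAV, skipping the five final forms.
HEBREW_NONFINAL = [
    0x05D0, 0x05D1, 0x05D2, 0x05D3, 0x05D4, 0x05D5, 0x05D6, 0x05D7,
    0x05D8, 0x05D9, 0x05DB, 0x05DC, 0x05DE, 0x05E0, 0x05E1, 0x05E2,
    0x05E4, 0x05E6, 0x05E7, 0x05E8, 0x05E9, 0x05EA,
]

# Translation table: Phoenician code point -> Hebrew code point.
FOLD = {PHOENICIAN_FIRST + i: cp for i, cp in enumerate(HEBREW_NONFINAL)}

def fold(text):
    """Replace Phoenician letters with their Hebrew counterparts.

    All other characters, including Hebrew final forms, pass through unchanged.
    """
    return text.translate(FOLD)

words = ["\U00010901\U00010900", "\u05D0\u05D1"]  # Phoenician BET ALF; Hebrew ALEF BET
# Sorting on the folded key interleaves the two scripts by letter order.
print(sorted(words, key=fold))
```

Searching works the same way: fold both the query and the text before comparing, and a match is found regardless of which of the two encodings was used.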
Peter
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division