RE: PH technical issues (was RE: Why Fraktur is irrelevant)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 28 2004 - 19:41:27 CDT


    Peter Constable responded to Peter Kirk:

    > > From: Peter Kirk [mailto:peterkirk@qaya.org]
    > > Sent: Friday, May 28, 2004 1:40 PM
    >
    >
    > > Well, I understood the semantic content of a text to be the meaning of
    > > the words...

    [Kirk continuing, to provide more context...

    > > , not the indication of which script they are written in. ...
    > > But a Hebrew or Moabite
    > > word has the same meaning whether it is written with Hebrew or
    > > Phoenician glyphs. That was my argument. Now you may wish to argue that
    > > plain text is intended to convey more information than that, also the
    > > information about what script it is written in, but again that begs the
    > > question about the what is a script distinction. ]

    Constable responded:

    > Unicode encodes characters, not languages, not morphemes, not senses of
    > words. The character semantics of "Sally" and of "Sally" transliterated
    > into Hebrew are not the same.

    This also struck me as a major misunderstanding in Peter Kirk's
    note, which may underlie some of the problem this thread has
    been having in coming to *any* conclusions whatsoever.

    Take a look at page 343 of the Unicode Standard, which shows a
    line from the Codex Argenteus in Gothic script. That line is
    then *transliterated* into the Latin script, and a translation
    is also given. Taking just the last word, we have the
    Gothic:

    <10340, 10342, 10330, 1033F, 10346, 10334, 10344, 10330, 1033F>
     PAIRTHRA, RAIDA, AHSA, URUS, FAIHU, AIHVUS, TEIWS, AHSA, URUS
     
    and the Latin:

    <0070, 0072, 0061, 0075, 0066, 0065, 0074, 0061, 0075>
       p r a u f e t a u
       
    Now *whichever* way this is represented, this is still the *same*
    Gothic *language* word, and it means the same thing: prophet.

    However, the *Unicode* sense of the semantics of these strings
    is different. Unicode semantics refers to the identity of the
    encoded characters. The semantics of U+10340 is the 17th letter of
    the Gothic alphabet (of the Gothic script), named PAIRTHRA.
    The semantics of U+0070 is the 16th letter of the Latin (and
    English) alphabet (of the Latin script), named 'pee' (or P).
    The Unicode semantics of those two strings is distinct, regardless
    of the fact that both represent the same word in the same
    language.
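
    To make the distinction concrete, here is a minimal sketch in
    Python using the standard `unicodedata` module. The code points
    are copied from the two sequences quoted above; the helper name
    `describe` is purely illustrative. It lists the identity of each
    encoded character and confirms that the two strings differ.

```python
import unicodedata

# Code points copied from the two sequences quoted above.
GOTHIC = [0x10340, 0x10342, 0x10330, 0x1033F, 0x10346,
          0x10334, 0x10344, 0x10330, 0x1033F]
LATIN = [0x0070, 0x0072, 0x0061, 0x0075, 0x0066,
         0x0065, 0x0074, 0x0061, 0x0075]

gothic_word = "".join(chr(cp) for cp in GOTHIC)
latin_word = "".join(chr(cp) for cp in LATIN)


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for label, name in describe(gothic_word) + describe(latin_word):
        print(label, name)
    # Same word in the same language, but different encoded characters.
    print(gothic_word == latin_word)
```

    The first entries come out as GOTHIC LETTER PAIRTHRA and LATIN
    SMALL LETTER P respectively, and the final comparison is False.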

    Conformance to the Unicode Standard requires that processes
    respect the (Unicode) semantics of such strings. That means
    that if you are handed <10340, 10342, 10330, 1033F, ...> you
    recognize that this is a sequence of characters in the Gothic
    script as encoded in the standard -- not Devanagari or
    Hangul, for example, or OCR symbols. If handed <0070, 0072, 0061, ...>
    you must recognize that this is a sequence of characters in the
    Latin script as encoded in the standard -- not Devanagari or
    Hangul, or OCR symbols, or, for that matter, Gothic.
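
    As a rough illustration of "recognizing" which script a sequence
    belongs to, the sketch below labels each character by the first
    word of its Unicode character name. This is an assumption-laden
    heuristic, not the Script property of the standard (Python's
    standard library does not expose that property), and the function
    names `rough_script_label` and `script_labels` are hypothetical.
    It happens to work for the examples mentioned here.

```python
import unicodedata


def rough_script_label(ch):
    """Heuristic only: first word of the character name, e.g. 'GOTHIC'.

    This is NOT the Unicode Script property; many characters
    (digits, punctuation, CJK ideographs) are not labeled sensibly.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_labels(text):
    """Return the set of rough labels found in the text."""
    return {rough_script_label(ch) for ch in text}


if __name__ == "__main__":
    print(script_labels("\U00010340\U00010342\U00010330\U0001033F"))
    print(script_labels("prau"))
```

    A production process would consult the Script property data
    published with the standard rather than character names.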

    However, conformance to the Unicode Standard does not prevent
    a process which is *aware* of the meaning of Gothic text,
    either in some relatively simple and straightforward way
    (e.g. a transliterator) or in some deep and profound way
    (e.g. a machine translator) from determining that there is
    an *equivalence* to be made here -- in the first instance a
    letter-by-letter equivalence between the two scripts, and
    in the second instance a lexical equivalence between the
    words represented and their meanings.
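
    The simple case, a letter-by-letter transliterator, can be
    sketched as follows. The mapping is derived only from the pairing
    of the two sequences quoted earlier, so it covers just the seven
    Gothic letters that occur in that one word; a real transliterator
    would need a full table and is not attempted here. The names
    `GOTHIC_TO_LATIN` and `transliterate` are illustrative.

```python
# Code points copied from the two sequences quoted earlier.
GOTHIC = [0x10340, 0x10342, 0x10330, 0x1033F, 0x10346,
          0x10334, 0x10344, 0x10330, 0x1033F]
LATIN = [0x0070, 0x0072, 0x0061, 0x0075, 0x0066,
         0x0065, 0x0074, 0x0061, 0x0075]

# Partial table: only the letters that appear in the example word.
GOTHIC_TO_LATIN = dict(zip(GOTHIC, LATIN))


def transliterate(text):
    """Map known Gothic letters to Latin; leave everything else as is."""
    return text.translate(GOTHIC_TO_LATIN)


if __name__ == "__main__":
    gothic_word = "".join(chr(cp) for cp in GOTHIC)
    print(transliterate(gothic_word))
```

    The equivalence lives in the process, not in the encoding: the
    input and output strings remain distinct character sequences.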

    Now I suspect that the Semitic palaeographers in this discussion
    are going to raise their eyebrows and assert that this whole
    concept of "semantics" for the characters is tautologous
    and meaningless. In essence, an encoded character has a
    distinct semantics in the Unicode Standard only and precisely
    *because* it is encoded separately as a character. The
    exceptions are marked as exceptions by specification
    of *canonical equivalence*, which equates the semantics
    either of distinct sequences or of those few characters that
    the committee has effectively determined to be the *same*
    character encoded more than once in the standard (for various
    historical reasons).
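
    Canonical equivalence can be checked mechanically by comparing
    normalized forms. The sketch below uses `unicodedata.normalize`
    from the Python standard library. The two positive examples
    (a precomposed letter versus a base letter plus combining mark,
    and ANGSTROM SIGN versus LATIN CAPITAL LETTER A WITH RING ABOVE)
    are chosen for illustration and do not come from this thread; the
    helper name `canonically_equivalent` is hypothetical.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if the two strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Distinct sequences with the same semantics: e-acute, two ways.
PRECOMPOSED = "\u00E9"
DECOMPOSED = "e\u0301"

# One character effectively encoded twice.
ANGSTROM_SIGN = "\u212B"
A_WITH_RING = "\u00C5"

# The Gothic and Latin spellings of the example word.
GOTHIC_WORD = "\U00010340\U00010342\U00010330\U0001033F\U00010346" \
              "\U00010334\U00010344\U00010330\U0001033F"
LATIN_WORD = "praufetau"

if __name__ == "__main__":
    print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
    print(canonically_equivalent(ANGSTROM_SIGN, A_WITH_RING))
    print(canonically_equivalent(GOTHIC_WORD, LATIN_WORD))
```

    The first two comparisons hold and the third does not: no
    canonical equivalence relates Gothic letters to Latin ones.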

    Nevertheless, that is the way the standard works. It is, in
    fact, the way *all* character encoding standards work -- the
    nature of the issue is simply more profoundly obvious for
    the Unicode Standard because of its intended universal
    scope, which means it dabbles in dozens of scripts that no
    other character encoding standard has ever attempted to
    come to grips with.

    Now the architectural issue for the encoding of the Gothic
    script in the Unicode Standard is very closely analogous
    to the situation that bears on the question of the encoding
    of the Phoenician (~ Old Canaanite) script.

    The *need* and prospective benefits for encoding Gothic as
    a script distinct from Latin are roughly parallel to those
    suggested for Phoenician. The prospective costs for scholars
    involved in the study of Gothic text are roughly parallel
    to those raised by the Semiticists: the need to fold any
    Gothic text to the more usual Latin transliterations,
    when encountered or when searching.
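
    The folding cost mentioned here can be pictured as a search that
    maps both the text and the query to a common Latin form before
    comparing. This is only a sketch under strong assumptions: the
    table covers just the seven Gothic letters of the example word
    discussed earlier, and the names `fold_for_search` and
    `folded_contains` are made up for illustration.

```python
# Partial, illustrative table: the seven letters of the example word.
GOTHIC_TO_LATIN = {
    0x10340: "p",  # PAIRTHRA
    0x10342: "r",  # RAIDA
    0x10330: "a",  # AHSA
    0x1033F: "u",  # URUS
    0x10346: "f",  # FAIHU
    0x10334: "e",  # AIHVUS
    0x10344: "t",  # TEIWS
}


def fold_for_search(text):
    """Fold known Gothic letters to Latin, then ignore case."""
    return text.translate(GOTHIC_TO_LATIN).casefold()


def folded_contains(haystack, needle):
    """True if needle occurs in haystack after folding both."""
    return fold_for_search(needle) in fold_for_search(haystack)


if __name__ == "__main__":
    gothic_word = "".join(chr(cp) for cp in
                          [0x10340, 0x10342, 0x10330, 0x1033F, 0x10346,
                           0x10334, 0x10344, 0x10330, 0x1033F])
    print(folded_contains(gothic_word, "Praufetau"))
```

    With such a fold in place, a Latin query finds Gothic-script text
    and vice versa, at the price of maintaining the folding table.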

    If this parallel is not apparent to people, then I submit
    that they may not really understand the Unicode Standard,
    its intent, or how the committees approach their encoding
    tasks.

    And no matter how many times Peter Kirk begs the question of
    what is a script distinction, what it comes down to in
    the Unicode Standard is that a script distinction is a
    distinct encoding of a script, neither more nor less.
    It does not correlate directly to a graphologist's or
    palaeographer's definition (if they have one) of what
    a script is, nor can it be defined, a priori, axiomatically.
    It comes down to decisions about potential usefulness of
    separate encoding of certain candidate collections of
    related writing symbols, based on historical identity,
    technical considerations of how various desired processes
    may interact with the encoding choices, and input from
    (sometimes competing) interested parties who may or may
    not want a separate encoding for some entity, based
    on the way they have traditionally interacted with
    materials of relevance.

    --Ken


