From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 28 2004 - 19:41:27 CDT
Peter Constable responded to Peter Kirk:
> > From: Peter Kirk [mailto:peterkirk@qaya.org]
> > Sent: Friday, May 28, 2004 1:40 PM
>
>
> > Well, I understood the semantic content of a text to be the meaning of
> > the words...
[Kirk continuing, to provide more context...
> > , not the indication of which script they are written in. ...
> > But a Hebrew or Moabite
> > word has the same meaning whether it is written with Hebrew or
> > Phoenician glyphs. That was my argument. Now you may wish to argue that
> > plain text is intended to convey more information than that, also the
> > information about what script it is written in, but again that begs the
> > question about the what is a script distinction. ]
Constable responded:
> Unicode encodes characters, not languages, not morphemes, not senses of
> words. The character semantics of "Sally" and of "Sally" transliterated
> into Hebrew are not the same.
This also struck me as a major misunderstanding in Peter Kirk's
note, which may underlie some of the problem this thread has
been having in coming to *any* conclusions whatsoever.
Take a look at page 343 of the Unicode Standard, which shows a
line from the Codex Argenteus in Gothic script. That line is
then *transliterated* into the Latin script, and a translation
is also given. Taking just the last word, we have the
Gothic:
<10340, 10342, 10330, 1033F, 10346, 10334, 10344, 10330, 1033F>
PAIRTHRA, RAIDA, AHSA, URUS, FAIHU, AIHVUS, TEIWS, AHSA, URUS
and the Latin:
<0070, 0072, 0061, 0075, 0066, 0065, 0074, 0061, 0075>
p r a u f e t a u
Now *whichever* way this is represented, this is still the *same*
Gothic *language* word, and it means the same thing: prophet.
However, the *Unicode* sense of the semantics of these strings
is different. Unicode semantics refers to the identity of the
encoded characters. The semantics of U+10340 is the 17th letter of
the Gothic alphabet (of the Gothic script), named PAIRTHRA.
The semantics of U+0070 is the 16th letter of the Latin (and
English) alphabet (of the Latin script), named 'pee' (or P).
The Unicode semantics of those two strings is distinct, regardless
of the fact that both represent the same word in the same
language.
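
To make the distinction concrete, here is a minimal sketch in Python
that builds both strings from the code points listed above and prints
the character name of each element. It assumes only the standard
`unicodedata` module; the helper name `describe` is made up for this
illustration.

```python
import unicodedata

# Code points copied from the two sequences quoted above.
GOTHIC = [0x10340, 0x10342, 0x10330, 0x1033F, 0x10346, 0x10334,
          0x10344, 0x10330, 0x1033F]
LATIN = [0x0070, 0x0072, 0x0061, 0x0075, 0x0066, 0x0065,
         0x0074, 0x0061, 0x0075]

gothic_word = "".join(chr(cp) for cp in GOTHIC)
latin_word = "".join(chr(cp) for cp in LATIN)


def describe(text):
    """Return (code point, character name) pairs, one per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, word in (("Gothic", gothic_word), ("Latin", latin_word)):
    print(label)
    for code_point, name in describe(word):
        print(f"  {code_point}  {name}")

# Same word of the Gothic language, but different encoded characters,
# so the two strings do not compare equal.
print(gothic_word == latin_word)
```

The final comparison is a plain code point comparison; it says nothing
about the linguistic identity of the word.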
Conformance to the Unicode Standard requires that processes
respect the (Unicode) semantics of such strings. That means
that if you are handed <10340, 10342, 10330, 1033F, ...> you
recognize that this is a sequence of characters in the Gothic
script as encoded in the standard -- not Devanagari or
Hangul, for example, or OCR symbols. If handed <0070, 0072, 0061, ...>
you must recognize that this is a sequence of characters in the
Latin script as encoded in the standard -- not Devanagari or
Hangul, or OCR symbols, or, for that matter, Gothic.
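
As a rough illustration of a process recognizing which script a
sequence belongs to, the sketch below guesses the script from the
first word of each character's name. This is only a crude proxy and an
assumption of this example: Python's standard library does not expose
the Unicode Script property, and the function names `script_guess` and
`scripts_in` are hypothetical.

```python
import unicodedata


def script_guess(ch):
    """Crude guess: the first word of the Unicode character name.

    This is a heuristic, not the Script property. unicodedata.name()
    raises ValueError for characters that have no name.
    """
    return unicodedata.name(ch).split()[0]


def scripts_in(text):
    """Return the set of guessed script labels used in text."""
    return {script_guess(ch) for ch in text}


gothic = "\U00010340\U00010342\U00010330\U0001033F"
latin = "prau"

print(scripts_in(gothic))
print(scripts_in(latin))
```

A real implementation would consult the Script property from the
Unicode Character Database rather than parse character names.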
However, conformance to the Unicode Standard does not prevent
a process which is *aware* of the meaning of Gothic text,
either in some relatively simple and straightforward way
(e.g. a transliterator) or in some deep and profound way
(e.g. a machine translator) from determining that there is
an *equivalence* to be made here -- in the first instance a
letter-by-letter equivalence between the two scripts, and
in the second instance a lexical equivalence between the
words represented and their meanings.
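
The simple case, a letter-by-letter transliterator, can be sketched as
follows. The table covers only the seven Gothic letters that occur in
the example word, with the pairings read off the two sequences quoted
earlier; it is not a complete or authoritative transliteration scheme,
and the names `GOTHIC_TO_LATIN` and `transliterate` are made up for
this illustration.

```python
# Letter-by-letter equivalences taken from the example word only.
GOTHIC_TO_LATIN = {
    "\U00010340": "p",  # GOTHIC LETTER PAIRTHRA
    "\U00010342": "r",  # GOTHIC LETTER RAIDA
    "\U00010330": "a",  # GOTHIC LETTER AHSA
    "\U0001033F": "u",  # GOTHIC LETTER URUS
    "\U00010346": "f",  # GOTHIC LETTER FAIHU
    "\U00010334": "e",  # GOTHIC LETTER AIHVUS
    "\U00010344": "t",  # GOTHIC LETTER TEIWS
}


def transliterate(text):
    """Map each known Gothic letter to Latin; pass anything else through."""
    return "".join(GOTHIC_TO_LATIN.get(ch, ch) for ch in text)


gothic_word = "".join(
    chr(cp) for cp in (0x10340, 0x10342, 0x10330, 0x1033F, 0x10346,
                       0x10334, 0x10344, 0x10330, 0x1033F)
)

print(transliterate(gothic_word))  # praufetau
```

The equivalence lives in the transliterator's table, not in the
encoding: the input and output remain distinct character sequences.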
Now I suspect that the Semitic palaeographers in this discussion
are going to raise their eyebrows and assert that this whole
concept of "semantics" for the characters is tautologous
and meaningless. In essence an encoded character has a
distinct semantics in the Unicode Standard only and precisely
*because* it is encoded separately as a character. And the
exceptions are asserted to be exceptions by specification
of *canonical equivalence*, which equates the semantics of
either distinct sequences or those few instances where the
committee has effectively determined that the *same* character
was encoded more than once in the standard (for various
historical reasons).
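
Canonical equivalence can be checked mechanically by comparing
normalized forms. The sketch below uses `unicodedata.normalize` from
the Python standard library; the examples (e-acute, and ANGSTROM SIGN
versus A WITH RING ABOVE) are my own choices to illustrate the two
kinds of exception mentioned above, and the helper name
`canonically_equivalent` is hypothetical.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Distinct sequences with the same semantics: precomposed e-acute
# versus e followed by COMBINING ACUTE ACCENT.
print(canonically_equivalent("\u00E9", "e\u0301"))  # True

# In effect the same character encoded twice: ANGSTROM SIGN versus
# LATIN CAPITAL LETTER A WITH RING ABOVE.
print(canonically_equivalent("\u212B", "\u00C5"))  # True

# GOTHIC LETTER PAIRTHRA and LATIN SMALL LETTER P are separately
# encoded and not canonically equivalent.
print(canonically_equivalent("\U00010340", "p"))  # False
```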
Nevertheless, that is the way the standard works. It is, in
fact, the way *all* character encoding standards work -- the
nature of the issue is simply more profoundly obvious for
the Unicode Standard because of its intended universal
scope, which means it dabbles in dozens of scripts that no
other character encoding standard has ever attempted to
come to grips with.
Now the architectural issue for the encoding of the Gothic
script in the Unicode Standard is very closely analogous
to the situation that bears on the question of the encoding
of the Phoenician (~ Old Canaanite) script.
The *need* and prospective benefits for encoding Gothic as
a script distinct from Latin are roughly parallel to those
suggested for Phoenician. The prospective costs for scholars
involved in the study of Gothic text are roughly parallel
to those raised by the Semiticists: the need to fold any
Gothic text to the more usual Latin transliterations,
when encountered or when searching.
If this parallel is not apparent to people, then I submit
that you may not really understand the Unicode Standard,
its intent, or how the committees approach their encoding
tasks.
And no matter how many times Peter Kirk begs the question of
what is a script distinction, what it comes down to in
the Unicode Standard is that a script distinction is a
distinct encoding of a script, neither more nor less.
It does not correlate directly to a graphologist's or
palaeographer's definition (if they have one) of what
a script is, nor can it be defined, a priori, axiomatically.
It comes down to decisions about potential usefulness of
separate encoding of certain candidate collections of
related writing symbols, based on historical identity,
technical considerations of how various desired processes
may interact with the encoding choices, and input from
(sometimes competing) interested parties who may or may
not want a separate encoding for some entity, based
on the way they have traditionally interacted with
materials of relevance.
--Ken