From: Dean Snyder (dean.snyder@jhu.edu)
Date: Wed May 25 2005 - 12:51:12 CDT
Gregg Reynolds wrote at 3:01 AM on Tuesday, May 24, 2005:
>Well I wouldn't argue against the utility of such an encoding; but
>unfortunately the "transliteration is lossy" argument works against you,
>for a very simple reason:
>
>*computational models of "characters" encode no "glyphic information"*
>
>None. Nada. Zipzilchzero. x0041 encodes Latin upper case A; it encodes
>an identity; it does not encode "glyphic information". Not even a set
>of glyphs. It's a theoretical impossibility. (btw Unicode has always
>been a bit confused about this.)
>
>And it's fairly easy to see this. There is no rule you can find that
>will tell you, for any given image, if it is a member of the set of all
>Latin upper case A glyphs. Pretty much any blob of ink can be construed
>as "A" in the right context. It's also impossible to enumerate all "A"
>glyphs.
>
>(Idea for a contest: slap a blob of ink in a random pattern in an
>em-square; a sufficiently creative typeface designer will be able to
>design a latin font in which the blob will be recognizably "A". Free
>beer for a week to the best design.)
>
>So even if you encode your ancient scripts, you are not protected
>against the kind of lossiness you want to avoid. There's always a font
>and a rendering logic involved. You're lost as soon as you lay finger
>to keyboard and your idea of a glyph is transl(iter)ated into an
>integer. To guarantee correct decoding of a message in the way you
>(seem to) want, you would have to transmit specific glyph images along
>with the encoded message; in which case there's not much point of
>designing an encoding.
>
>Take a look at Douglas Hofstadter's essays on Metafont in "Metamagical
>Themas" for some fascinating discussion of such stuff.
This is all typical, sound-good, philosophical mumbo-jumbo originating
from wrong-headed escapes into irreality.
The word "abstract", as used in the phrase "abstract encoded
characters", does not mean arbitrary, random, chaotic - your blobs of
ink. If that were true, your email would be unintelligible.
No, in Unicode an abstract character is an association of a unique code
point with a unique name, a set of properties, and a unique-within-its-
sub-script representative glyph; it's a sort of contract, or gentleman's
agreement, that makes possible the efficient and intelligible
interchange of encoded text. As such, each character (ignoring legacy
stuff) represents a SEMANTIC and GLYPHIC contrastive unit within its
script or sub-script. (I'm aware, of course, of edge cases like one and
el, zero and O, trema and umlaut, where context is used for
disambiguation. But these are extremely rare within a given script or
sub-script.) I challenge anyone, for example, to show us ANY Arabic font
that does not have exactly the same basic shape for "r" and
"z" (other than, of course, those playful fonts that are specifically
designed to mimic documents composed by cutting out printed letters from
different fonts). Such glyphic information is lost in transliteration,
but is retained for encoded characters in 99.99% of all existing fonts.
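To make that "association" concrete, here is a minimal sketch using
Python's standard unicodedata module (the three characters queried are
just illustrative picks, including the Arabic "r" and "z" mentioned
above). The code point, name, and properties travel with the encoded
character; the representative glyph is only ever realized by a font:

import unicodedata

# The information Unicode binds to an abstract character: code point,
# unique name, and character properties. The representative glyph lives
# in the code charts and is supplied concretely by whatever font
# renders the character.
for ch in ("A", "\u0631", "\u0632"):  # LATIN CAPITAL LETTER A, ARABIC LETTER REH / ZAIN
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),           # unique character name
        unicodedata.category(ch),       # general category (e.g. Lu, Lo)
        unicodedata.bidirectional(ch),  # bidirectional class (e.g. L, AL)
    )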
An abstract character is like a genotype, with variable renderings in
fonts, its phenotypes. Phenotypes form RECOGNIZABLE, CONTRASTIVE
CLUSTERS around their genotypes. Obviously the number of possible
stylistic variants within any given phenotypical cluster is
theoretically infinite; but that does not mean that the variation is unbounded or
random. Playful, perverse, and accidental renderings of abstract
characters are the exceptions that only prove the rule - they are easily
recognized for what they are, non-phenotypical "mutations" - and they
are typically avoided. I haven't seen too many books, newspapers, or
websites published in dingbats. [Another way to look at this is that
perverse, playful, or accidental renderings of glyphs could not even be
recognized as such were it not for the existence of "core" renderings of
glyphs.]
There are whole industries, in the real world, built around the concept
of phenotypical clustering, industries involved in feature detection and
feature recognition. In the text arena it's called optical character
recognition, and it DEPENDS upon the phenotypical clustering of the
renderings of abstract characters.
Read some OCR algorithms if you insist on thinking that "There is no
rule you can find that will tell you, for any given image, if it is a
member of the set of all Latin upper case A glyphs." The operative words
here are "rule" and "all". Just because you cannot formulate the rules
doesn't mean they don't or can't exist. Even though OCR algorithms
("rules") are not as good as the human brain at recognizing characters
from glyphs, they are becoming more and more sophisticated all the time,
continually approaching the ideal of recognizing all A's. So there are
rules - they're just very complex and haven't been completely formalized
yet. [By the way, this disparity between human and computer glyph
recognition is the basis for the various human-based glyph recognition
schemes used by several online services to verify that a respondent is
indeed a human. But here, again, the exception proves the rule - the
very success of such glyph-based schemes DEPENDS on the RECOGNIZABILITY
of those glyphs as phenotypical members of the clusters associated with
their genotypes, their abstract characters.]
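For anyone who wants "phenotypical clustering" spelled out in code, here
is a toy sketch in Python with NumPy. The 3x3 "bitmaps" are invented
placeholders, and real OCR feature extraction is far more sophisticated,
but the principle is the same: each genotype is represented by the mean
of some sample glyphs (its cluster center), and an unknown glyph is
assigned to the nearest center.

import numpy as np

# Toy "rule": represent each abstract character (genotype) by the mean
# of a few sample glyph bitmaps (its phenotype cluster), then assign an
# unknown glyph to the nearest cluster center. The bitmaps are invented
# 3x3 placeholders, not real font data.
samples = {
    "I": [np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], float),
          np.array([[0, 1, 0], [0, 1, 0], [1, 1, 1]], float)],
    "T": [np.array([[1, 1, 1], [0, 1, 0], [0, 1, 0]], float),
          np.array([[1, 1, 1], [0, 1, 0], [0, 1, 1]], float)],
}
centroids = {c: np.mean(glyphs, axis=0) for c, glyphs in samples.items()}

def classify(glyph):
    """Return the character whose cluster center is nearest to the glyph."""
    return min(centroids, key=lambda c: np.linalg.norm(glyph - centroids[c]))

unknown = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 1]], float)  # a slightly deviant "I"
print(classify(unknown))  # -> "I"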
***********************************
It seems that several people here have gotten hung up on my phrase
"transliteration is lossy", and that is partially my fault. What I did
not mean to imply, of course, is that encoding is lossless; that would
be silly, and I presumed it would be self-evident to everyone. But I
should have made a more explicit statement, such as - "Transliteration
is orders of magnitude more lossy than encoding." I will say, however,
that in my original post to this thread I did make the statement (one,
by the way, that has been largely ignored) that "Encoded scripts more closely
model autograph text and therefore either enable or greatly improve the
execution of these activities (without, of course, replacing the need
for the autopsy of original texts)." And that continues to be the main
reason why I think ancient scripts should be encoded and not JUST
transliterated.
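And to spell out the lossiness in the simplest possible terms, here is
a toy sketch in Python. The two-entry table is a deliberately crude,
hypothetical scheme in which Greek omicron and omega both romanize as
"o"; scholarly transliterations are more careful, but this kind of
many-to-one collapse is exactly what I mean by "lossy":

# A many-to-one transliteration cannot be inverted: once the text is
# romanized, the original contrast between the two encoded characters
# is gone. (Hypothetical simplified scheme, for illustration only.)
translit = {"\u03bf": "o", "\u03c9": "o"}  # GREEK SMALL LETTER OMICRON / OMEGA

encoded = "\u03bf\u03c9"                   # two distinct encoded characters
romanized = "".join(translit[c] for c in encoded)

print(romanized)                                     # "oo" - the contrast is lost
print(len(set(encoded)), "->", len(set(romanized)))  # 2 distinct units -> 1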
Dean A. Snyder
Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218
office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/