From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 08 2005 - 20:22:33 CDT
Gregg said:
> And as a matter of policy I see no reason why a *standards* body
> (especially an industry standard body) should have a requirement for
> native speaker participation; after all, the (industry-defined) goal is
> to get a standard, not to make everybody happy. No doubt such
> participation is desirable, but it's quite a different thing to say it's
> required. Standards have to work in the marketplace in order to become
> standards.
Correct. But by the way, and perhaps no surprise to you, historically
both the UTC and particularly WG2 have been extraordinarily polyglot,
in terms of first languages, second languages, and languages
known by study.
>
> On the other hand, it's pretty obvious (to me at least) that
> participation of native speakers in standardization of cultural
> artifacts like written language is a Good Thing.
Yes. Oh, and for anyone following along here, an oft-overlooked section
of the standard is the Acknowledgments -- on this question in
particular for Unicode 4.0, see pp. vi-vii. Unlike ISO standards,
which, for various procedural reasons, spring unauthored and
unacknowledged from the brow of an ISO Secretariat, the Unicode
Standard has always made a serious effort to try to acknowledge
the many, many people from around the world who contributed in one
way or another to the ongoing construction of the massive edifice.
> (List: I know, I
> know, Unicode does not encode written language, it encodes
> characters/scripts/whatever. But the perception will always and
> inevitably be that it is an encoding or modeling of written language.)
This, however, is a serious stumbling block.
The Unicode Standard does not standardize languages. It does not
standardize writing systems. It does not standardize orthographies.
It does not standardize alphabets or syllabaries.
It does not standardize spelling systems. It does not standardize
letters. It does not standardize fonts. It does not standardize
formats for written materials. Heck, it does not standardize
*anything* about written language that an average well-educated user
of that language would recognize.
So no wonder people get confused. If they ask, "Well, then what *does*
it standardize?", we say encoded characters for scripts and for
sets of symbols. And then they may well come back with the moral
equivalent of "Well, those are the letters of my alphabet for my
language, and they look screwed up to me."
And so we start all over again trying to explain the basics of
character encoding and how that relates to the implementation of
writing systems on computers. And some people get it and some
people never do.
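To make the "encoded character" versus "letter shape" distinction a bit more concrete, here is a minimal sketch using Python's standard unicodedata module. It assumes the Arabic letter BEH as the example and uses the compatibility presentation forms in the U+FE8F..U+FE92 range to stand in for the four contextual shapes; the helper names describe and fold_to_encoded_character are made up for illustration. It is only meant to show that the standard encodes one character where a reader sees several shapes.

```python
import unicodedata

# The single encoded character for the letter.
BEH = "\u0628"  # ARABIC LETTER BEH

# Compatibility "presentation forms": isolated, final, initial, medial shapes.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch):
    """Return the code point and character name, e.g. 'U+0628 ARABIC LETTER BEH'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def fold_to_encoded_character(ch):
    """Map a compatibility presentation form back to its nominal character (NFKC)."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for form in PRESENTATION_FORMS:
        print(describe(form), "->", describe(fold_to_encoded_character(form)))
```

Choosing the right shape for each position in a word is left to the rendering layer (fonts and layout engines), which is one reason the encoded text can "look screwed up" when that layer is missing or broken.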
> On the fourth hand, it's also clear (to me at least) that Unicode works
> great for some linguistic communities and not so great for others. (You
> knew it was coming, and here it is: Unicode is very bad indeed for the
> RTL community in general and Arabic in particular. ;-) This gets back
> to the design principles (and the interests that drive them) of Unicode,
> which work better for some languages than others.
for some... writing systems than others.
But I think that has as much (or more) to do with the writing
systems as with Unicode principles in particular.
The plain truth is that some writing systems are much more
straightforward and simple to implement in a digital information
system than others. Arabic is particularly difficult for a number
of reasons: it is written right-to-left, which is not the predominant
order among writing systems and which happens (for obvious
reasons) not to be the default that computer systems were
originally designed for; it standardized on cursive form printing,
and has very complex and important calligraphic traditions;
it is a consonant writing system, with various layers of "dotting"
that have built up on the consonant spine over the centuries for
indicating voweling, consonant diacritics for adaptation of
the script to other languages, and multiple levels of annotation
for sacred, sung, or chanted text. And, like Latin and Cyrillic,
it has been a widespread, cosmopolitan, "empire" script, which
means it has huge variation and lots of adaptation issues as it
moved from language to language and area to area. And perhaps
not least, it is the script of an important sacred book, which
means that it is fraught with religion, as well as all the
usual cultural identity issues associated with scripts.
Furthermore, as much as it would be nice to have Arabic simply
be implemented consistently right-to-left, in any *practical*
implementation, you *must* deal with bidirectionality.
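As a rough illustration of why bidirectionality is hard to avoid, the following sketch inspects the Bidi_Class property that Python's unicodedata module exposes for each character. The sample strings and the helper names bidi_classes and has_mixed_direction are assumptions made up for this example; this is not an implementation of the Unicode Bidirectional Algorithm, only a look at its input data.

```python
import unicodedata

STRONG_LTR = {"L"}         # e.g. Latin letters
STRONG_RTL = {"R", "AL"}   # e.g. Hebrew letters (R), Arabic letters (AL)


def bidi_classes(text):
    """Return the Bidi_Class value of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_mixed_direction(text):
    """True if text contains both strong left-to-right and strong right-to-left characters."""
    classes = set(bidi_classes(text))
    return bool(classes & STRONG_LTR) and bool(classes & STRONG_RTL)


if __name__ == "__main__":
    # Latin letters, a space, three Arabic letters, a space, European digits.
    sample = "abc \u0645\u0635\u0631 123"
    for ch, cls in zip(sample, bidi_classes(sample)):
        print(f"U+{ord(ch):04X} {cls}")
    print("mixed direction:", has_mixed_direction(sample))
```

Even text with no Latin letters at all usually contains digits, which carry their own classes (EN for European digits, AN for Arabic-Indic digits) and are laid out left-to-right inside the right-to-left line.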
I realize that you think you may have a better mousetrap in
approaching the problem of encoding Arabic text than the
encoding used in the Unicode Standard --- but...
However you cut the pie, you are still faced with the
difficulties that the script presents you in dealing with
the basic information processing requirements: keyboard
input, text storage, searching, sorting, editing, layout
and rendering, and so on. The whole stack of information
processing has to function -- and has to function in the
context of existing software systems, data storage technologies,
databases, fonts, libraries, internet protocols, and on and
on ... or you haven't got any solution at all. Just ideas
and a theory.
> And then there are the pragmatic issues which you have outlined
> concisely in another message.
Yep.
For a character encoding, in particular, it has to not only
work de novo, but to have any success at all, it has to
function in transition from whatever exists a priori, and
has to have a 20-year transition strategy during which
existing data stores convert, interoperate, and don't cause
unavoidable confusion, ambiguity, and data loss.
That, by the way, is part of that proverbial high mountain
I was talking about in an earlier post.
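A small sketch of what "convert, interoperate, and don't lose data" means at the lowest level, assuming Windows-1256 (Python codec name "cp1256") as the pre-existing Arabic encoding. The helper names can_represent and round_trips are hypothetical, and real migrations involve far more than this byte-level check.

```python
def can_represent(text, codec):
    """True if every character of text has a mapping in the legacy codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def round_trips(data, codec):
    """True if legacy bytes survive decoding to Unicode and re-encoding unchanged."""
    try:
        return data.decode(codec).encode(codec) == data
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False


if __name__ == "__main__":
    arabic = "\u0645\u0631\u062d\u0628\u0627"   # an Arabic word
    legacy = arabic.encode("cp1256")            # bytes as an old data store might hold them
    print("legacy -> Unicode -> legacy intact:", round_trips(legacy, "cp1256"))
    # Going the other way is where loss appears: a CJK character has no
    # place in the legacy code page.
    print("Arabic + CJK fits legacy codec:", can_represent(arabic + "\u4e2d", "cp1256"))
```

The asymmetry is the point: legacy data can move forward into Unicode without loss, but anything that still has to flow back into the old encoding during the transition is limited to the old repertoire.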
> Personally, I think Unicode is (well, may be) of enormous historical
> significance, yet it flies almost entirely under the cultural radar, at
> least in the US. I daresay most places in the world that will
> eventually be heavily influenced by Unicode are more or less oblivious
> to it.
I agree. But then you could say the same thing about ASCII and
Latin-1 before it. They are part of the guts of information
technology, and most people are oblivious to the details, as they
are regarding nearly *all* technology of whatever sort.
> > http://linguistics.berkeley.edu/sei/
> >
>
> Thanks, very interesting. I see many of the scripts being worked on
> list one "Everson" as the contact. Who is this mysterious and
> ubiquitous "Everson", anyway? Is it one person? Sounds an awful lot
> like the fictional Cecil Adams to me:
> (http://www.straightdope.com/index.html)
There have been reliable sightings of an "Everson" in at least
9 widely separated locations around the globe just within the
last year. Our best intelligence estimate at the moment is that
this must be an organization of agents numbering at least 5 -- with
elaborate disguises -- to account for all the activity involved.
The front for this organization can be seen at:
The "Everson" heard recently posting on this very list bears
little apparent resemblance to the "Everson" I had tea with
in Xiamen, China last January. :-)
--Ken