Re: About Encoding Theory (was: Re: Again not about Phoenician)

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Nov 09 2004 - 07:51:27 CST

Next message: Mark Davis: "Public Review Items"

Previous message: Markus Scherer: "Re: Looking for a C library that converts UTF-8 strings from their decomposed to pre-composed form"
In reply to: Kenneth Whistler: "About Encoding Theory (was: Re: Again not about Phoenician)"

On 09/11/2004 02:30, Kenneth Whistler wrote:

>Peter Kirk suggested:
>
>>I am suggesting that the best way to get the job done properly is to lay
>>the conceptual foundation properly first, instead of trying to build a
>>structure on a foundation which doesn't match...
>
>Part of the problem that I think some people are having here,
>including Peter, is that they are ascribing the wrong level
>to the Unicode Standard itself.
>
>

Maybe. But why is this? Is it because the Standard describes itself
misleadingly? Is it because it has been oversold? Is it because people
who are looking for a conceptual framework look to the text of the
Standard, and think they have found one there when in fact what they
find is something different?

For example, a professor described on this list as one of the most
famous in his field wrote that each of the proposers and supporters of a
script proposal "either does not understand Unicode or (and probably
'and') does not understand what a glyph is" (quoted on this list in May
this year). Implicitly his criticism applies even to the majority of UTC
members who accepted the proposal. Was he being unreasonable? What was
his basis for claiming to understand Unicode better than the UTC
members? I can't speak for the professor, but I would suppose that his
claim to understand Unicode is based to a large extent on his reading of
the Standard, and on explanations from others who have read it. If this
professor, a leading expert in his field, finds the Standard's text
inconsistent with the UTC's decisions, and as a result is slandering the
UTC and rejecting Unicode, doesn't this suggest that there is something wrong?

>...
>
>The Unicode Standard is *NOT* a standard for the theory
>or process of character encoding. It does not spell out
>the rules whereby character encoding committees are
>constrained in their process, nor does it lay down
>specifications that would allow anyone to follow some
>recipe in determining what "thing" is a separate script
>and what is not, nor what "entity" is an appropriate
>candidate for encoding as a character and what is not.
>
>

It does not normatively specify such things, agreed. But it does appear
to describe them, at least in outline, in its informative section
entitled "Unicode Design Principles". And these outline descriptions are
misleading. All I am asking is that the misleading text be adjusted so
that it is not misleading and is consistent with the actual practice of
the UTC. I have proposed one way to do so. You may prefer another way,
perhaps something like replacing "Characters are the abstract
representations of the smallest components of written language that have
semantic value." on p.15 by "... the smallest components of written
language which have been determined by the character encoding committees
to be usefully distinguishable." That may be too obviously ad hoc, but
at least it stops people trying to interpret "semantic value" as
something of theoretical significance.

>... Even *cataloging* the world's
>writing systems is immensely controversial -- let alone
>trying to hammer some significant set of "historical nodes"
>into a set of standardized encoded characters that can
>assist in digital representation of plain text content
>of the world's accumulated and prospective written heritage.
>
>

Indeed. But if such a standardised set is to be generally acceptable,
the controversies have to be resolved, and they should be resolved by
open discussion and diplomatic decision-making, not by imposition of one
view and accusations that those who hold other views are not "reasonable".

>Contrary to what Peter is suggesting, I think it is putting
>the cart before the horse to expect a standard theory of
>script encoding to precede the work to actually encode
>characters for the scripts of the world.
>
>

Well, a standard theory is more than what I was asking for. I was
looking for an accurate summary description of the criteria currently
being used; or failing that, at least deletion of the current inaccurate
description.

>The Unicode Standard will turn out the way it does, with
>all its limitations, warts, and blemishes, because of a
>decades-long historical process of decisions made by
>hundreds of people, often interacting under intense pressure.
>
>Future generations of scholars will study it and point out
>its errors.
>
>Future generations of programmers will continue to use it
>as a basis for information processing, and will continue
>to program around its limitations.
>
>

I agree, of course, that Unicode will not be perfect. But that is no
argument against doing the best job we can now. Future scholars will have
fewer errors to point out if present-day scholars who point out
supposed errors in proposals are listened to, and not told things
like "I can't say that I care a fig". And future programmers will have
fewer limitations to program around, at great expense, if more care is
taken to avoid defining and stabilising such limitations. Anyway, what
is the great hurry? There may be one with certain modern scripts, but I
don't see much urgency with historic scripts. Just listening more and
taking more care will help to put off the inevitable *THEN* when Unicode
has to be replaced.

>And I expect that *THEN* a better, comprehensive theory of
>script and symbol encoding for information processing will
>be developed. And some future generation of information
>technologists will rework the Unicode encoding into a new standard
>of some sort, compatible with then-existing "legacy" Unicode
>practice, but avoiding most of the garbage, errors, and
>8-bit compatibility practice that we currently have to
>live with, for hundreds of accumulated (and accumulating)
>reasons.
>
>

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Tue Nov 09 2004 - 12:53:24 CST