Re: Surrogate points

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 03 2005 - 17:33:11 CST

    O.k. I have been holding my tongue, but this particular
    tower of blather in the colloquy between Philippe Verdy
    and Hans Aberg requires some corrections.

    Hans Aberg said:

    > [Off the list.]
    > >> The problem with Unicode is that it seems to attempt to do too much
    > >> in this category. It should focus on the character model, and leave the
    > >> other things to other work and standards. That would remove a great
    > >> deal of controversy around it.
    > >

    and Philippe Verdy responded:

    > >At least on this, I can agree with you.
    > >I think that Unicode's attempt to cover too many things in the same
    > >standard will fail in a more or less long term. The Unicode standard
    > >should be split into separate working domains.

    Hans and Philippe are, of course, entitled to their opinions about
    what "Unicode" should do, but this repartee seems to reflect an
    ignorance about what standardization is actually going on.

    "Unicode" is not a single standard, nor does "Unicode" *do* things.

    The *Unicode Consortium* is an SDO (Standards Development Organization).
    At current count it maintains and develops six different standards.
    The Unicode Standard is the largest and most important of those,
    of course, but it is only one. The Unicode Consortium is also
    currently responsible for two further activities: it maintains the
    Common Locale Data Repository (CLDR), whose data format is defined
    by the LDML standard (see: http://www.unicode.org/cldr/ ), and it
    serves as the registration authority for ISO 15924, Script Codes
    (see: http://www.unicode.org/iso15924/ ).

    From the Bylaws of Unicode, Inc., the formal, incorporated entity
    which runs the Unicode Consortium:

    "Section 1. Purpose
     This Corporation's purpose shall be to extend, maintain and
     promote the Unicode Standard and other standards related to
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     internationalization technologies."
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     
    That wording has the approval of the Board of Directors, officers,
    and corporate members of the Unicode Consortium.

    Now, as I said, Hans and Philippe are entitled to their opinions,
    but it would be desirable if those opinions were at least grounded
    on an understanding of the actual activities of the Unicode Consortium.

    Hans responded:

    > I think this is what is causing a lot of heat on the list: a combination
    > of trying to do too much and the requirement of no change.

    Heat on the list is caused by behavior of participants on the
    list, not by the scope of the Unicode Standard, nor by stability
    requirements imposed on that standard.

    > Then, a series of issues has been resolved by compromises, which are not
    > generally useful.

    This statement is stunningly at odds with the assessment of the
    committees involved in the maintenance of the Unicode Standard
    and ISO/IEC 10646. Compromise in the face of conflicting requirements
    is the soul of consensus in the development of the standards, and
    in the case of the Unicode Standard has been responsible for the
    wide adoption and general success of the standard as an underpinning
    for text processing in worldwide IT contexts.

    > When complaints naturally arrive, one has locked one's
    > position in view of the non-change requirement, which produces a rather
    > defensive, aggressive stance on the list.

    A standard, once widely implemented, needs to be defended against
    destabilizing changes. That should be taken as a given.

    The aggressive nature of some of the responses on the threads
    initiated by Hans isn't caused by the stability requirements
    of the standard; it is a response to the arrogantly challenging
    nature of the proposals being made.

    Philippe said:

    > >Collation for example is not bound to the character encoding itself. I
    > >think it should go out of Unicode's standard,

    The UCA (Unicode Technical Standard #10) is not part of the Unicode
    Standard, but is a separate standard in its own right.

    > >and be worked by another
    > >group, without being bound to Unicode character properties.

    This is a matter of opinion, of course, but I disagree with
    Philippe's assessment.

    Hans responded:

    > Other issues that should not be in Unicode are file encodings and
    > process handling.

    And they aren't dealt with by the Unicode Standard, contrary to
    Hans' implication here.

    > Also the endianness of the representation of numbers in languages
    > seems to be wrong.

    May "seem to be wrong" to Hans, but is not.

    > So there seems to be a range of issues that should be
    > lopped off current Unicode.

    Philippe said:

    > >I think that ISO would be a better place to work on collation, because
    > >it's not a character encoding issue, but a more general issue about
    > >handling linguistic data and semantics.

    It is certainly the case that collation is not a character encoding
    issue per se.

    But the presupposition here is that "ISO" is better equipped to handle
    matters of linguistic data and semantics. "ISO" doesn't handle anything
    of the sort. It is a Standards Development Organization that pushes
    all matters of technical expertise down into appropriate working
    groups. The ISO subcommittee that now has formal responsibility
    for International String Ordering (ISO/IEC 14651) is SC2, the same
    subcommittee that deals with -- surprise -- ISO/IEC 10646. And the
    expertise regarding collation in that subcommittee resides in
    Working Group 2, the working group that does all the work on 10646.

    That alignment of activities, parallel on the ISO side for 10646 and
    14651 and on the Unicode Consortium side for the Unicode Standard
    and the UCA, should hardly be surprising, because the main issue for
    both ISO/IEC 14651 and the UCA is the appropriate extension of the
    main tables (the Common Template Table for 14651 and the Default
    Unicode Collation Element Table, DUCET, for the UCA) as characters
    are added to the repertoire of 10646 and the Unicode Standard.

    > >A unique solution for collation
    > >will not work for all languages.

    This is true. It is also well-understood and accounted for by the
    developers and maintainers of both UCA and ISO/IEC 14651.

    > >I think that a more open standard that
    > >will be based on various profiles (including Unicode's UCA as one of
    > >those profiles) with more precise but more open definitions bound in
    > >priority to linguistic issues would be welcome.

    But this seems to reflect a lack of understanding of the nature of
    tailoring for particular collations, both in UCA and ISO/IEC 14651.
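
    To make concrete what "tailoring" means here: a tailoring is a
    per-language adjustment applied on top of a common default table,
    not a free-standing profile. As a minimal sketch (not part of the
    UCA specification itself), the JDK's java.text.Collator API follows
    the same tailorable design; asking for different locales yields
    differently tailored versions of one underlying ordering, assuming
    a typical JDK with its bundled locale data:

        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class TailoringDemo {
            public static void main(String[] args) {
                String[] names = { "Zorro", "Öberg" };

                // German tailoring: ö sorts with o, so Öberg < Zorro.
                String[] de = names.clone();
                Arrays.sort(de, Collator.getInstance(Locale.GERMAN));
                System.out.println("de: " + Arrays.toString(de));

                // Swedish tailoring: ö sorts after z, so Zorro < Öberg.
                String[] sv = names.clone();
                Arrays.sort(sv, Collator.getInstance(new Locale("sv", "SE")));
                System.out.println("sv: " + Arrays.toString(sv));
            }
        }

    Both results come from one shared framework plus a locale-specific
    delta, which is exactly how the UCA and ISO/IEC 14651 accommodate
    divergent national orderings.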

    > It has been discussed a bit in the LaTeX list, and it is clear that these
    > language and region related issues are very complex. Other issues are how to
    > represent dates in various localities, where the same language, but a
    > different locality, will use different conventions. For example, Australian,
    > UK, and US conventions. Then people may make a pick between different
    > conventions in their text. So if Unicode sticks its nose into those waters,
    > one is likely to get in over one's head.

    Please see:

    http://www.unicode.org/cldr/

    On the contrary, I'd say it is Philippe and Hans who are swimming in
    the deep end without a life preserver here.

    Philippe continued:

    > >Maybe Unicode.org
    > >could become the registration agency for those profiles (for example if the
    > >registry is made part of CLDR). But UCA and Unicode's DUCET are
    > >unusable as such.

    This is demonstrably false. It is currently being used as the basis
    of shipping software in major implementations.

    > > New collation algorithms are needed that will make
    > >things simpler and more efficient to cover large sets of languages for
    > >which the algorithm is poor (imprecise or ambiguous) and inefficient
    > >(slow, complex to implement).

    The UCA is certainly complex to implement, but then so is every
    similar approach to multi-level, linguistically appropriate sortkey
    weighting that preceded it. A proper implementation is *NOT* slow.
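
    The usual technique is to compute each string's multi-level sort key
    once, up front, so that the actual sort compares flat byte sequences
    rather than re-running the weighting algorithm on every comparison.
    A minimal sketch of that pattern, using the JDK's CollationKey as a
    stand-in for UCA sort keys:

        import java.text.CollationKey;
        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class SortKeyDemo {
            public static void main(String[] args) {
                Collator c = Collator.getInstance(Locale.FRENCH);
                c.setStrength(Collator.TERTIARY); // weight all three levels

                String[] words = { "cote", "coté", "côte", "côté" };

                // Compute each multi-level sort key exactly once...
                CollationKey[] keys = new CollationKey[words.length];
                for (int i = 0; i < words.length; i++) {
                    keys[i] = c.getCollationKey(words[i]);
                }

                // ...so the O(n log n) comparisons are cheap byte compares.
                Arrays.sort(keys);
                for (CollationKey k : keys) {
                    System.out.println(k.getSourceString());
                }
            }
        }

    With the keys precomputed, sorting a large corpus costs little more
    than sorting plain byte strings, which is why properly engineered
    implementations are not slow.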

    > >On the contrary, working groups on collation categorized by linguistic
    > >domains could be created at ISO, to cover several groups of languages,
    > >based only on the ISO10646 character repertoire, and with their own
    > >sets of character properties independent of Unicode, these properties
    > >becoming significant only for the covered languages.

    And that is a recipe for chaos and non-interoperability.

    It is one thing to propose that groups with regional expertise
    in sorting practice for one language or group of languages develop
    specifications regarding how that language or group of languages
    should be sorted. It is an entirely different matter to be
    proposing that they disanchor their specifications from
    the specification of Unicode character properties.

    > >another example: the set of normative properties in Unicode needed
    > >before characters can be standardized is becoming too large. This is a
    > >critical issue, because it is slowing the standardization process also
    > >at ISO10646.

    False.

    > >So Unicode tends to assign normative properties too early, properties
    > >that will become unusable later and that will require application
    > >developers to use their own non-compliant solutions (examples found and
    > >needed today with Biblical Hebrew

    This is an invalid generalization from a set of known issues regarding
    fixed position combining classes for Hebrew points -- known issues
    that have been chewed over ad nauseam on the Hebrew list here, and
    which have yielded to solutions as a result of the very process
    of compromise apparently decried by Hans above.

    > >and Phoenician).

    And that is an utter non sequitur, because the normative properties
    of Phoenician characters do not have, and never have had, any
    connection to the controversy over the encoding of the script.

    Hans continued:

    > That seems to be the problem with Unicode: by wanting to do too much, one
    > will provide norms that merely will be disobeyed. This is a general
    > problem with standards, not only Unicode. Therefore, quite a few
    > standards will never in effect be used.

    Bypassing the faulty logic here, I would point out that the Unicode
    Standard *is* used and its specifications *are* followed, rather well,
    in fact, by many vendors.

    Philippe continued:
     
    > >Splitting the standard would help abandoning some parts of it in favor
    > >of other ones. So applications could be still conforming to ISO10646
    > >and a reduced Unicode standard, but could adopt other standards for all
    > >other domains not covered by the core Unicode standard.

    This is utter pie in the sky. Not only do I see no motivation for this,
    but there is also nobody waiting in the wings to take on the task.

    > >doing this
    > >should not require reencoding texts. But it could really accelerate the
    > >production of documents with still unencoded scripts or characters.

    And this is nonsense. It would only increase uncertainty and
    confusion, and would *slow down* the production of documents for
    still unencoded scripts or characters.

    Experts representing living minority scripts not yet in the Unicode
    Standard or not yet fully covered by the Unicode Standard "get it"
    now. In the past year there have been excellent examples of
    productive collaboration that have sped up the encoding of
    Tifinagh, N'Ko (for Mandekan speakers in Guinea and neighboring
    countries), extensions for Ethiopic, and Balinese. Many others are
    in the works.

    It is a shame that Philippe doesn't "get it" that splitting such
    efforts off from the general process of extension of the Unicode
    Standard would have the net effect of isolating and disenfranchising
    such groups, rather than enabling them in the IT world.

    Hans said:

    > I think that Unicode should focus on providing the character set, the
    > character numbering, and in some cases, rules for combined characters.

    Hans is entitled to think that, but he is wrong. The accumulated
    engineering expertise of the software engineers working on
    the standard and its implementation over the last 15 years is,
    in fact, what has driven the Unicode Consortium to incorporate
    all kinds of semantic information beyond mere character
    encoding repertoire into the Unicode Standard. Hans' position
    is approximately where the Unicode founders were in 1989, in
    their thinking about what the task was for the Unicode Standard.
    He has a little catching up to do here.

    > If
    > the encoding issue had been handled correctly, it would have been
    > completely independent of these issues.

    It isn't clear what Hans means by that statement, but whatever
    it is, I suspect he's wrong on that, too. ;-)

    Philippe continued:

    > >Finally, Unicode does not cover a domain which is very important for
    > >the creation of digital text corpora: orthographies (and their
    > >associated conventions).

    That is true.

    > >This is a place where nothing can be
    > >standardized without breaking some Unicode conformance level,

    But that statement is clearly false.

    > >even
    > >though standard orthographies could be much more easily developed based
    > >only on the ISO10646 repertoire definition.

    And so is that one.

    Hans responded:
     
    > This clearly belongs to the question of lexing and parsing a sequence of
    > characters. Unicode should stay out of that as much as possible, I think.

    If by "this", Hans is referring to the development of orthographies
    as mentioned by Philippe, then that has little to do with lexing
    and parsing issues, per se. But I agree that the Unicode Standard
    should not (and does not) specify anything regarding orthographies.
     
    > >So the good question is: are all those character properties in Unicode
    > >needed or pertinent to cover all languages of the world? Unicode has no
    > >right and no expertise in linguistic issues; only in encoding issues.

    First of all, Unicode character properties have no direct bearing
    on linguistic issues, anyway.

    Second, the developers of the Unicode Standard *do* have expertise
    in linguistic issues, contrary to Philippe's apparent claim. The
    Unicode Standard itself is not, of course, directed at standardizing
    any linguistic or orthographic issue.
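
    For readers unsure what "character properties" are: they are
    mechanical, per-code-point facts (General_Category, directionality,
    and so on) consumed by text-processing software, not prescriptions
    about any language. A minimal sketch, querying the values that the
    JDK's Character class exposes from the Unicode Character Database:

        public class PropertyDemo {
            public static void main(String[] args) {
                // U+0041 'A', U+0663 ARABIC-INDIC DIGIT THREE, U+00E4 'ä'
                int[] cps = { 0x0041, 0x0663, 0x00E4 };
                for (int cp : cps) {
                    System.out.printf("U+%04X letter=%b digit=%b gc=%d dir=%d%n",
                        cp, Character.isLetter(cp), Character.isDigit(cp),
                        Character.getType(cp), Character.getDirectionality(cp));
                }
            }
        }

    Nothing in that output says anything about how a language spells,
    sorts, or hyphenates; the properties simply let software classify
    code points uniformly.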

    Hans started freewheeling:
     
    > I can think of more than one character set, going in different directions
    > relative to Unicode: One that is optimized by having as few characters as
    > possible. Another, going the opposite direction, might be more ample in the
    > set of characters, perhaps having one for each language-locality combination
    > that is unique. I do not think there is one set that is the right one; it
    > depends on what design objectives one has.

    Well, yes, I can think of things like that, too. And the point is?

    What we have for a *universal* character encoding standard is the
    Unicode Standard (and the synchronized standard ISO/IEC 10646).
    That is the result of the combined efforts of literally hundreds of
    contributors over a 15-year time span.

    Perhaps Hans would be so kind as to gift us with particulars as to
    by whom and how his imagined alternatives are to be realized and then
    rolled out into real implementations?
     
    --Ken


