Re: Surrogate points

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 03 2005 - 17:33:11 CST

    O.k. I have been holding my tongue, but this particular
    tower of blather in the colloquy between Philippe Verdy
    and Hans Aberg requires some corrections.

    Hans Aberg said:

    > [Off the list.]
    > >> The problem with Unicode is that it seems to attempt to do too much
    > >> in this category. It should focus on the character model, and leave the
    > >> other things to other work and standards. That would remove a great
    > >> deal of controversy around it.
    > >

    and Philippe Verdy responded:

    > >At least on this, I can agree with you.
    > >I think that Unicode's attempt to cover too many things in the same
    > >standard will fail in a more or less long term. The Unicode standard
    > >should be split into separate working domains.

    Hans and Philippe are, of course, entitled to their opinions about
    what "Unicode" should do, but this repartee seems to reflect an
    ignorance about what standardization is actually going on.

    "Unicode" is not a single standard, nor does "Unicode" *do* things.

    The *Unicode Consortium* is an SDO (Standards Development Organization).
    At current count it maintains and develops six different standards.
    The Unicode Standard is the largest and most important of those,
    of course, but it is only one. The Unicode Consortium is also
    currently responsible for two further activities: it maintains the
    Common Locale Data Repository (CLDR), whose data format is defined
    by the LDML standard (see: http://www.unicode.org/cldr/ ), and it
    serves as the registration authority for ISO 15924, Script Codes
    (see: http://www.unicode.org/iso15924/ ).

    From the Bylaws of Unicode, Inc., the formal, incorporated entity
    which runs the Unicode Consortium:

    "Section 1. Purpose
     This Corporation's purpose shall be to extend, maintain and
     promote the Unicode Standard and other standards related to
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     internationalization technologies."
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     
    That wording has the approval of the Board of Directors, officers,
    and corporate members of the Unicode Consortium.

    Now, as I said, Hans and Philippe are entitled to their opinions,
    but it would be desirable if those opinions were at least grounded
    on an understanding of the actual activities of the Unicode Consortium.

    Hans responded:

    > I think this is what is causing a lot of heat on the list: a combination
    > of trying to do too much and the requirement of no change.

    Heat on the list is caused by behavior of participants on the
    list, not by the scope of the Unicode Standard, nor by stability
    requirements imposed on that standard.

    > Then, a series of issues has been resolved by compromises, which are not
    > generally useful.

    This statement is stunningly at odds with the assessment of the
    committees involved in the maintenance of the Unicode Standard
    and ISO/IEC 10646. Compromise in the face of conflicting requirements
    is the soul of consensus in the development of the standards, and
    in the case of the Unicode Standard has been responsible for the
    wide adoption and general success of the standard as an underpinning
    for text processing in worldwide IT contexts.

    > When complaints naturally arrive, one has locked one's
    > position in view of the non-change requirement, which produces a rather
    > defensive, aggressive stance on the list.

    A standard, once widely implemented, needs to be defended against
    destabilizing changes. That should be taken as a given.

    The aggressive nature of some of the responses on the threads
    initiated by Hans isn't caused by the stability requirements
    of the standard; it is a response to the arrogantly challenging
    nature of the proposals being made.

    Philippe said:

    > >Collation for example is not bound to the character encoding itself. I
    > >think it should go out of Unicode's standard,

    The UCA (Unicode Technical Standard #10) is not part of the Unicode
    Standard, but is a separate standard in its own right.

    > >and be worked by another
    > >group, without being bound to Unicode character properties.

    This is a matter of opinion, of course, but I disagree with
    Philippe's assessment.

    Hans responded:

    > Other issues that should not be in Unicode are file encodings and
    > process handling.

    And they aren't dealt with by the Unicode Standard, contrary to
    Hans' implication here.

    > Also the endianness of the representation of numbers in languages
    > seems to be wrong.

    May "seem to be wrong" to Hans, but is not.

    > So there seems to be a range of issues that should be
    > lopped off current Unicode.

    Philippe said:

    > >I think that ISO would be a better place to work on collation, because
    > >it's not a character encoding issue, but a more general issue about
    > >handling linguistic data and semantics.

    It is certainly the case that collation is not a character encoding
    issue per se.

    But the presupposition here is that "ISO" is better equipped to handle
    matters of linguistic data and semantics. "ISO" doesn't handle anything
    of the sort. It is a Standards Development Organization that pushes
    all matters of technical expertise down into appropriate working
    groups. The ISO subcommittee that now has formal responsibility
    for International String Ordering (ISO/IEC 14651) is SC2, the same
    subcommittee that deals with -- surprise -- ISO/IEC 10646. And the
    expertise regarding collation in that subcommittee resides in
    Working Group 2, the working group that does all the work on 10646.

    That alignment of activities, parallel on the ISO side for 10646 and
    14651 and on the Unicode Consortium side for the Unicode Standard
    and the UCA, should hardly be surprising, because the main issue for
    both ISO/IEC 14651 and the UCA is the appropriate extension of the
    main tables (the Common Template Table for 14651 and the Default
    Unicode Collation Element Table, DUCET, for the UCA) as characters
    are added to the repertoire of 10646 and the Unicode Standard.

    > >A unique solution for collation
    > >will not work for all languages.

    This is true. It is also well-understood and accounted for by the
    developers and maintainers of both UCA and ISO/IEC 14651.

    > >I think that a more open standard that
    > >will be based on various profiles (including Unicode's UCA as one of
    > >those profiles) with more precise but more open definitions bound in
    > >priority to linguistic issues would be welcome.

    But this seems to reflect a lack of understanding of the nature of
    tailoring for particular collations, both in UCA and ISO/IEC 14651.
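
    To make concrete what "tailoring" means here: a tailoring is a
    per-language adjustment applied on top of a common default table,
    not a free-standing profile. As a minimal sketch (not part of the
    UCA specification itself), the JDK's java.text.Collator API follows
    the same tailorable design; asking for different locales yields
    differently tailored versions of one underlying ordering, assuming
    a typical JDK with its bundled locale data:

        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class TailoringDemo {
            public static void main(String[] args) {
                String[] names = { "Zorro", "Öberg" };

                // German tailoring: ö sorts with o, so Öberg < Zorro.
                String[] de = names.clone();
                Arrays.sort(de, Collator.getInstance(Locale.GERMAN));
                System.out.println("de: " + Arrays.toString(de));

                // Swedish tailoring: ö sorts after z, so Zorro < Öberg.
                String[] sv = names.clone();
                Arrays.sort(sv, Collator.getInstance(new Locale("sv", "SE")));
                System.out.println("sv: " + Arrays.toString(sv));
            }
        }

    Both results come from one shared framework plus a locale-specific
    delta, which is exactly how the UCA and ISO/IEC 14651 accommodate
    divergent national orderings.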

    > It has been discussed a bit in the LaTeX list, and it is clear that these
    > language and region related issues are very complex. Other issues are how to
    > represent dates in various localities, where the same language, but a
    > different locality, will use different conventions. For example, Australian,
    > UK, and US conventions. Then people may make a pick between different
    > conventions in their text. So if Unicode sticks its nose into those waters,
    > one is likely to get in over one's head.

    Please see:

    http://www.unicode.org/cldr/

    On the contrary, I'd say it is Philippe and Hans who are swimming in
    the deep end without a life preserver here.

    Philippe continued:

    > >Maybe Unicode.org
    > >could become the registration agency for those profiles (for example if the
    > >registry is made part of CLDR). But UCA and Unicode's DUCET are
    > >unusable as such.

    This is demonstrably false. It is currently being used as the basis
    of shipping software in major implementations.

    > > New collation algorithms are needed that will make
    > >things simpler and more efficient to cover large sets of languages for
    > >which the algorithm is poor (imprecise or ambiguous) and inefficient
    > >(slow, complex to implement).

    The UCA is certainly complex to implement, but then so is every
    similar approach to multi-level, linguistically appropriate sortkey
    weighting that preceded it. A proper implementation is *NOT* slow.
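
    The usual technique is to compute each string's multi-level sort key
    once, up front, so that the actual sort compares flat byte sequences
    rather than re-running the weighting algorithm on every comparison.
    A minimal sketch of that pattern, using the JDK's CollationKey as a
    stand-in for UCA sort keys:

        import java.text.CollationKey;
        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class SortKeyDemo {
            public static void main(String[] args) {
                Collator c = Collator.getInstance(Locale.FRENCH);
                c.setStrength(Collator.TERTIARY); // weight all three levels

                String[] words = { "cote", "coté", "côte", "côté" };

                // Compute each multi-level sort key exactly once...
                CollationKey[] keys = new CollationKey[words.length];
                for (int i = 0; i < words.length; i++) {
                    keys[i] = c.getCollationKey(words[i]);
                }

                // ...so the O(n log n) comparisons are cheap byte compares.
                Arrays.sort(keys);
                for (CollationKey k : keys) {
                    System.out.println(k.getSourceString());
                }
            }
        }

    With the keys precomputed, sorting a large corpus costs little more
    than sorting plain byte strings, which is why properly engineered
    implementations are not slow.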

    > >On the contrary, working groups on collation categorized by linguistic
    > >domains could be created at ISO, to cover several groups of languages,
    > >based only on the ISO10646 character repertoire, and with their own
    > >sets of character properties independent of Unicode, these properties
    > >becoming significant only for the covered languages.

    And that is a recipe for chaos and non-interoperability.

    It is one thing to propose that groups with regional expertise
    in sorting practice for one language or group of languages develop
    specifications regarding how that language or group of languages
    should be sorted. It is an entirely different matter to be
    proposing that they disanchor their specifications from
    the specification of Unicode character properties.

    > >another example: the set of normative properties in Unicode needed
    > >before characters can be standardized is becoming too large. This is a
    > >critical issue, because it is slowing the standardization process also
    > >at ISO10646.

    False.

    > >So Unicode tends to assign normative properties too early, properties
    > >that will become unusable later and that will require application
    > >developers to use their own non-compliant solutions (examples found and
    > >needed today with Biblical Hebrew

    This is an invalid generalization from a set of known issues regarding
    fixed position combining classes for Hebrew points -- known issues
    that have been chewed over ad nauseam on the Hebrew list here, and
    which have yielded to solutions as a result of the very process
    of compromise apparently decried by Hans above.

    > >and Phoenician).

    And that is an utter non sequitur, because the normative properties
    of Phoenician characters do not have, and never have had, any
    connection to the controversy over the encoding of the script.

    Hans continued:

    > That seems to be the problem with Unicode: by wanting to do too much, one
    > will provide norms that merely will be disobeyed. This is a general
    > problem with standards, not only Unicode. Therefore, quite a few
    > standards will never in effect be used.

    Bypassing the faulty logic here, I would point out that the Unicode
    Standard *is* used and its specifications *are* followed, rather well,
    in fact, by many vendors.

    Philippe continued:
     
    > >Splitting the standard would help abandoning some parts of it in favor
    > >of other ones. So applications could be still conforming to ISO10646
    > >and a reduced Unicode standard, but could adopt other standards for all
    > >other domains not covered by the core Unicode standard.

    This is utter pie in the sky. Not only do I see no motivation for this,
    but there is also nobody waiting in the wings to take on the task.

    > >doing this
    > >should not require reencoding texts. But it could really accelerate the
    > >production of documents with still unencoded scripts or characters.

    And this is nonsense. It would only increase uncertainty and
    confusion, and would *slow down* the production of documents for
    still unencoded scripts or characters.

    Experts representing living minority scripts not yet in the Unicode
    Standard or not yet fully covered by the Unicode Standard "get it"
    now. In the past year there have been excellent examples of
    productive collaboration that have sped up the encoding of
    Tifinagh, N'Ko (for Mandekan speakers in Guinea and neighboring
    countries), extensions for Ethiopic, and Balinese. Many others are
    in the works.

    It is a shame that Philippe doesn't "get it" that splitting such
    efforts off from the general process of extension of the Unicode
    Standard would have the net effect of isolating and disenfranchising
    such groups, rather than enabling them in the IT world.

    Hans said:

    > I think that Unicode should focus on providing the character set, the
    > character numbering, and in some cases, rules for combined characters.

    Hans is entitled to think that, but he is wrong. The accumulated
    engineering expertise of the software engineers working on
    the standard and its implementation over the last 15 years is,
    in fact, what has driven the Unicode Consortium to incorporate
    all kinds of semantic information beyond mere character
    encoding repertoire into the Unicode Standard. Hans' position
    is approximately where the Unicode founders were in 1989, in
    their thinking about what the task was for the Unicode Standard.
    He has a little catching up to do here.

    > If
    > the encoding issue had been handled correctly, it would have been
    > completely independent of these issues.

    It isn't clear what Hans means by that statement, but whatever
    it is, I suspect he's wrong on that, too. ;-)

    Philippe continued:

    > >Finally, Unicode does not cover a domain which is very important for
    > >the creation of digital text corpora: orthographies (and their
    > >associated conventions).

    That is true.

    > >This is a place where nothing can be
    > >standardized without breaking some Unicode conformance level,

    But that statement is clearly false.

    > >even
    > >though standard orthographies could be much more easily developed based
    > >only on the ISO10646 repertoire definition.

    And so is that one.

    Hans responded:
     
    > This clearly belongs to the question of lexing and parsing a sequence of
    > characters. Unicode should stay out of that as much as possible, I think.

    If by "this", Hans is referring to the development of orthographies
    as mentioned by Philippe, then that has little to do with lexing
    and parsing issues, per se. But I agree that the Unicode Standard
    should not (and does not) specify anything regarding orthographies.
     
    > >So the good question is: are all those character properties in Unicode
    > >needed or pertinent to cover all languages of the world? Unicode has no
    > >right and no expertise in linguistic issues; only in encoding issues.

    First of all, Unicode character properties have no direct bearing
    on linguistic issues, anyway.

    Second, the developers of the Unicode Standard *do* have expertise
    in linguistic issues, contrary to Philippe's apparent claim. The
    Unicode Standard itself is not, of course, directed at standardizing
    any linguistic or orthographic issue.
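
    For readers unsure what "character properties" are: they are
    mechanical, per-code-point facts (General_Category, directionality,
    and so on) consumed by text-processing software, not prescriptions
    about any language. A minimal sketch, querying the values that the
    JDK's Character class exposes from the Unicode Character Database:

        public class PropertyDemo {
            public static void main(String[] args) {
                // U+0041 'A', U+0663 ARABIC-INDIC DIGIT THREE, U+00E4 'ä'
                int[] cps = { 0x0041, 0x0663, 0x00E4 };
                for (int cp : cps) {
                    System.out.printf("U+%04X letter=%b digit=%b gc=%d dir=%d%n",
                        cp, Character.isLetter(cp), Character.isDigit(cp),
                        Character.getType(cp), Character.getDirectionality(cp));
                }
            }
        }

    Nothing in that output says anything about how a language spells,
    sorts, or hyphenates; the properties simply let software classify
    code points uniformly.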

    Hans started freewheeling:
     
    > I can think of more than one character set, going in different directions
    > relative to Unicode: One that is optimized by having as few characters as
    > possible. Another, going the opposite direction, might be more ample in the
    > set of characters, perhaps having one for each language-locality combination
    > that is unique. I do not think there is one set that is the right one; it
    > depends on what design objectives one has.

    Well, yes, I can think of things like that, too. And the point is?

    What we have for a *universal* character encoding standard is the
    Unicode Standard (and the synchronized standard ISO/IEC 10646).
    That is the result of the combined efforts of literally hundreds of
    contributors over a 15-year time span.

    Perhaps Hans would be so kind as to gift us with particulars as to
    by whom and how his imagined alternatives are to be realized and then
    rolled out into real implementations?
     
    --Ken


