Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Jul 09 2004 - 14:27:42 CDT


    comments below

    Mark

    ----- Original Message -----
    From: "Michael Everson" <everson@evertype.com>
    To: <unicode@unicode.org>
    Sent: Friday, July 09, 2004 09:37
    Subject: Changing UCA primary weights (bad idea)

    > At 09:06 -0700 2004-07-09, Mark Davis wrote:
    >
    > >There is a proposal being worked on to change the UCA primary weights,
    > >e.g., to give the same primary weights to O and O WITH STROKE, but as of
    > >this point the UCA does not fold the following cases marked "!uca".
    >
    > I would like to point out that a number of people OPPOSE this
    > proposal strongly, and even oppose the fact that some people are
    > working on such a proposal. It has many disadvantages.
    >
    > 1) it destabilizes the default tailorable template of ISO/IEC 14651
    > and the UCA which has been published for some time. Anyone who *has*
    > tailored it would have to do all that work all over again.

    You are certainly right that this is not a slam-dunk; there are reasons for
    and against it. And it may well be that the committee decides against it.

    However, you overstate the situation with tailorings. The only tailorings
    that would be affected are ones where the tailoring depends on inheriting
    the order from the UCA for the affected characters. In a great many of these
    cases, the UCA order must be tailored anyway for any of these characters
    that are needed in the languages. For example, Ø must be tailored for
    Danish:

    http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/collation/da.xml

    In the case of CLDR, we explicitly do not depend on the UCA ordering for
    these characters; for example, in Polish you will see explicit weighting of
    Ł.

    http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/collation/pl.xml

    So I suspect that the number of tailorings that would be affected in
    practice is very small. However, if you have actual evidence of tailorings
    that would be adversely affected by John Cowan's list, I would love to see
    it. If you don't have any evidence, you probably shouldn't try to push this
    point!
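
    A minimal sketch of this point, assuming a toy primary-weight table rather
    than real UCA or CLDR data. The tables CURRENT_DEFAULT, PROPOSED_DEFAULT,
    and DANISH_LIKE and the helper sort_words are hypothetical names invented
    for illustration. It only shows that a tailoring which explicitly weights ø
    gives the same order whichever default it overrides:

```python
import string

LETTERS = string.ascii_lowercase
O_SLASH = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE

# Toy primary weights: (position in the alphabet, offset). Not real UCA data.
BASE = {ch: (i, 0) for i, ch in enumerate(LETTERS)}

# Hypothetical "current" default: the letter has its own primary just after o.
CURRENT_DEFAULT = {**BASE, O_SLASH: (LETTERS.index("o"), 1)}
# Hypothetical "proposed" default: the letter shares the primary weight of o.
PROPOSED_DEFAULT = {**BASE, O_SLASH: (LETTERS.index("o"), 0)}

# Danish-like tailoring that explicitly places the letter after z.
DANISH_LIKE = {O_SLASH: (len(LETTERS), 0)}


def sort_words(words, default, tailoring=None):
    """Sort by primary weights only; tailoring entries override the default."""
    table = {**default, **(tailoring or {})}
    return sorted(words, key=lambda w: [table[ch] for ch in w])


WORDS = ["oz", O_SLASH + "b", "pa", "yb", "oa"]

untailored_current = sort_words(WORDS, CURRENT_DEFAULT)
untailored_proposed = sort_words(WORDS, PROPOSED_DEFAULT)
tailored_current = sort_words(WORDS, CURRENT_DEFAULT, DANISH_LIKE)
tailored_proposed = sort_words(WORDS, PROPOSED_DEFAULT, DANISH_LIKE)

print("untailored, current default: ", untailored_current)
print("untailored, proposed default:", untailored_proposed)
print("tailored, current default:   ", tailored_current)
print("tailored, proposed default:  ", tailored_proposed)
```

    In this toy model only the untailored orders differ; a tailoring that
    inherits the default position of ø, instead of setting it, would change
    along with the default.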

    >
    > 2) it proposes to reverse the *explicit* design principles that went
    > into the default tailorable template in the *first* place. Similar
    > letters are near -- but not interfiled with -- similar letters. This
    > is MORE than enough to give any casual user the functionality he
    > needs, because only in initial position is there likely to be any
    > confusion in real-life sorted word lists, and even then, hooked-b
    > follows bz, which is hardly burdensome for the end user.

    This also completely overstates the case. What we actually did was to put
    similar letters near other letters, *and if their decompositions were the
    same* we interfiled them. There is, however, little principled difference
    between Å, Ł, Ļ, Ñ, Ø, Ơ, and Ô that would cause a user to think that
    some should be interfiled and some should not. In some languages these would
    be seen as "separate letters" (e.g. with different primary weights) and in
    others not; but that does not line up in any particular way with what is in
    the UCA. (See also the comment below.)

    See http://www.unicode.org/charts/collation/chart_Latin.html for many other
    cases.
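
    As a rough illustration of the decomposition criterion, the sketch below
    uses Python's standard unicodedata module to report which of the letters
    named above have a canonical decomposition in the Unicode Character
    Database. The helper has_canonical_decomposition is a name chosen here, and
    the script says nothing about actual UCA weights:

```python
import unicodedata


def has_canonical_decomposition(ch):
    """True if the character has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    # An empty string means no mapping; compatibility mappings start with "<tag>".
    return bool(mapping) and not mapping.startswith("<")


# A-ring, L-stroke, L-cedilla, N-tilde, O-stroke, O-horn, O-circumflex
LETTERS = "\u00C5\u0141\u013B\u00D1\u00D8\u01A0\u00D4"

for ch in LETTERS:
    status = "decomposes" if has_canonical_decomposition(ch) else "no decomposition"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {status}")
```

    Only the two stroke letters lack a decomposition, which is the mechanical
    reason they are not interfiled.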

    >
    > 3) in discussions elsewhere, Mark has talked about what "most users"
    > "expect" and I found his suggestion to be anglocentric and
    > unsubstantiated.

    And I will refrain from saying what I think of your reasoning ability in
    general, although circularity seems to be a particular specialty. I suggest
    that we stick to the facts instead of ad hominem attacks.

    For user expectations, check out how foreign words with unusual accents are
    sorted in a variety of languages. I have seen no reason to believe that
    Germans or French or others behave much differently when faced with a letter
    like ø that is not one that they use. The key is whether they would expect
    to see:

    a) Interleaved:
    ..oa..
    ..øb..
    ..oz..

    b) Separate but near:
    ..oz..
    ..øb..
    ..pa..

    c) Like a particular language (Danish)
    ..yb..
    ..øb..

    ============

    a) Interleaved:
    ..oa..
    ..öb..
    ..oz..

    b) Separate but near:
    ..oz..
    ..öb..
    ..pa..

    c) Like a particular language (Swedish or Phonebook German)
    ..yb..
    ..öb..

    ..od..
    ..öz..
    ..of..

    People I've talked to, from various different backgrounds, have expected
    behavior (a) for both letters ø and ö, or occasionally (b) for them.
    *Nobody* expected the UCA-type inconsistency: behavior (b) for ø, but
    behavior (a) for ö.

    Moreover, this is also inconsistent with any generative use of characters
    like stroke, since they are always interfiled in UCA.
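
    The three behaviors for ø can be sketched with a toy two-level sort key.
    This is a simplified model under stated assumptions, not the UCA algorithm:
    the weights are made up, and make_key and the mode names "interleaved",
    "near", and "danish" are hypothetical labels for cases (a), (b), and (c).

```python
O_SLASH = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE


def make_key(mode):
    """Return a sort key giving O WITH STROKE one of three toy placements."""

    def key(word):
        primary, secondary = [], []
        for ch in word:
            if ch != O_SLASH:
                # Ordinary letters: primary weight from the code point.
                primary.append((ord(ch), 0))
                secondary.append(0)
            elif mode == "interleaved":
                # (a) same primary as o; differs only at the secondary level
                primary.append((ord("o"), 0))
                secondary.append(1)
            elif mode == "near":
                # (b) own primary weight, just after o and before p
                primary.append((ord("o"), 1))
                secondary.append(0)
            elif mode == "danish":
                # (c) own primary weight, after z
                primary.append((ord("z"), 1))
                secondary.append(0)
            else:
                raise ValueError(f"unknown mode: {mode}")
        # All primaries are compared before any secondary is consulted.
        return (primary, secondary)

    return key


WORDS = ["pa", "yb", O_SLASH + "b", "oz", "oa"]
ORDERS = {
    mode: sorted(WORDS, key=make_key(mode))
    for mode in ("interleaved", "near", "danish")
}

for mode, ordered in ORDERS.items():
    print(f"{mode:12s} {ordered}")
```

    Because the key compares every primary weight before any secondary weight,
    the interleaved mode files "øb" between "oa" and "oz".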

    >
    > 4) the CORRECT behaviour for individual letters already occurs with
    > the default tailorable template. Each individual e-like I.P.A. letter
    > sorts near, but not among, all the I.P.A. letters. That's as should
    > be. The proposal would interfile hundreds of letters within the
    > twenty-six letters A-Z and add some thorns and clicks at the end.
    > Therefore everyone BUT Mark's "most users" will have to tailor to get
    > anything like correct behaviour. Put another way: "most users" won't
    > see and don't care about hooked-b, but the template as it stands
    > gives the correct behaviour for it.

    More accurately, you believe that the correct behavior occurs. (Sadly, using
    BOLDFACE doesn't make it more true.) But you offer no evidence. Å is seen as
    a separate letter in the languages that use it, but UCA "interfiles" it. Ł
    is also seen as a separate letter, and UCA does not interfile it. Let's hear
    some evidence from your side, like people's reactions to the above cases.

    >
    > 5) if Mark wants to make a tailoring to interfile all these letters
    > (which can only result in what I describe as "visual seasickess" to
    > any poor users who have to actually read such wordlists.

    Again, no evidence. Let's look at a particular example, letters based on
    "O". UCA *already* interleaves the list below (UCA O List). Adding John's
    list to that would add only the two elements:

    00F8; LATIN SMALL LETTER O WITH STROKE
    01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE

    I fail to see how your purported user would be swamped by the relative
    magnitude of the change, which in the case of O would be adding about 1%
    more interleaved O's. How is this addition going to cause "visual
    seasickness", I wonder?

    UCA O List
    ====================
     o 006F LATIN SMALL LETTER O
     ｏ FF4F FULLWIDTH LATIN SMALL LETTER O
     ◌ͦ 0366 COMBINING LATIN SMALL LETTER O
     ℴ 2134 SCRIPT SMALL O
     𝐨 1D428 MATHEMATICAL BOLD SMALL O
     𝑜 1D45C MATHEMATICAL ITALIC SMALL O
     𝒐 1D490 MATHEMATICAL BOLD ITALIC SMALL O
     𝓸 1D4F8 MATHEMATICAL BOLD SCRIPT SMALL O
     𝔬 1D52C MATHEMATICAL FRAKTUR SMALL O
     𝕠 1D560 MATHEMATICAL DOUBLE-STRUCK SMALL O
     𝖔 1D594 MATHEMATICAL BOLD FRAKTUR SMALL O
     𝗈 1D5C8 MATHEMATICAL SANS-SERIF SMALL O
     𝗼 1D5FC MATHEMATICAL SANS-SERIF BOLD SMALL O
     𝘰 1D630 MATHEMATICAL SANS-SERIF ITALIC SMALL O
     𝙤 1D664 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL O
     𝚘 1D698 MATHEMATICAL MONOSPACE SMALL O
     ⓞ 24DE CIRCLED LATIN SMALL LETTER O
     O 004F LATIN CAPITAL LETTER O
     Ｏ FF2F FULLWIDTH LATIN CAPITAL LETTER O
     𝐎 1D40E MATHEMATICAL BOLD CAPITAL O
     𝑂 1D442 MATHEMATICAL ITALIC CAPITAL O
     𝑶 1D476 MATHEMATICAL BOLD ITALIC CAPITAL O
     𝒪 1D4AA MATHEMATICAL SCRIPT CAPITAL O
     𝓞 1D4DE MATHEMATICAL BOLD SCRIPT CAPITAL O
     𝔒 1D512 MATHEMATICAL FRAKTUR CAPITAL O
     𝕆 1D546 MATHEMATICAL DOUBLE-STRUCK CAPITAL O
     𝕺 1D57A MATHEMATICAL BOLD FRAKTUR CAPITAL O
     𝖮 1D5AE MATHEMATICAL SANS-SERIF CAPITAL O
     𝗢 1D5E2 MATHEMATICAL SANS-SERIF BOLD CAPITAL O
     𝘖 1D616 MATHEMATICAL SANS-SERIF ITALIC CAPITAL O
     𝙊 1D64A MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL O
     𝙾 1D67E MATHEMATICAL MONOSPACE CAPITAL O
     Ⓞ 24C4 CIRCLED LATIN CAPITAL LETTER O
     º 00BA MASCULINE ORDINAL INDICATOR
     ᴼ 1D3C MODIFIER LETTER CAPITAL O
     ᵒ 1D52 MODIFIER LETTER SMALL O
     ó 00F3 LATIN SMALL LETTER O WITH ACUTE
     Ó 00D3 LATIN CAPITAL LETTER O WITH ACUTE
     ò 00F2 LATIN SMALL LETTER O WITH GRAVE
     Ò 00D2 LATIN CAPITAL LETTER O WITH GRAVE
     ŏ 014F LATIN SMALL LETTER O WITH BREVE
     Ŏ 014E LATIN CAPITAL LETTER O WITH BREVE
     ô 00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX
     Ô 00D4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX
     ố 1ED1 LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
     Ố 1ED0 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND ACUTE
     ồ 1ED3 LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
     Ồ 1ED2 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE
     ỗ 1ED7 LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
     Ỗ 1ED6 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND TILDE
     ổ 1ED5 LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
     Ổ 1ED4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
     ǒ 01D2 LATIN SMALL LETTER O WITH CARON
     Ǒ 01D1 LATIN CAPITAL LETTER O WITH CARON
     ö 00F6 LATIN SMALL LETTER O WITH DIAERESIS
     Ö 00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
     ȫ 022B LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
     Ȫ 022A LATIN CAPITAL LETTER O WITH DIAERESIS AND MACRON
     ő 0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE
     Ő 0150 LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
     õ 00F5 LATIN SMALL LETTER O WITH TILDE
     Õ 00D5 LATIN CAPITAL LETTER O WITH TILDE
     ṍ 1E4D LATIN SMALL LETTER O WITH TILDE AND ACUTE
     Ṍ 1E4C LATIN CAPITAL LETTER O WITH TILDE AND ACUTE
     ṏ 1E4F LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
     Ṏ 1E4E LATIN CAPITAL LETTER O WITH TILDE AND DIAERESIS
     ȭ 022D LATIN SMALL LETTER O WITH TILDE AND MACRON
     Ȭ 022C LATIN CAPITAL LETTER O WITH TILDE AND MACRON
     ȯ 022F LATIN SMALL LETTER O WITH DOT ABOVE
     Ȯ 022E LATIN CAPITAL LETTER O WITH DOT ABOVE
     ȱ 0231 LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
     Ȱ 0230 LATIN CAPITAL LETTER O WITH DOT ABOVE AND MACRON
     ǫ 01EB LATIN SMALL LETTER O WITH OGONEK
     Ǫ 01EA LATIN CAPITAL LETTER O WITH OGONEK
     ǭ 01ED LATIN SMALL LETTER O WITH OGONEK AND MACRON
     Ǭ 01EC LATIN CAPITAL LETTER O WITH OGONEK AND MACRON
     ō 014D LATIN SMALL LETTER O WITH MACRON
     Ō 014C LATIN CAPITAL LETTER O WITH MACRON
     ṓ 1E53 LATIN SMALL LETTER O WITH MACRON AND ACUTE
     Ṓ 1E52 LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
     ṑ 1E51 LATIN SMALL LETTER O WITH MACRON AND GRAVE
     Ṑ 1E50 LATIN CAPITAL LETTER O WITH MACRON AND GRAVE
     ỏ 1ECF LATIN SMALL LETTER O WITH HOOK ABOVE
     Ỏ 1ECE LATIN CAPITAL LETTER O WITH HOOK ABOVE
     ȍ 020D LATIN SMALL LETTER O WITH DOUBLE GRAVE
     Ȍ 020C LATIN CAPITAL LETTER O WITH DOUBLE GRAVE
     ȏ 020F LATIN SMALL LETTER O WITH INVERTED BREVE
     Ȏ 020E LATIN CAPITAL LETTER O WITH INVERTED BREVE
     ơ 01A1 LATIN SMALL LETTER O WITH HORN
     Ơ 01A0 LATIN CAPITAL LETTER O WITH HORN
     ớ 1EDB LATIN SMALL LETTER O WITH HORN AND ACUTE
     Ớ 1EDA LATIN CAPITAL LETTER O WITH HORN AND ACUTE
     ờ 1EDD LATIN SMALL LETTER O WITH HORN AND GRAVE
     Ờ 1EDC LATIN CAPITAL LETTER O WITH HORN AND GRAVE
     ỡ 1EE1 LATIN SMALL LETTER O WITH HORN AND TILDE
     Ỡ 1EE0 LATIN CAPITAL LETTER O WITH HORN AND TILDE
     ở 1EDF LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
     Ở 1EDE LATIN CAPITAL LETTER O WITH HORN AND HOOK ABOVE
     ợ 1EE3 LATIN SMALL LETTER O WITH HORN AND DOT BELOW
     Ợ 1EE2 LATIN CAPITAL LETTER O WITH HORN AND DOT BELOW
     ọ 1ECD LATIN SMALL LETTER O WITH DOT BELOW
     Ọ 1ECC LATIN CAPITAL LETTER O WITH DOT BELOW
     ộ 1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
     Ộ 1ED8 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW
    ====================

    >
    > 6) the Latin alphabet has a lot more than 26 letters in it. In this
    > age of the Universal Character Set, "most users" would do better to
    > get used to this than to be hobbled by older concepts.

    I agree with the general principle, but it has no bearing on the topic at
    hand.

    > --
    > Michael Everson * * Everson Typography * * http://www.evertype.com
    >
    >


