Mark Davis, 2004-07-26
This document contains proposals for changes in the UCA to the weightings of a small number of Latin characters (out of 1870 entries in the UCA for Latin).
It is important that we ensure that the UCA weightings are as good as possible. Collation has application not merely in presenting lists of sorted strings to users, but also in database queries and in language-sensitive matching. It is crucial, for example, that user expectations about ordering are met. Consider the example of a German businessman making a database selection, such as summing up revenue in each of the cities from O... to P... for planning purposes. If behind his back all cities starting with Ö are excluded because the query selection is using a Swedish collation, there is going to be one very unhappy customer. Similarly, sorting "Søren" after "Sozar" in a long list, if that is not expected in the user's language, will cause problems. A user will look for "Søren" between "Sorem" and "Soret", not see it on the page, and assume it isn't there, fooled by the fact that it is on a completely different page. In matching, the same can occur, which can cause significant problems for software customers; and as with database selection, the user may not realize what he is missing.
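The database selection scenario can be sketched in Python. This is a toy model under stated assumptions, not the UCA algorithm or its weight tables: the function names, the `SWEDISH_EXTRA` table, and the city list are all hypothetical, and the "primary key" is simply the string with combining marks removed.

```python
import unicodedata

# Assumption: Å, Ä, Ö are the letters this toy "Swedish-style" key files after Z.
# "[", "\\", "]" are used only because they follow "Z" in code point order.
SWEDISH_EXTRA = {"Å": "[", "Ä": "\\", "Ö": "]"}


def german_style_primary(s):
    """Toy primary key: drop combining marks, so Ö files with O."""
    decomposed = unicodedata.normalize("NFD", s.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swedish_style_primary(s):
    """Toy primary key: Å, Ä, Ö become separate letters after Z."""
    composed = unicodedata.normalize("NFC", s.upper())
    return "".join(
        SWEDISH_EXTRA.get(ch) or german_style_primary(ch) for ch in composed
    )


def select_range(names, low, high, primary):
    """Return the names whose primary key k satisfies low <= k < high."""
    return [n for n in names if low <= primary(n) < high]


cities = ["Offenbach", "Öhringen", "Oldenburg", "Osnabrück",
          "Paderborn", "Potsdam", "Quedlinburg"]

german = select_range(cities, "O", "Q", german_style_primary)
swedish = select_range(cities, "O", "Q", swedish_style_primary)

print(german)   # includes "Öhringen"
print(swedish)  # silently omits "Öhringen"
```

The query text is identical in both cases; only the collation behind it differs, which is why the user has no signal that a city was dropped.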
With Unicode being deployed so widely, this is even more important; multilingual data becomes the rule, not the exception. A French company with customers all over Europe is going to have names from many different languages — French, German, Polish, Swedish, etc. If a German employee sets the sorting (or matching/selecting) language to be German, then the names need to show up in the order appropriate for German, even though there will be many different accented characters that do not normally appear in German text.
Stability is important, and we want to consider changes carefully. However, we know that if the UCA is changed in any way, then any tailoring is affected in that it will produce a different ordering for some characters (any that it does not explicitly override). So any implementation's versioning scheme must take account of this. This will always be the case, unless we completely freeze the UCA, disallowing fixes for, say, Indic characters. But the UTC clearly has not agreed to freeze the UCA; while stability is very important, we have left ourselves the ability to make changes in the UCA when warranted.
And the only tailorings that would be affected for the worse are ones where the tailoring depends
on inheriting the order from the UCA for the affected characters. In a great many of these cases,
the UCA order must be tailored anyway for any of these characters that are needed in the languages.
For example, Ø must be tailored for Danish (da.xml).
Note that in CLDR, we explicitly do not depend on the UCA ordering for the following characters when
they are considered separate letters in the language; for example, in Polish you will see explicit
weighting of Ł (pl.xml).
The character Æ (and its lowercase) should sort with a primary weight of AE, just like Œ sorts with a primary weight of OE currently.
Character | Code point | Name
---|---|---
Æ | 00C6 | LATIN CAPITAL LETTER AE
Ǽ | 01FC | LATIN CAPITAL LETTER AE WITH ACUTE
Ǣ | 01E2 | LATIN CAPITAL LETTER AE WITH MACRON
Traditionally, except for a very few languages, Æ is considered to be a presentation variant of AE. You see that in the variation in spelling between words like hæmophilia and haemophilia, or cæsium and caesium. In some languages (or variants, like American English*), the spelling has been reformed to convert to 'e' or another letter. But the vast majority of people who see Æ will consider it to behave like AE.
* Note: of course, there may be more dramatic respellings than the American one, such as: caeisiam, cæsium, cäsiumn, cesi, cesio, césio, cesiom, cesiun, cesium, cesium, césiumm, cesiumn, cesiwm, cesyum, céz, cezm, cezi', cezij, cēzijs, cezio, cezis, cezium, cézium, kaishum, sesín, sesium, seziom, sezyum, siżjum, tseesium, tseziumu, xezi, xêzi, zäsium, zäsiumn, zesioa.
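To illustrate what "a primary weight of AE" means in practice, here is a minimal Python sketch. It is an assumption-laden toy, not the UCA: the `EXPANSIONS` table and `primary_key` function are hypothetical, accents are dropped by removing combining marks, and ties at the primary level are broken by the raw string as a stand-in for the lower comparison levels.

```python
import unicodedata

# Hypothetical expansion table: each ligature gets the primary key of two letters.
EXPANSIONS = {"Æ": "AE", "Œ": "OE"}


def primary_key(s):
    """Toy primary key: uppercase, drop combining marks, expand ligatures."""
    decomposed = unicodedata.normalize("NFD", s.upper())
    base = (ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXPANSIONS.get(ch, ch) for ch in base)


words = ["caffeine", "cæsium", "cadmium", "caesium", "calcium"]

# Ties on the primary key fall back to the raw string (stand-in for lower levels).
ordered = sorted(words, key=lambda w: (primary_key(w), w))

print(ordered)
# ['cadmium', 'caesium', 'cæsium', 'caffeine', 'calcium']
```

Because Ǽ and Ǣ canonically decompose to Æ plus a combining mark, the same function gives them the primary key "AE" as well.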
The following characters (and their lowercases) should be made secondary differences from their bases in UCA 4.1. They are arranged in rough priority order, based on frequency of usage. The UCA should change at least the first group, although all of them are recommended.
Character | Code point | Name | Languages on http://www.eki.ee/letter/ | |
---|---|---|---|---|
Ø | 00D8 | LATIN CAPITAL LETTER O WITH STROKE | da [Danish]; fo [Faroese]; kl [Greenlandic]; no [Norwegian]; | |
Ǿ | 01FE | LATIN CAPITAL LETTER O WITH STROKE AND ACUTE | (but included for consistency with O WITH STROKE) | |
Đ | 0110 | LATIN CAPITAL LETTER D WITH STROKE | bs [Bosnian]; hr [Croatian]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; sl [Slovenian]; vi [Vietnamese]; | |
Ł | 0141 | LATIN CAPITAL LETTER L WITH STROKE | pl [Polish]; sorb1 [Lower Sorbian]; sorb2 [Upper Sorbian]; sla [Kashubian]; | |
Ŀ | 013F | LATIN CAPITAL LETTER L WITH MIDDLE DOT | ca [Catalan]; | |
Ð | 00D0 | LATIN CAPITAL LETTER ETH | fo [Faroese]; is [Icelandic]; | |
Ħ | 0126 | LATIN CAPITAL LETTER H WITH STROKE | mt [Maltese]; | |
Ŧ | 0166 | LATIN CAPITAL LETTER T WITH STROKE | sami2 [North Sámi]; | |
Ǥ | 01E4 | LATIN CAPITAL LETTER G WITH STROKE | ||
Ŋ | 014A | LATIN CAPITAL LETTER ENG | bm [Bambara]; ff [Fula]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; wo [Wolof]; dink [Dinka]; | |
Ɓ | 0181 | LATIN CAPITAL LETTER B WITH HOOK | ha [Hausa]; ff [Fula]; or bm [Bambara]; | |
Ɗ | 018A | LATIN CAPITAL LETTER D WITH HOOK | ||
Ƙ | 0198 | LATIN CAPITAL LETTER K WITH HOOK | ||
Ɲ | 019D | LATIN CAPITAL LETTER N WITH LEFT HOOK | ||
Ƴ | 01B3 | LATIN CAPITAL LETTER Y WITH HOOK | ||
Ƃ | 0182 | LATIN CAPITAL LETTER B WITH TOPBAR | No information | |
Ƈ | 0187 | LATIN CAPITAL LETTER C WITH HOOK | ||
Ɖ | 0189 | LATIN CAPITAL LETTER AFRICAN D | ||
Ƒ | 0191 | LATIN CAPITAL LETTER F WITH HOOK | ||
Ɠ | 0193 | LATIN CAPITAL LETTER G WITH HOOK | ||
Ɨ | 0197 | LATIN CAPITAL LETTER I WITH STROKE | ||
Ƞ | 0220 | LATIN CAPITAL LETTER N WITH LONG RIGHT LEG | ||
Ƥ | 01A4 | LATIN CAPITAL LETTER P WITH HOOK | ||
Ƭ | 01AC | LATIN CAPITAL LETTER T WITH HOOK | ||
Ʈ | 01AE | LATIN CAPITAL LETTER T WITH RETROFLEX HOOK | ||
Ʋ | 01B2 | LATIN CAPITAL LETTER V WITH HOOK | ||
Ƶ | 01B5 | LATIN CAPITAL LETTER Z WITH STROKE | ||
Ȥ | 0224 | LATIN CAPITAL LETTER Z WITH HOOK |
Users don't distinguish between types of accents. They do not understand why the default ordering of LATIN CAPITAL LETTER I WITH OGONEK makes it sort with I, while the default ordering of LATIN CAPITAL LETTER Z WITH HOOK makes it sort as a letter completely separate from Z.
Į | 012E | LATIN CAPITAL LETTER I WITH OGONEK | |
Ȥ | 0224 | LATIN CAPITAL LETTER Z WITH HOOK |
Even where a language distinguishes certain accented letters as separate letters for collation/matching, users expect letters to be treated uniformly. In Polish, the letters with diacritics Ą Ć Ę Ł Ń Ó Ś Ź Ż are sorted after the corresponding letters without. When asked, Polish users expect these either all to be separate letters or all to be sorted with their bases: they see no reason for singling out Ł for different treatment from the others.
And if a German customer is accessing a database full of European names, and expects to find Ę with E, and Ą with A and Ż with Z and Ł with L, then he will be right except for the last one with the current UCA. If s/he expects that a database SELECT of all client names starting with "L" will include the "Ł" names also, then s/he will get the wrong answer in a financial report — probably not realizing it is wrong. If s/he looks for a client name Słownik* within a page of Sl... and doesn't think to look 3 pages down after Sz, then s/he will get the wrong answer — probably not realizing it is wrong. If s/he searches for a name within a body of text using a weak language-sensitive match, and doesn't find it, then s/he will get the wrong answer — probably not realizing it is wrong.
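The SELECT case can be sketched as a prefix match on a toy primary key. Everything here is an illustrative assumption rather than actual UCA data: `NO_DECOMPOSITION_BASES`, `primary_current`, `primary_proposed`, and the name list are hypothetical. The sketch relies on one real property: Ę, Ą and Ż have canonical decompositions into a base letter plus a combining mark, while Ł does not.

```python
import unicodedata

# Hypothetical table for letters that have no canonical decomposition.
NO_DECOMPOSITION_BASES = {"Ł": "L", "Ø": "O", "Đ": "D"}


def strip_marks(s):
    """Uppercase and drop combining marks (Ę -> E, Ą -> A, Ż -> Z)."""
    decomposed = unicodedata.normalize("NFD", s.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def primary_current(s):
    """Toy stand-in for the current behavior: Ł keeps its own primary key."""
    return strip_marks(s)


def primary_proposed(s):
    """Toy stand-in for the proposal: Ł shares the primary key of L."""
    return "".join(NO_DECOMPOSITION_BASES.get(ch, ch) for ch in strip_marks(s))


def select_prefix(names, prefix, primary):
    """Return the names whose primary key starts with the prefix's primary key."""
    p = primary(prefix)
    return [n for n in names if primary(n).startswith(p)]


names = ["Lange", "Łukasiewicz", "Lewandowski", "Łoś", "Müller"]

print(select_prefix(names, "L", primary_current))
# ['Lange', 'Lewandowski']
print(select_prefix(names, "L", primary_proposed))
# ['Lange', 'Łukasiewicz', 'Lewandowski', 'Łoś']
```

The first result is the one that ends up in the financial report without any indication that two clients are missing.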
Again we see the same pattern of behavior:
Q. Doesn't this propose to reverse the explicit design principles that went into the default tailorable template in the first place? Similar letters are near (but not interfiled with) similar letters. This is more than enough to give any casual user the functionality he needs, because only in initial position is there likely to be any confusion in real-life sorted word lists.
A. What we actually did was to put similar letters near other letters, and if their decompositions were the same we interfiled them. To users, however, there is little difference between Å, Ł, Ļ, Ñ, Ø, Ơ, and Ô that would cause a user to think that some should be interfiled and some should not. Å is seen as a separate letter in the languages that use it, but UCA "interfiles" it. Ł is also seen as a separate letter, and UCA doesn't. In some languages these would be seen as "separate letters" (e.g. with different primary weights) and in others not; but that does not line up in any particular way with what is in the UCA.
And making it a primary vs secondary difference can have some important consequences. Not all sorted data consists of very small lists, with all affected characters within a few lines of each other on a single page, where placement doesn't matter too much. That assumption doesn't hold for large lists, database selection, matching (where I won't see that I am missing something), etc.
Q. O-slash is treated as a separate letter in the pronunciation guides of all IPA-based dictionaries, which constitute the majority of the world's usage, currently. So shouldn't it be left as a "separate letter"?
A. First, we don't know that UCA out of the box sorts IPA correctly — nor do we have much of an idea what constitutes the "correct" IPA sorting. The IPA specification itself does not appear to have any sorting requirement. Secondly, even in dictionaries, the entries are not normally sorted by the IPA, they are sorted by the words that the IPA is glossing. Thirdly — and much more importantly — the amount of sorted IPA data is going to be dwarfed by the amount of data sorted according to normal language conventions.
The fact that IPA uses these letters as being different is completely aside from the
point. Everyone agrees that for that purpose they are different characters: Å and A are different
characters, but interleaved in UCA; Ł and L are different characters, but not interleaved in UCA.
Q. Won't this produce a visually disturbing effect, as in the following?
No. | Interleaved (Recommendation) | Separate (Current UCA)
---|---|---
1 | ofofofo oføfofo øfofofo øfoføfø ofofofp | ofofofo ofofofp oføfofo øfofofo øfoføfø
This is a curious perception, since this is only one case out of 102 accented o's, where all the others are interleaved. And of course visual disturbance of multiple characters in such artificial examples with multiple marks has little to do with sorting/matching behavior.
No. | Interleaved (Current UCA)
---|---
2 | ofofofo ofơfofo ơfofofo ơfofơfơ ofofofp
3 | ofofofo ofõfofo õfofofo õfofõfõ ofofofp
... |
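The two orderings in example 1 can be reproduced with a small two-level sort key. This is only a sketch under simplifying assumptions: the weights are invented for illustration (they are not UCA weights), the function names are hypothetical, and only ø is modeled.

```python
def key_interleaved(s):
    """ø has the same primary weight as o and differs only at the secondary level."""
    primary = s.replace("ø", "o")
    secondary = tuple(1 if ch == "ø" else 0 for ch in s)
    return (primary, secondary)


def key_separate(s):
    """ø has its own primary weight, placed directly after o and before p."""
    def weight(ch):
        # Invented weights: doubling code points leaves a free slot after "o".
        return ord("o") * 2 + 1 if ch == "ø" else ord(ch) * 2
    return tuple(weight(ch) for ch in s)


words = ["ofofofp", "øfoføfø", "oføfofo", "ofofofo", "øfofofo"]

interleaved = sorted(words, key=key_interleaved)
separate = sorted(words, key=key_separate)

print(interleaved)
# ['ofofofo', 'oføfofo', 'øfofofo', 'øfoføfø', 'ofofofp']
print(separate)
# ['ofofofo', 'ofofofp', 'oføfofo', 'øfofofo', 'øfoføfø']
```

In the interleaved key, the secondary tuple is consulted only when the primary strings are equal, which is why all four o/ø variants stay together ahead of "ofofofp".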