L2/04-311

UCA Latin Recommendations

Mark Davis, 2004-07-26

This document contains proposals for changes to the UCA weightings of a small number of Latin characters (out of 1870 entries in the UCA for Latin).

It is important that we ensure that the UCA weightings are as good as possible. Collation has application not merely in presenting lists of sorted strings to users, but also in database queries and in language-sensitive matching. It is crucial, for example, that user expectations about ordering are met. Consider the example of a German businessman making a database selection, such as summing up revenue in each of the cities from O... to P... for planning purposes. If behind his back all cities starting with Ö are excluded because the query selection is using a Swedish collation, there is going to be one very unhappy customer. Similarly, sorting "Søren" after "Sozar" in a long list (if that is not expected in the user's language) will cause problems. A user will look for "Søren" between "Sorem" and "Soret", not see it on the page, and assume it isn't there, fooled by the fact that it is on a completely different page. In matching, the same can occur, which can cause significant problems for software customers; and as with database selection, the user may not realize what he is missing.
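The database scenario above can be sketched with two toy collation keys. This is only an illustration under stated assumptions: `german_key` and `swedish_key` are hypothetical helpers, not real UCA or CLDR tailorings, and the city list is made-up sample data.

```python
import unicodedata

def german_key(name):
    """Toy German-style primary key: umlauts sort with their base letters."""
    decomposed = unicodedata.normalize("NFD", name.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Toy Swedish-style alphabet: Å, Ä and Ö are separate letters after Z.
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

def swedish_key(name):
    """Toy Swedish-style primary key: rank each letter in the alphabet above."""
    ranks = []
    for ch in unicodedata.normalize("NFC", name.upper()):
        idx = SWEDISH_ALPHABET.find(ch)
        # Letters outside the toy alphabet are simply ranked last.
        ranks.append(idx if idx >= 0 else len(SWEDISH_ALPHABET))
    return tuple(ranks)

def select_range(names, key, low, high):
    """Return the names whose key lies in the half-open range [low, high)."""
    return [n for n in names if key(low) <= key(n) < key(high)]

# Made-up sample data for the "cities from O... to P..." query.
cities = ["Nürnberg", "Offenbach", "Oldenburg", "Örebro",
          "Östersund", "Paris", "Quedlinburg"]

german_result = select_range(cities, german_key, "O", "Q")
swedish_result = select_range(cities, swedish_key, "O", "Q")

print(german_result)   # includes the Ö cities
print(swedish_result)  # silently omits them
```

The same query over the same data returns two fewer rows under the Swedish-style key, and nothing in the result tells the user that rows are missing.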

With Unicode being deployed so widely, this is even more important; multilingual data becomes the rule, not the exception. A French company with customers all over Europe is going to have names from many different languages — French, German, Polish, Swedish, etc. If a German employee sets the sorting (or matching/selecting) language to be German, then the names need to show up in the order appropriate for German, even though there will be many different accented characters that do not normally appear in German text.

Stability is important, and we want to consider changes carefully. However, we know that if the UCA is changed in any way, then any tailoring is affected in that it will produce a different ordering for some characters (any that it does not explicitly override). So any implementation's versioning scheme must take account of this. This will always be the case, unless we completely freeze the UCA, disallowing fixes for, say, Indic characters. But the UTC clearly has not agreed to freeze the UCA; while stability is very important, we have left ourselves the ability to make changes in the UCA when warranted.

And the only tailorings that would be affected for the worse are ones where the tailoring depends on inheriting the order from the UCA for the affected characters. In a great many of these cases, the UCA order must be tailored anyway for any of these characters that are needed in the languages. For example, Ø must be tailored for Danish (da.xml). Note that in CLDR, we explicitly do not depend on the UCA ordering for the following characters when they are considered separate letters in the language; for example, in Polish you will see explicit weighting of Ł (pl.xml).
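The point about inheritance can be illustrated with a minimal sketch. It assumes a tailoring behaves like a table of explicit overrides layered on the default table; the weights below are invented for illustration and are not real UCA values, and `tailor` is a hypothetical helper.

```python
# Toy default tables mapping a letter to a (primary, secondary) pair.
# The numbers are made up; they only model "Ł is its own letter" (current)
# versus "Ł is a secondary variant of L" (proposed).
CURRENT_DEFAULT = {"L": (10, 0), "Ł": (11, 0), "M": (12, 0)}
PROPOSED_DEFAULT = {"L": (10, 0), "Ł": (10, 1), "M": (12, 0)}

def tailor(default, overrides):
    """Return the effective table: the default with explicit overrides applied."""
    effective = dict(default)
    effective.update(overrides)
    return effective

# A Polish-style tailoring that weights Ł explicitly as a separate letter.
POLISH_OVERRIDES = {"Ł": (11, 0)}
# A tailoring that says nothing about Ł and therefore inherits the default.
NO_OVERRIDES = {}

polish_before = tailor(CURRENT_DEFAULT, POLISH_OVERRIDES)
polish_after = tailor(PROPOSED_DEFAULT, POLISH_OVERRIDES)
plain_before = tailor(CURRENT_DEFAULT, NO_OVERRIDES)
plain_after = tailor(PROPOSED_DEFAULT, NO_OVERRIDES)

print(polish_before == polish_after)  # explicit override: unaffected
print(plain_before == plain_after)    # inherited weight: changes
```

Only the tailoring that relies on the inherited weight sees a different result after the default changes.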
 

1. Changing Æ to be an expansion

The character Æ (and its lowercase) should sort with a primary weight of AE, just like Œ sorts with a primary weight of OE currently.

00C6 Æ 00C6 LATIN CAPITAL LETTER AE
01FC Ǽ 01FC LATIN CAPITAL LETTER AE WITH ACUTE
01E2 Ǣ 01E2 LATIN CAPITAL LETTER AE WITH MACRON

Traditionally, except for a very few languages, Æ is considered to be a presentation variant of AE. You can see this in the variation between spellings such as hæmophilia and haemophilia, or cæsium and caesium. In some languages (or variants, like American English*), the spelling has been reformed to convert to 'e' or another letter. But where the vast majority of people see Æ, they will expect it to behave like AE.

* Note: of course, there may be more dramatic respellings than the American one, such as: caeisiam, cæsium, cäsiumn, cesi, cesio, césio, cesiom, cesiun, cesium, cesium, césiumm, cesiumn, cesiwm, cesyum, céz, cezm, cezi', cezij, cēzijs, cezio, cezis, cezium, cézium, kaishum, sesín, sesium, seziom, sezyum, siżjum, tseesium, tseziumu, xezi, xêzi, zäsium, zäsiumn, zesioa.
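As a rough sketch of what "expansion" means at the primary level, the following toy key maps each ligature to the letters of its two-letter spelling. The `EXPANSIONS` table and `primary_key` function are hypothetical and only model primary weights; real UCA entries also carry secondary and tertiary weights that keep Æ distinct from AE at lower levels.

```python
import unicodedata

# Hypothetical expansion table. OE mirrors the current behaviour described
# in the text; AE is the proposed change.
EXPANSIONS = {"Œ": "OE", "Æ": "AE"}

def primary_key(word):
    """Toy primary key: ignore accents and expand ligatures to two letters."""
    out = []
    for ch in unicodedata.normalize("NFD", word.upper()):
        if unicodedata.combining(ch):
            continue  # accents do not count at the primary level
        out.append(EXPANSIONS.get(ch, ch))
    return "".join(out)

words = ["haft", "hæmophilia", "hag", "haemostasis", "cafe", "cæsium", "caesar"]
in_order = sorted(words, key=primary_key)
print(in_order)
```

With this key, cæsium and caesium are equal at the primary level, and Ǽ and Ǣ follow along because they decompose canonically to Æ plus a combining mark.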

2. Changing characters with diacritics to secondary difference

The following characters (and their lowercases) should be made secondary differences from their bases in UCA 4.1. They are arranged in rough priority order, based on frequency of usage. The UCA should change at least the first group, although all of them are recommended.

Characters                                                Languages on http://www.eki.ee/letter/
00D8 Ø 00D8 LATIN CAPITAL LETTER O WITH STROKE            da [Danish]; fo [Faroese]; kl [Greenlandic]; no [Norwegian];
01FE Ǿ 01FE LATIN CAPITAL LETTER O WITH STROKE AND ACUTE  (but included for consistency with O WITH STROKE)
0110 Đ 0110 LATIN CAPITAL LETTER D WITH STROKE            bs [Bosnian]; hr [Croatian]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; sl [Slovenian]; vi [Vietnamese];
0141 Ł 0141 LATIN CAPITAL LETTER L WITH STROKE            pl [Polish]; sorb1 [Lower Sorbian]; sorb2 [Upper Sorbian]; sla [Kashubian];
013F Ŀ 013F LATIN CAPITAL LETTER L WITH MIDDLE DOT        ca [Catalan];
00D0 Ð 00D0 LATIN CAPITAL LETTER ETH                      fo [Faroese]; is [Icelandic];
0126 Ħ 0126 LATIN CAPITAL LETTER H WITH STROKE            mt [Maltese];
0166 Ŧ 0166 LATIN CAPITAL LETTER T WITH STROKE            sami2 [North Sámi];
01E4 Ǥ 01E4 LATIN CAPITAL LETTER G WITH STROKE
014A Ŋ 014A LATIN CAPITAL LETTER ENG                      bm [Bambara]; ff [Fula]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; wo [Wolof]; dink [Dinka];
0181 Ɓ 0181 LATIN CAPITAL LETTER B WITH HOOK              ha [Hausa]; ff [Fula]; or bm [Bambara];
018A Ɗ 018A LATIN CAPITAL LETTER D WITH HOOK
0198 Ƙ 0198 LATIN CAPITAL LETTER K WITH HOOK
019D Ɲ 019D LATIN CAPITAL LETTER N WITH LEFT HOOK
01B3 Ƴ 01B3 LATIN CAPITAL LETTER Y WITH HOOK
0182 Ƃ 0182 LATIN CAPITAL LETTER B WITH TOPBAR            No information
0187 Ƈ 0187 LATIN CAPITAL LETTER C WITH HOOK
0189 Ɖ 0189 LATIN CAPITAL LETTER AFRICAN D
0191 Ƒ 0191 LATIN CAPITAL LETTER F WITH HOOK
0193 Ɠ 0193 LATIN CAPITAL LETTER G WITH HOOK
0197 Ɨ 0197 LATIN CAPITAL LETTER I WITH STROKE
0220 Ƞ 0220 LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
01A4 Ƥ 01A4 LATIN CAPITAL LETTER P WITH HOOK
01AC Ƭ 01AC LATIN CAPITAL LETTER T WITH HOOK
01AE Ʈ 01AE LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
01B2 Ʋ 01B2 LATIN CAPITAL LETTER V WITH HOOK
01B5 Ƶ 01B5 LATIN CAPITAL LETTER Z WITH STROKE
0224 Ȥ 0224 LATIN CAPITAL LETTER Z WITH HOOK

Users don't distinguish between types of accents. They do not understand why the default ordering of LATIN CAPITAL LETTER I WITH OGONEK makes it sort with I, while the default ordering of LATIN CAPITAL LETTER Z WITH HOOK makes it sort as a completely separate letter from Z.

012E Į 012E LATIN CAPITAL LETTER I WITH OGONEK
0224 Ȥ 0224 LATIN CAPITAL LETTER Z WITH HOOK

Even where a language distinguishes certain accented letters as separate letters for collation/matching, its users expect letters to be treated uniformly. In Polish, the letters with diacritics Ą Ć Ę Ł Ń Ó Ś Ź Ż are sorted after the corresponding letters without. When asked, Polish users expect either that all of them are separate letters or that all of them sort with their base: they see no reason for singling out Ł for different treatment.

And if a German customer is accessing a database full of European names, and expects to find Ę with E, Ą with A, Ż with Z, and Ł with L, then with the current UCA s/he will be right except for the last one. If s/he expects that a database SELECT of all client names starting with "L" will include the "Ł" names also, then s/he will get the wrong answer in a financial report, probably not realizing it is wrong. If s/he looks for a client name Słownik* within a page of Sl... and doesn't think to look 3 pages down after Sz, then s/he will get the wrong answer, probably not realizing it is wrong. If s/he searches for a name within a body of text using a weak language-sensitive match, and doesn't find it, then s/he will get the wrong answer, probably not realizing it is wrong.
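The SELECT case can be sketched as a primary-strength prefix match. This is a toy model, not the UCA itself: `primary_letters`, `select_prefix` and the `STROKE_BASES` table are hypothetical, and the client names are invented. The `fold_strokes` flag switches between the current behaviour (stroke letters are separate primaries) and the proposed one (stroke letters share the primary weight of their base).

```python
import unicodedata

# Stroke letters have no canonical decomposition, so a toy model needs an
# explicit table to treat them as "base letter plus mark". Illustrative only.
STROKE_BASES = {"Ł": "L", "Ø": "O", "Đ": "D"}

def primary_letters(text, fold_strokes):
    """Return the primary-level letters of text under the toy model."""
    letters = []
    for ch in unicodedata.normalize("NFD", text.upper()):
        if unicodedata.combining(ch):
            continue  # ogonek, dot above, umlaut etc. are not primary
        if fold_strokes:
            ch = STROKE_BASES.get(ch, ch)
        letters.append(ch)
    return "".join(letters)

def select_prefix(names, prefix, fold_strokes):
    """Names whose primary letters start with the prefix's primary letters."""
    wanted = primary_letters(prefix, fold_strokes)
    return [n for n in names
            if primary_letters(n, fold_strokes).startswith(wanted)]

# Invented sample data.
clients = ["Lange", "Łukasik", "Lübke", "Zimmer", "Żak"]

current_l = select_prefix(clients, "L", fold_strokes=False)
proposed_l = select_prefix(clients, "L", fold_strokes=True)
current_z = select_prefix(clients, "Z", fold_strokes=False)

print(current_l)   # the Ł name is missing
print(proposed_l)  # the Ł name is included
print(current_z)   # Ż is already found under Z
```

The asymmetry is visible in the output: Ż is matched by "Z" in both modes, while Ł is matched by "L" only when stroke letters are folded.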

Again we see the same pattern of behavior:

Q & A

Q. Doesn't this propose to reverse the explicit design principles that went into the default tailorable template in the first place? Similar letters are near, but not interfiled with, similar letters. This is more than enough to give any casual user the functionality he needs, because only in initial position is there likely to be any confusion in real-life sorted word lists.

A. What we actually did was to put similar letters near other letters, and if their decompositions were the same we interfiled them. To users, however, there is little difference between Å, Ł, Ļ, Ñ, Ø, Ơ, and Ô that would cause a user to think that some should be interfiled and some should not. Å is seen as a separate letter in the languages that use it, but UCA "interfiles" it. Ł is also seen as a separate letter, but UCA does not interfile it. In some languages these would be seen as "separate letters" (e.g. with different primary weights) and in others not; but that does not line up in any particular way with what is in the UCA.

And making it a primary vs secondary difference can have some important consequences. Not all sorted lists are very small, with all affected characters within a few lines of each other on a single page, where placement doesn't matter too much. Near-but-separate placement doesn't work with large lists, database selection, matching (where I won't see that I am missing something), etc.

Q. O-slash is treated as a separate letter in the pronunciation guides of all IPA-based dictionaries, which constitute the majority of the world's usage, currently. So shouldn't it be left as a "separate letter"?

A. First, we don't know that UCA out of the box sorts IPA correctly — nor do we have much of an idea what constitutes the "correct" IPA sorting. The IPA specification itself does not appear to have any sorting requirement. Secondly, even in dictionaries, the entries are not normally sorted by the IPA, they are sorted by the words that the IPA is glossing. Thirdly — and much more importantly — the amount of sorted IPA data is going to be dwarfed by the amount of data sorted according to normal language conventions.

The fact that IPA uses these letters as being different is completely beside the point. Everyone agrees that for that purpose they are different characters: Å and A are different characters, but interleaved in UCA; Ł and L are different characters, but not interleaved in UCA.

Q. Won't this produce a visually disturbing effect, as in the following?

   Interleaved (Recommendation)   Separate (Current UCA)
1  ofofofo                        ofofofo
   oføfofo                        ofofofp
   øfofofo                        oføfofo
   øfoføfø                        øfofofo
   ofofofp                        øfoføfø

This is a curious perception, since this is only one case out of 102 accented o's, where all the others are interleaved. And of course the visual disturbance caused by such artificial examples with multiple marks has little to do with sorting/matching behavior.

   Interleaved (Current UCA)
2  ofofofo
   ofơfofo
   ơfofofo
   ơfofơfơ
   ofofofp

3  ofofofo
   ofõfofo
   õfofofo
   õfofõfõ
   ofofofp
 
  ...
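The orderings in the examples above can be reproduced with a toy two-level sort key. This is a sketch under simplifying assumptions, not the UCA: `sort_key` and `ordered` are hypothetical helpers, every mark gets the same secondary weight, and a letter listed in `separate` gets its own primary weight immediately after its base letter.

```python
import unicodedata

# ø has no canonical decomposition, so its base letter is given explicitly.
STROKE_BASES = {"ø": "o"}

def sort_key(text, separate=frozenset()):
    """Toy key: compare all primary weights first, then all secondary ones."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", text):
        if ch in STROKE_BASES:
            base, marked = STROKE_BASES[ch], 1
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            base, marked = decomposed[0], int(len(decomposed) > 1)
        if ch in separate:
            # Separate letter: distinct primary weight right after the base.
            primary.append((base, 1))
            secondary.append(0)
        else:
            # Interleaved: same primary as the base, secondary difference only.
            primary.append((base, 0))
            secondary.append(marked)
    return (primary, secondary)

def ordered(words, separate=frozenset()):
    return sorted(words, key=lambda w: sort_key(w, separate))

example1 = ["øfoføfø", "ofofofp", "øfofofo", "ofofofo", "oføfofo"]
interleaved = ordered(example1)                 # recommendation
separated = ordered(example1, separate={"ø"})   # current treatment of ø
example2 = ordered(["ofofofp", "ơfofơfơ", "ofơfofo", "ofofofo", "ơfofofo"])

print(interleaved)
print(separated)
print(example2)
```

Under these assumptions the interleaved key gives ø the same pattern that ơ and õ already follow, while the separate key pushes every string containing ø after ofofofp.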