UCA Revised Latin?

L2/04-031

Re:	UCA Revised Latin?
From:	Mark Davis
Date:	2004-01-23

We should consider whether to make the following changes in the next version of the UCA.

[For the meeting, please also print http://www.unicode.org/charts/collation/chart_Latin.html]

1. Make alternate forms of letters (like the following) be secondary differences from the 'base' letter.

a	ɐ `0250`	ɑ `0251`	ɒ `0252`
b	ʙ `0299`	ƀ `0180`	ɓ `0253`	Ɓ `0181`	ƃ `0183`	Ƃ `0182`
c	ƈ `0188`	Ƈ `0187`	ɕ `0255`
d	đ `0111`	Đ `0110`	ɖ `0256`	Ɖ `0189`	ɗ `0257`	Ɗ `018A`	ƌ `018C`	Ƌ `018B`	ð `00F0`	Ð `00D0`	ƍ `018D`
etc.

Pros:
1. If a language does not use those letters, its users would expect them to be ordered as variants of a base letter. For example, a non-Scandinavian user would expect to see ø as a variant of o, and not have the ordering:
  1. sos...
  2. sot...
  3. sou...
    ...
  4. søs...
2. If a language does use those letters, they are very likely tailored someplace else anyway.
3. When a tailoring inserts letters, it typically places them after the base. Suppose, for example, that a language sorts t as primary-greater than d. Without special consideration for the variant forms, what a user would see is:
  1. sod...
  2. sot...
  3. sođ...
  4. soɖ...
  instead of what the user would expect:
  1. sod...
  2. sođ...
  3. soɖ...
  4. sot...
4. Better compatibility with the European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf), for letters that are in the repertoire.
Cons:
1. stability -- not a small con, so we need to consider it carefully!
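
The tailoring scenario in Pros item 3 can be illustrated with a minimal Python sketch. It is only a toy model: the weight tables, the `sort_key` helper, and the two-level key are assumptions made for illustration and are not taken from the UCA data or from any real collation library.

```python
# Toy two-level collation keys. All weights are invented for illustration
# and do not come from the UCA tables.

# Hypothetical tailoring that places t immediately after d.
# Current behaviour: đ and ɖ are separate primaries that follow d, so the
# tailored t lands between d and its variants.
PRIMARY_SEPARATE = {"s": 10, "o": 20, "d": 30, "t": 31, "đ": 32, "ɖ": 33}

# Proposed behaviour: đ and ɖ share the primary weight of d and differ
# from it only at the secondary level.
PRIMARY_VARIANT = {"s": 10, "o": 20, "d": 30, "đ": 30, "ɖ": 30, "t": 31}
SECONDARY_VARIANT = {"đ": 1, "ɖ": 2}  # letters not listed get secondary 0


def sort_key(word, primary, secondary=None):
    """Return (primary weights, secondary weights) for a word."""
    secondary = secondary or {}
    return (
        tuple(primary[ch] for ch in word),
        tuple(secondary.get(ch, 0) for ch in word),
    )


WORDS = ["sot", "soɖ", "sod", "sođ"]

# Variants as separate primaries: sod, sot, sođ, soɖ
current = sorted(WORDS, key=lambda w: sort_key(w, PRIMARY_SEPARATE))

# Variants as secondary differences: sod, sođ, soɖ, sot
proposed = sorted(
    WORDS, key=lambda w: sort_key(w, PRIMARY_VARIANT, SECONDARY_VARIANT)
)

print(current)
print(proposed)
```

Because all primary weights are compared before any secondary weight, a letter inserted after d at the primary level automatically follows every secondary variant of d.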

Outliers: the following appear unrelated to the 'base' letter that they follow in UCA order, so they should be left where they are.

Ƣ `01A2`	ƣ `01A3`	ɤ `0264`	etc.

2. Make "æ" be a secondary difference from "ae".

Pros:
1. consistency with the handling of "œ"
2. Currently, every language written in the Latin script has to tailor this character. Certain Scandinavian languages tailor it as a separate letter after z. All other languages would tailor it to be a secondary (or tertiary) difference from ae, to reflect alternate spellings like Cæsar or hæmoglobin.
3. better compatibility with the European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf)
Cons:
1. stability
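
As a rough illustration of what "secondary difference from ae" means, the following Python sketch expands æ into the primary weights of a and e and marks the ligature only at the secondary level. The weights, the `collation_elements` and `sort_key` helpers, and the sample words are all assumptions for illustration; case (the tertiary level) is ignored to keep the sketch short.

```python
import string

# Invented primary weights: a=1 ... z=26. Not UCA values.
PRIMARY = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}


def collation_elements(word):
    """Yield (primary, secondary) pairs; æ expands to the pair a + e."""
    for ch in word.lower():
        if ch == "æ":
            # Same primaries as "ae"; the non-zero secondary on the first
            # element is what distinguishes the ligature.
            yield (PRIMARY["a"], 1)
            yield (PRIMARY["e"], 0)
        else:
            yield (PRIMARY[ch], 0)


def sort_key(word):
    """Return (primary weights, secondary weights) for a word."""
    elements = list(collation_elements(word))
    return (
        tuple(p for p, _ in elements),
        tuple(s for _, s in elements),
    )


words = ["Cafe", "Cæsar", "Cadet", "Caesar"]
ordered = sorted(words, key=sort_key)
print(ordered)

# At primary strength the two spellings are indistinguishable, so a search
# that ignores secondary differences would match either one.
same_primary = sort_key("Cæsar")[0] == sort_key("Caesar")[0]
print(same_primary)
```

In this model Cæsar sorts directly after Caesar and before Cafe, rather than after every word beginning with "Ca" followed by a plain letter.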

For reference, here is an email related to the topic.

> ----- Original Message -----
> From: Åke Persson
> To: Mark Davis
> Sent: Wed, 2003 Dec 31 06:36
> Subject: ae << æ etc.
>
> Mark,
>
> I have browsed the latest ICU collations. Here are a few comments.
>
> The inclusion of ae << æ in several languages resembles my experience when I
> implemented the UCA in Mimer SQL. The next thing that came up was letters with
> stroke. For example, the Polish letter L-stroke, properly used in Polish names,
> did not match a Swedish or English search for names containing L. L-stroke is
> expected to be L with a stroke "accent", except for Polish (and Sorbian).
> <<Lodz.jpg>> is a snapshot from a Swedish encyclopædia (note also "oe"). To make
> a long story short, it all ended up in the European Ordering Rules (EOR)
> concept, where the base letters in the Latin alphabet are only A-Z. The first
> step was to create an EOR-tailoring as the base. Languages, with additional
> letters in their alphabet, were tailored on top of the EOR tailoring. The next
> step was improvement of space and performance, by making EOR the default, and to
> create a tailoring for the default UCA instead (at least needed for the
> conformance test).
>
> Here's an overview of the tailorings:
> http://developer.mimer.com/collations/charts/tailorings.htm
>
> Please, take a closer look at:
> Catalan, Croatian, Faroese, Icelandic, Latvian, Lithuanian, Romanian, and Slovak
> compared to the corresponding ICU collations.
>
> My sources are documented here:
> http://developer.mimer.com/collations/charts/sources.htm
>
> The E-ogonek (old Sami and Icelandic Ä) as a variant of Ä in Faroese, Finnish,
> Greenlandic, Norwegian, and Swedish looks a bit goofy. I would rather expect a
> search match for E in Polish and Lithuanian names containing E-ogonek. I think
> it's better to have a specific locale for Sami.
>
> [before 1] is used extensively in the ICU collations. It's easier to read the
> collation definitions, if [before 1] is used only when necessary.
>
> Happy New Year!
> Åke Persson