L2/04-031R
Re: UCA Revised Latin?
From: Mark Davis
Date: 2004-06-17
We should consider whether to make the following changes in the next version of the UCA.
1. Make alternate forms of letters secondary differences from the 'base' letter. For example, the following would all be primary equivalents and would differ only on the secondary level.
- U+0062 (b) LATIN SMALL LETTER B
- U+0299 (ʙ) LATIN LETTER SMALL CAPITAL B
- U+0180 (ƀ) LATIN SMALL LETTER B WITH STROKE
- U+1D2F (ᴯ) MODIFIER LETTER CAPITAL BARRED B
- U+1D03 (ᴃ) LATIN LETTER SMALL CAPITAL BARRED B
- U+0253 (ɓ) LATIN SMALL LETTER B WITH HOOK
- U+0183 (ƃ) LATIN SMALL LETTER B WITH TOPBAR
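As a rough illustration of what "primary equivalent, secondary different" means, the sketch below builds two-level sort keys from a toy weight table. The numeric weights, the `PROPOSED` table, and the helper names are invented for this example; they are not values from the UCA default table, and only a few of the letters listed above are included.

```python
# Toy two-level collation elements: (primary, secondary).
# All weights are made up for illustration; they are NOT DUCET values.
PROPOSED = {
    "a": (1, 0),
    "b": (2, 0),
    "\u0299": (2, 1),  # small capital b: same primary as b, distinct secondary
    "\u0180": (2, 2),  # b with stroke
    "\u0253": (2, 3),  # b with hook
    "\u0183": (2, 4),  # b with topbar
    "c": (3, 0),
}


def sort_key(text, table=PROPOSED):
    """Return a key that compares all primaries first, then all secondaries."""
    elements = [table[ch] for ch in text]
    primaries = tuple(p for p, _ in elements)
    secondaries = tuple(s for _, s in elements)
    return (primaries, secondaries)


def primary_equal(x, y, table=PROPOSED):
    """True if x and y differ, at most, below the primary level."""
    return sort_key(x, table)[0] == sort_key(y, table)[0]


words = ["\u0253a", "ca", "ba", "\u0180a", "ab"]
ordered = sorted(words, key=sort_key)
```

Under these assumed weights, every b variant sorts between a and c, and a comparison restricted to the primary level treats the variants as equal to plain b.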
Pros:
- If a language does not use those letters, its users would expect them to be ordered as variants of a base letter. For example, a non-Scandinavian user would expect ø to sort as a variant of o, rather than seeing the ordering:
- sos...
- sot...
- sou...
- søs...
- If a language does use those letters, they are very likely tailored to some other position anyway.
- When a tailoring inserts letters, it typically places them immediately after the base letter. Suppose, for example, that a language sorts t as primary-greater than d (that is, directly after d). Without special consideration for the variant forms, what a user would see is:
- sod...
- sot...
- sođ...
Instead of what the user would expect:
- sod...
- sođ...
- sot...
- Better compatibility with the European Ordering Rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf), for letters that are in the repertoire
Cons:
- stability -- not a small con, so we need to consider it carefully!
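The two orderings discussed in the pros for item 1 can be reproduced with a small sketch. The tables `CURRENT` and `PROPOSED`, their weights, and the `tailor_t_after_d` helper are hypothetical stand-ins chosen for this example: the first gives ø and đ their own primary weights, the second makes them secondary variants of o and d. Real tailorings are written as collation rules rather than by editing weights directly.

```python
def make_key(table):
    """Build a sort-key function: primaries compared first, then secondaries."""
    def key(word):
        elements = [table[ch] for ch in word]
        return (tuple(p for p, _ in elements), tuple(s for _, s in elements))
    return key


# Toy (primary, secondary) weights, invented for illustration only.
# CURRENT: the stroke letters have their own primary weights.
CURRENT = {
    "d": (4, 0), "đ": (5, 0),
    "o": (15, 0), "ø": (16, 0),
    "s": (19, 0), "t": (20, 0), "u": (21, 0),
}
# PROPOSED: the stroke letters share the base primary, differ at secondary.
PROPOSED = {**CURRENT, "đ": (4, 1), "ø": (15, 1)}


def tailor_t_after_d(table):
    """Hypothetical tailoring: give t a primary weight directly above d."""
    tailored = dict(table)
    tailored["t"] = (table["d"][0] + 0.5, 0)
    return tailored


o_words = ["søs", "sou", "sot", "sos"]
o_current = sorted(o_words, key=make_key(CURRENT))
o_proposed = sorted(o_words, key=make_key(PROPOSED))

d_words = ["sot", "sođ", "sod"]
d_current = sorted(d_words, key=make_key(tailor_t_after_d(CURRENT)))
d_proposed = sorted(d_words, key=make_key(tailor_t_after_d(PROPOSED)))

print(o_current, o_proposed)
print(d_current, d_proposed)
```

With the assumed weights, the current-style table yields sos, sot, sou, søs and (after tailoring) sod, sot, sođ, while the proposed-style table yields sos, søs, sot, sou and sod, sođ, sot.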
2. Make "æ" be a secondary difference from "ae".
Pros:
- consistency with the handling of "œ"
- currently every language written in the Latin script has to tailor this character. Certain Scandinavian languages tailor it as a letter after z. All other languages would tailor it as a secondary (or tertiary) difference from ae, to reflect alternate spellings like Cæsar or hæmoglobin.
- better compatibility with the European Ordering Rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf)
Cons:
- stability
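One way to picture item 2 is to model "æ" as an expansion into the collation elements of "a" and "e", with a secondary mark on the first element. The sketch below does that with toy weights. Treating the change as an expansion is an assumption of this example, as are the `CURRENT` and `PROPOSED` tables and their weights; none of the values come from the UCA data.

```python
import string

# Toy primaries a=1 ... z=26, each letter mapping to a list of
# (primary, secondary) collation elements. Invented for illustration.
BASE = {ch: [(i, 0)] for i, ch in enumerate(string.ascii_lowercase, start=1)}

# CURRENT: æ has its own primary weight, between a and b.
CURRENT = {**BASE, "æ": [(1.5, 0)]}
# PROPOSED: æ expands to a + e, with a secondary difference on the first element.
PROPOSED = {**BASE, "æ": [(1, 1), (5, 0)]}


def make_key(table):
    """Build a sort-key function: primaries compared first, then secondaries."""
    def key(word):
        elements = [ce for ch in word for ce in table[ch]]
        return (tuple(p for p, _ in elements), tuple(s for _, s in elements))
    return key


words = ["cat", "cæsar", "cab", "caesar"]
current_order = sorted(words, key=make_key(CURRENT))
proposed_order = sorted(words, key=make_key(PROPOSED))

print(current_order)
print(proposed_order)
```

Under the current-style table "cæsar" lands after "cat"; under the proposed-style table it sorts directly after "caesar", which is the behavior that alternate spellings like Cæsar would call for.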
For reference, here is an email related to the topic.
> ----- Original Message -----
> From: Åke Persson
> To: Mark Davis
> Sent: Wed, 2003 Dec 31 06:36
> Subject: ae << æ etc.
>
> Mark,
>
> I have browsed the latest ICU collations. Here are a few comments.
>
> The inclusion of ae << æ in several languages resembles my experience when I
> implemented the UCA in Mimer SQL. The next thing that came up was letters with
> stroke. For example, the Polish letter L-stroke, properly used in Polish names,
> did not match a Swedish or English search for names containing L. L-stroke is
> expected to be L with a stroke "accent", except for Polish (and Sorbian).
> <<Lodz.jpg>> is a snapshot from a Swedish encyclopædia (note also "oe"). To make
> a long story short, it all ended up in the European Ordering Rules (EOR)
> concept, where the base letters in the Latin alphabet are only A-Z. The first
> step was to create an EOR-tailoring as the base. Languages with additional
> letters in their alphabets were tailored on top of the EOR tailoring. The next
> step was improvement of space and performance, by making EOR the default, and to
> create a tailoring for the default UCA instead (at least needed for the
> conformance test).
>
> Here's an overview of the tailorings:
> http://developer.mimer.com/collations/charts/tailorings.htm
>
> Please, take a closer look at:
> Catalan, Croatian, Faroese, Icelandic, Latvian, Lithuanian, Romanian, and Slovak
> compared to the corresponding ICU collations.
>
> My sources are documented here:
> http://developer.mimer.com/collations/charts/sources.htm
>
> The E-ogonek (old Sami and Icelandic Ä) as a variant of Ä in Faroese, Finnish,
> Greenlandic, Norwegian, and Swedish looks a bit goofy. I would rather expect a
> search match for E in Polish and Lithuanian names containing E-ogonek. I think
> it's better to have a specific locale for Sami.
>
> [before 1] is used extensively in the ICU collations. It's easier to read the
> collation definitions, if [before 1] is used only when necessary.
>
> Happy New Year!
> Åke Persson
Here are the Latin primary-different characters, for comparison (all non-primary-different characters have been suppressed in this list).
- U+0061 (a) LATIN SMALL LETTER A
- U+1D00 (ᴀ) LATIN LETTER SMALL CAPITAL A
- U+00E6 (æ) LATIN SMALL LETTER AE
- U+1D01 (ᴁ) LATIN LETTER SMALL CAPITAL AE
- U+1D02 (ᴂ) LATIN SMALL LETTER TURNED AE
- U+0250 (ɐ) LATIN SMALL LETTER TURNED A
- U+0251 (ɑ) LATIN SMALL LETTER ALPHA
- U+0252 (ɒ) LATIN SMALL LETTER TURNED ALPHA
- U+0062 (b) LATIN SMALL LETTER B
- U+0299 (ʙ) LATIN LETTER SMALL CAPITAL B
- U+0180 (ƀ) LATIN SMALL LETTER B WITH STROKE
- U+1D2F (ᴯ) MODIFIER LETTER CAPITAL BARRED B
- U+1D03 (ᴃ) LATIN LETTER SMALL CAPITAL BARRED B
- U+0253 (ɓ) LATIN SMALL LETTER B WITH HOOK
- U+0183 (ƃ) LATIN SMALL LETTER B WITH TOPBAR
- U+0063 (c) LATIN SMALL LETTER C
- U+1D04 (ᴄ) LATIN LETTER SMALL CAPITAL C
- U+0188 (ƈ) LATIN SMALL LETTER C WITH HOOK
- U+0255 (ɕ) LATIN SMALL LETTER C WITH CURL
- U+0064 (d) LATIN SMALL LETTER D
- U+1D05 (ᴅ) LATIN LETTER SMALL CAPITAL D
- U+0111 (đ) LATIN SMALL LETTER D WITH STROKE
- U+0256 (ɖ) LATIN SMALL LETTER D WITH TAIL
- U+0257 (ɗ) LATIN SMALL LETTER D WITH HOOK
- U+018C (ƌ) LATIN SMALL LETTER D WITH TOPBAR
- U+0221 (ȡ) LATIN SMALL LETTER D WITH CURL
- U+00F0 (ð) LATIN SMALL LETTER ETH
- U+1D06 (ᴆ) LATIN LETTER SMALL CAPITAL ETH
- U+0065 (e) LATIN SMALL LETTER E
- U+1D07 (ᴇ) LATIN LETTER SMALL CAPITAL E
- U+01DD (ǝ) LATIN SMALL LETTER TURNED E
- U+0259 (ə) LATIN SMALL LETTER SCHWA
- U+025B (ɛ) LATIN SMALL LETTER OPEN E
- U+0258 (ɘ) LATIN SMALL LETTER REVERSED E
- U+025A (ɚ) LATIN SMALL LETTER SCHWA WITH HOOK
- U+025C (ɜ) LATIN SMALL LETTER REVERSED OPEN E
- U+1D08 (ᴈ) LATIN SMALL LETTER TURNED OPEN E
- U+025D (ɝ) LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
- U+025E (ɞ) LATIN SMALL LETTER CLOSED REVERSED OPEN E
- U+029A (ʚ) LATIN SMALL LETTER CLOSED OPEN E
- U+0264 (ɤ) LATIN SMALL LETTER RAMS HORN
- U+0066 (f) LATIN SMALL LETTER F
- U+0192 (ƒ) LATIN SMALL LETTER F WITH HOOK
- U+0067 (g) LATIN SMALL LETTER G
- U+0261 (ɡ) LATIN SMALL LETTER SCRIPT G
- U+0262 (ɢ) LATIN LETTER SMALL CAPITAL G
- U+01E5 (ǥ) LATIN SMALL LETTER G WITH STROKE
- U+0260 (ɠ) LATIN SMALL LETTER G WITH HOOK
- U+029B (ʛ) LATIN LETTER SMALL CAPITAL G WITH HOOK
- U+0263 (ɣ) LATIN SMALL LETTER GAMMA
- U+01A3 (ƣ) LATIN SMALL LETTER OI
- U+0068 (h) LATIN SMALL LETTER H
- U+029C (ʜ) LATIN LETTER SMALL CAPITAL H
- U+0195 (ƕ) LATIN SMALL LETTER HV
- U+0127 (ħ) LATIN SMALL LETTER H WITH STROKE
- U+0266 (ɦ) LATIN SMALL LETTER H WITH HOOK
- U+0267 (ɧ) LATIN SMALL LETTER HENG WITH HOOK
- U+02BB (ʻ) MODIFIER LETTER TURNED COMMA
- U+02BD (ʽ) MODIFIER LETTER REVERSED COMMA
- U+0069 (i) LATIN SMALL LETTER I
- U+0131 (ı) LATIN SMALL LETTER DOTLESS I
- U+026A (ɪ) LATIN LETTER SMALL CAPITAL I
- U+1D09 (ᴉ) LATIN SMALL LETTER TURNED I
- U+0268 (ɨ) LATIN SMALL LETTER I WITH STROKE
- U+0269 (ɩ) LATIN SMALL LETTER IOTA
- U+006A (j) LATIN SMALL LETTER J
- U+1D0A (ᴊ) LATIN LETTER SMALL CAPITAL J
- U+029D (ʝ) LATIN SMALL LETTER J WITH CROSSED-TAIL
- U+025F (ɟ) LATIN SMALL LETTER DOTLESS J WITH STROKE
- U+0284 (ʄ) LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
- U+006B (k) LATIN SMALL LETTER K
- U+1D0B (ᴋ) LATIN LETTER SMALL CAPITAL K
- U+0199 (ƙ) LATIN SMALL LETTER K WITH HOOK
- U+029E (ʞ) LATIN SMALL LETTER TURNED K
- U+006C (l) LATIN SMALL LETTER L
- U+029F (ʟ) LATIN LETTER SMALL CAPITAL L
- U+0142 (ł) LATIN SMALL LETTER L WITH STROKE
- U+1D0C (ᴌ) LATIN LETTER SMALL CAPITAL L WITH STROKE
- U+019A (ƚ) LATIN SMALL LETTER L WITH BAR
- U+026B (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
- U+026C (ɬ) LATIN SMALL LETTER L WITH BELT
- U+026D (ɭ) LATIN SMALL LETTER L WITH RETROFLEX HOOK
- U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL
- U+026E (ɮ) LATIN SMALL LETTER LEZH
- U+019B (ƛ) LATIN SMALL LETTER LAMBDA WITH STROKE
- U+028E (ʎ) LATIN SMALL LETTER TURNED Y
- U+006D (m) LATIN SMALL LETTER M
- U+1D0D (ᴍ) LATIN LETTER SMALL CAPITAL M
- U+0271 (ɱ) LATIN SMALL LETTER M WITH HOOK
- U+006E (n) LATIN SMALL LETTER N
- U+0274 (ɴ) LATIN LETTER SMALL CAPITAL N
- U+1D3B (ᴻ) MODIFIER LETTER CAPITAL REVERSED N
- U+1D0E (ᴎ) LATIN LETTER SMALL CAPITAL REVERSED N
- U+0272 (ɲ) LATIN SMALL LETTER N WITH LEFT HOOK
- U+019E (ƞ) LATIN SMALL LETTER N WITH LONG RIGHT LEG
- U+0273 (ɳ) LATIN SMALL LETTER N WITH RETROFLEX HOOK
- U+0235 (ȵ) LATIN SMALL LETTER N WITH CURL
- U+014B (ŋ) LATIN SMALL LETTER ENG
- U+006F (o) LATIN SMALL LETTER O
- U+1D0F (ᴏ) LATIN LETTER SMALL CAPITAL O
- U+1D11 (ᴑ) LATIN SMALL LETTER SIDEWAYS O
- U+0276 (ɶ) LATIN LETTER SMALL CAPITAL OE
- U+1D14 (ᴔ) LATIN SMALL LETTER TURNED OE
- U+00F8 (ø) LATIN SMALL LETTER O WITH STROKE
- U+1D13 (ᴓ) LATIN SMALL LETTER SIDEWAYS O WITH STROKE
- U+0254 (ɔ) LATIN SMALL LETTER OPEN O
- U+1D10 (ᴐ) LATIN LETTER SMALL CAPITAL OPEN O
- U+1D12 (ᴒ) LATIN SMALL LETTER SIDEWAYS OPEN O
- U+1D16 (ᴖ) LATIN SMALL LETTER TOP HALF O
- U+1D17 (ᴗ) LATIN SMALL LETTER BOTTOM HALF O
- U+0275 (ɵ) LATIN SMALL LETTER BARRED O
- U+0277 (ɷ) LATIN SMALL LETTER CLOSED OMEGA
- U+0223 (ȣ) LATIN SMALL LETTER OU
- U+1D15 (ᴕ) LATIN LETTER SMALL CAPITAL OU
- U+0070 (p) LATIN SMALL LETTER P
- U+1D18 (ᴘ) LATIN LETTER SMALL CAPITAL P
- U+01A5 (ƥ) LATIN SMALL LETTER P WITH HOOK
- U+0278 (ɸ) LATIN SMALL LETTER PHI
- U+0071 (q) LATIN SMALL LETTER Q
- U+02A0 (ʠ) LATIN SMALL LETTER Q WITH HOOK
- U+0138 (ĸ) LATIN SMALL LETTER KRA
- U+0072 (r) LATIN SMALL LETTER R
- U+0280 (ʀ) LATIN LETTER SMALL CAPITAL R
- U+1D19 (ᴙ) LATIN LETTER SMALL CAPITAL REVERSED R
- U+0279 (ɹ) LATIN SMALL LETTER TURNED R
- U+1D1A (ᴚ) LATIN LETTER SMALL CAPITAL TURNED R
- U+027A (ɺ) LATIN SMALL LETTER TURNED R WITH LONG LEG
- U+027B (ɻ) LATIN SMALL LETTER TURNED R WITH HOOK
- U+027C (ɼ) LATIN SMALL LETTER R WITH LONG LEG
- U+027D (ɽ) LATIN SMALL LETTER R WITH TAIL
- U+027E (ɾ) LATIN SMALL LETTER R WITH FISHHOOK
- U+027F (ɿ) LATIN SMALL LETTER REVERSED R WITH FISHHOOK
- U+0281 (ʁ) LATIN LETTER SMALL CAPITAL INVERTED R
- U+0073 (s) LATIN SMALL LETTER S
- U+0282 (ʂ) LATIN SMALL LETTER S WITH HOOK
- U+0283 (ʃ) LATIN SMALL LETTER ESH
- U+01AA (ƪ) LATIN LETTER REVERSED ESH LOOP
- U+0285 (ʅ) LATIN SMALL LETTER SQUAT REVERSED ESH
- U+0286 (ʆ) LATIN SMALL LETTER ESH WITH CURL
- U+0074 (t) LATIN SMALL LETTER T
- U+1D1B (ᴛ) LATIN LETTER SMALL CAPITAL T
- U+0167 (ŧ) LATIN SMALL LETTER T WITH STROKE
- U+01AB (ƫ) LATIN SMALL LETTER T WITH PALATAL HOOK
- U+01AD (ƭ) LATIN SMALL LETTER T WITH HOOK
- U+0288 (ʈ) LATIN SMALL LETTER T WITH RETROFLEX HOOK
- U+0236 (ȶ) LATIN SMALL LETTER T WITH CURL
- U+0287 (ʇ) LATIN SMALL LETTER TURNED T
- U+0075 (u) LATIN SMALL LETTER U
- U+1D1C (ᴜ) LATIN LETTER SMALL CAPITAL U
- U+1D1D (ᴝ) LATIN SMALL LETTER SIDEWAYS U
- U+1D1E (ᴞ) LATIN SMALL LETTER SIDEWAYS DIAERESIZED U
- U+1D6B (ᵫ) LATIN SMALL LETTER UE
- U+0289 (ʉ) LATIN SMALL LETTER U BAR
- U+0265 (ɥ) LATIN SMALL LETTER TURNED H
- U+02AE (ʮ) LATIN SMALL LETTER TURNED H WITH FISHHOOK
- U+02AF (ʯ) LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
- U+026F (ɯ) LATIN SMALL LETTER TURNED M
- U+1D1F (ᴟ) LATIN SMALL LETTER SIDEWAYS TURNED M
- U+0270 (ɰ) LATIN SMALL LETTER TURNED M WITH LONG LEG
- U+028A (ʊ) LATIN SMALL LETTER UPSILON
- U+0076 (v) LATIN SMALL LETTER V
- U+1D20 (ᴠ) LATIN LETTER SMALL CAPITAL V
- U+028B (ʋ) LATIN SMALL LETTER V WITH HOOK
- U+028C (ʌ) LATIN SMALL LETTER TURNED V
- U+0077 (w) LATIN SMALL LETTER W
- U+1D21 (ᴡ) LATIN LETTER SMALL CAPITAL W
- U+028D (ʍ) LATIN SMALL LETTER TURNED W
- U+0078 (x) LATIN SMALL LETTER X
- U+0079 (y) LATIN SMALL LETTER Y
- U+028F (ʏ) LATIN LETTER SMALL CAPITAL Y
- U+01B4 (ƴ) LATIN SMALL LETTER Y WITH HOOK
- U+007A (z) LATIN SMALL LETTER Z
- U+1D22 (ᴢ) LATIN LETTER SMALL CAPITAL Z
- U+01B6 (ƶ) LATIN SMALL LETTER Z WITH STROKE
- U+0225 (ȥ) LATIN SMALL LETTER Z WITH HOOK
- U+0290 (ʐ) LATIN SMALL LETTER Z WITH RETROFLEX HOOK
- U+0291 (ʑ) LATIN SMALL LETTER Z WITH CURL
- U+0292 (ʒ) LATIN SMALL LETTER EZH
- U+1D23 (ᴣ) LATIN LETTER SMALL CAPITAL EZH
- U+01B9 (ƹ) LATIN SMALL LETTER EZH REVERSED
- U+01BA (ƺ) LATIN SMALL LETTER EZH WITH TAIL
- U+0293 (ʓ) LATIN SMALL LETTER EZH WITH CURL
- U+021D (ȝ) LATIN SMALL LETTER YOGH
- U+00FE (þ) LATIN SMALL LETTER THORN
- U+01BF (ƿ) LATIN LETTER WYNN
- U+01BB (ƻ) LATIN LETTER TWO WITH STROKE
- U+01A8 (ƨ) LATIN SMALL LETTER TONE TWO
- U+01BD (ƽ) LATIN SMALL LETTER TONE FIVE
- U+0185 (ƅ) LATIN SMALL LETTER TONE SIX
- U+0294 (ʔ) LATIN LETTER GLOTTAL STOP
- U+02C0 (ˀ) MODIFIER LETTER GLOTTAL STOP
- U+02BC (ʼ) MODIFIER LETTER APOSTROPHE
- U+02EE (ˮ) MODIFIER LETTER DOUBLE APOSTROPHE
- U+02BE (ʾ) MODIFIER LETTER RIGHT HALF RING
- U+0295 (ʕ) LATIN LETTER PHARYNGEAL VOICED FRICATIVE
- U+02BF (ʿ) MODIFIER LETTER LEFT HALF RING
- U+02C1 (ˁ) MODIFIER LETTER REVERSED GLOTTAL STOP
- U+1D24 (ᴤ) LATIN LETTER VOICED LARYNGEAL SPIRANT
- U+1D25 (ᴥ) LATIN LETTER AIN
- U+02A1 (ʡ) LATIN LETTER GLOTTAL STOP WITH STROKE
- U+02A2 (ʢ) LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
- U+0296 (ʖ) LATIN LETTER INVERTED GLOTTAL STOP
- U+01C0 (ǀ) LATIN LETTER DENTAL CLICK
- U+01C1 (ǁ) LATIN LETTER LATERAL CLICK
- U+01C2 (ǂ) LATIN LETTER ALVEOLAR CLICK
- U+01C3 (ǃ) LATIN LETTER RETROFLEX CLICK
- U+0297 (ʗ) LATIN LETTER STRETCHED C
- U+0298 (ʘ) LATIN LETTER BILABIAL CLICK
- U+02AC (ʬ) LATIN LETTER BILABIAL PERCUSSIVE
- U+02AD (ʭ) LATIN LETTER BIDENTAL PERCUSSIVE