Re: Unicode Collation Algorithm

From: Åke Persson (ake.persson@mimer.se)
Date: Fri May 16 2008 - 01:52:14 CDT

Next message: Daniel Ehrenberg: "Re: Unicode Collation Algorithm"

Previous message: Erkki I. Kolehmainen: "RE: Exemplifying apostrophes"
In reply to: Daniel Ehrenberg: "Unicode Collation Algorithm"
Next in thread: Daniel Ehrenberg: "Re: Unicode Collation Algorithm"
Reply: Daniel Ehrenberg: "Re: Unicode Collation Algorithm"

Daniel Ehrenberg wrote:

> I'm trying to implement the Unicode Collation Algorithm, and I'm a
> little confused by line 36099 of CollationTest_SHIFTED.txt. It is:
>
> 006C 00B7 0021; # (l·) LATIN SMALL LETTER L, MIDDLE DOT [1262 | 0020
> 01AF | 0002 0002 | FFFF FFFF 0258]
>
> Here are the collation keys for the characters that it uses:
>
> 006C ; [.1262.0020.0002.006C] # LATIN SMALL LETTER L
> 00B7 ; [*0279.0020.0002.00B7] # MIDDLE DOT
> 0021 ; [*0258.0020.0002.0021] # EXCLAMATION MARK
>
> All elements have combining class 0 and the string is already in NFD.
>
> The asterisks indicate that an element is variable-weighted. Why,
> then, in the key given, is U+00B7 treated as if it is not
> variable-weighted? I'm treating variable weighted elements as shifted,
> not non-ignorable, and as far as I can tell there's no way for a
> variable-weighted element to not get shifted based on the context. So,
> by my calculations, the actual collation key should be [1262 | 0020 |
> 0002 | FFFF 0279 0258]. This would make it precede the previous line
> in sort order. Could somebody help me figure this out?

The following lines from allkeys-5.1.0.txt explain it:

0140 ; [.24C4.0020.0002.0140][.0000.01AF.0002.0140] # LATIN SMALL LETTER L WITH MIDDLE DOT; QQKL

006C 00B7 ; [.24C4.0020.0002.0140][.0000.01AF.0002.0140] # LATIN SMALL LETTER L WITH MIDDLE DOT

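The second line is a contraction: the sequence 006C 00B7 is mapped as a unit, so U+00B7 is never looked up on its own and its variable weight 0279 is never used. Surprisingly, U+00B7 is therefore always treated as an accent when preceded by the letter L, a behaviour that belongs in a tailoring for Catalan.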
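To make the effect concrete, here is a minimal Python sketch of shifted key building with longest-match lookup. It is only an illustration under stated assumptions: the names TABLE, CONTRACTION, collation_elements and shifted_key are made up for this example, only contiguous sequences are matched, and the weights are copied from the lines quoted in this message. The contraction's primary weight is 24C4 as quoted above, while the test file line shows 1262, so only the shape of the resulting key should be compared with the test file.

```python
# Collation elements as (variable, primary, secondary, tertiary).
# Single-character entries as quoted from Daniel's message.
TABLE = {
    (0x006C,): [(False, 0x1262, 0x0020, 0x0002)],  # LATIN SMALL LETTER L
    (0x00B7,): [(True, 0x0279, 0x0020, 0x0002)],   # MIDDLE DOT (variable)
    (0x0021,): [(True, 0x0258, 0x0020, 0x0002)],   # EXCLAMATION MARK (variable)
}

# Contraction as quoted from allkeys-5.1.0.txt in this message.
CONTRACTION = {
    (0x006C, 0x00B7): [(False, 0x24C4, 0x0020, 0x0002),
                       (False, 0x0000, 0x01AF, 0x0002)],
}


def collation_elements(cps, table):
    """Longest-match lookup of contiguous sequences (simplified)."""
    longest = max(len(k) for k in table)
    out, i = [], 0
    while i < len(cps):
        for n in range(min(longest, len(cps) - i), 0, -1):
            key = tuple(cps[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            raise KeyError("no entry for U+%04X" % cps[i])
    return out


def shifted_key(cps, table):
    """Return the four key levels using the 'shifted' variable option."""
    levels = [[], [], [], []]
    after_variable = False
    for variable, p, s, t in collation_elements(cps, table):
        if variable:
            # Variable element: levels 1-3 ignored, old primary moves to level 4.
            p, s, t, q = 0, 0, 0, p
            after_variable = True
        elif p == 0 and s == 0 and t == 0:
            q = 0  # completely ignorable
        elif p == 0 and after_variable:
            p, s, t, q = 0, 0, 0, 0  # ignorable following a variable
        elif p == 0:
            q = 0xFFFF  # ignorable not following a variable
        else:
            q = 0xFFFF  # ordinary element
            after_variable = False
        for level, weight in zip(levels, (p, s, t, q)):
            if weight:
                level.append(weight)
    return levels


def show(levels):
    return " | ".join(" ".join("%04X" % w for w in lvl) for lvl in levels)


text = [0x006C, 0x00B7, 0x0021]
print(show(shifted_key(text, TABLE)))
print(show(shifted_key(text, {**TABLE, **CONTRACTION})))
```

Without the contraction the middle dot is shifted and its weight 0279 lands in the fourth level, which is the key Daniel calculated. With the contraction in the table, the sequence contributes a second secondary weight 01AF instead, and the fourth level gets FFFF FFFF 0258, the same shape as in the test file.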

I would have expected something like the following line from allkeys-4.0.0.txt, where the middle dot is still a variable element:

0140 ; [.1E5C.0020.0004.0140][*0167.0020.0004.0140] # LATIN SMALL LETTER L WITH MIDDLE DOT; QQKN
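If you want to compare such entries between table versions yourself, a small parser is enough. The following is a rough sketch, assuming the line layout shown in this message (code points, a semicolon, bracketed collation elements, then a comment); parse_allkeys_line is a name invented for this example and it ignores everything else an allkeys file may contain.

```python
import re

# One collation element: '[', '.' or '*' marker, then dot-separated hex weights.
ELEMENT = re.compile(
    r"\[([.*])([0-9A-F]{4,6})\.([0-9A-F]{4,6})\.([0-9A-F]{4,6})"
    r"(?:\.([0-9A-F]{4,6}))?\]"
)


def parse_allkeys_line(line):
    """Return (code points, [(variable, primary, secondary, tertiary), ...])."""
    entry = line.split("#", 1)[0]            # drop the trailing comment
    chars, _, elements = entry.partition(";")
    cps = tuple(int(c, 16) for c in chars.split())
    ces = [
        (m.group(1) == "*",                  # '*' marks a variable element
         int(m.group(2), 16), int(m.group(3), 16), int(m.group(4), 16))
        for m in ELEMENT.finditer(elements)
    ]
    return cps, ces


old = parse_allkeys_line(
    "0140 ; [.1E5C.0020.0004.0140][*0167.0020.0004.0140] "
    "# LATIN SMALL LETTER L WITH MIDDLE DOT; QQKN")
new = parse_allkeys_line(
    "006C 00B7 ; [.24C4.0020.0002.0140][.0000.01AF.0002.0140] "
    "# LATIN SMALL LETTER L WITH MIDDLE DOT")

# True where an element is variable: the 4.0.0 entry has one, the 5.1.0 entry none.
print([ce[0] for ce in old[1]], [ce[0] for ce in new[1]])
```

The variable flag of the second element is the whole difference: in the 4.0.0 line the middle dot part is marked with an asterisk and would be shifted, in the 5.1.0 line it is an ordinary secondary weight.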

Kind regards,

Åke Persson

This archive was generated by hypermail 2.1.5 : Fri May 16 2008 - 01:55:17 CDT