Contiguous Weight Ranges and Ignorables

From: Jesse Hallam (unicode.org@fentrax.com)
Date: Thu Apr 17 2008 - 13:17:21 CDT

Next message: vunzndi@vfemail.net: "Re: Using combining diacritical marks and non-zero joiners in a name"

Previous message: Andreas Prilop: "Re: Using combining diacritical marks and non-zero joiners in a name"
Next in thread: Kenneth Whistler: "Re: Contiguous Weight Ranges and Ignorables"
Maybe reply: Kenneth Whistler: "Re: Contiguous Weight Ranges and Ignorables"

[I accidentally sent this message after subscribing but before confirming
the subscription, so I do not know whether it arrived. I apologize if it is
received in duplicate.]

Good day,

I am implementing the UCA and am attempting to employ the table reduction
technique known in the UCA as "Contiguous Weight Ranges". In the description
of that technique, we read the following:

*Whenever collation elements have different primary weights, the ordering of
their secondary weights is immaterial.*

I clearly see how this applies to collation elements with different
non-zero primary weights. How can this statement hold true for primary
ignorables?

For example, consider lines 27167 and 27168 of CollationTest_NON_IGNORABLE.txt:

*1E00 0334; # () LATIN CAPITAL LETTER A WITH RING BELOW [0FD0 | 0020
008C 0080 | 0008 0002 0002 |]*
*0332 0061; # () COMBINING LOW LINE [0FD0 | 0021 0020 | 0002 0002 |]*

After normalization, we are comparing the code points:

*<41><334><325>*
*<332><61>*
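
As a sanity check, the normalization step can be reproduced with Python's
standard unicodedata module. This is only an illustrative sketch; the helper
name nfd_code_points is my own and does not come from the UCA or its test
files.

```python
import unicodedata

def nfd_code_points(text):
    """Return the NFD form of text as a list of code point integers."""
    return [ord(ch) for ch in unicodedata.normalize("NFD", text)]

# The two test strings from CollationTest_NON_IGNORABLE.txt quoted above.
first = nfd_code_points("\u1E00\u0334")   # A WITH RING BELOW + TILDE OVERLAY
second = nfd_code_points("\u0332\u0061")  # COMBINING LOW LINE + a

print(" ".join(f"{cp:04X}" for cp in first))   # 0041 0334 0325
print(" ".join(f"{cp:04X}" for cp in second))  # 0332 0061
```

U+0334 ends up before U+0325 because canonical reordering sorts combining
marks by combining class, and U+0334 has the lower class.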

These compare equal on the primary level (since only <41> and <61> have
primary weights, both of which are equal). The comparison then proceeds to
the secondary level, where the first weights compared are those of <41>
(0020) and <332> (0021). Although <41> and <332> have different primary
weights (<332> is, of course, a primary ignorable), the ordering of their
secondary weights is nevertheless critical. Were my implementation of the
UCA to re-weight each secondary level according to the "Contiguous Weight
Ranges" technique, I might well obtain an incorrect collation result in
this example.
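
To make the concern concrete, here is a minimal Python sketch. The weights
are read off the two sort keys quoted above; which secondary weight belongs
to which combining mark is my inference from their order in the key. The
renumbering map is purely hypothetical: it is one way a naive implementation
might restart the secondary weights of primary ignorables at 0002, not
something the UCA prescribes.

```python
# Collation elements as (primary, secondary, tertiary). Values are taken
# from the quoted sort keys; the per-character assignment is an assumption.
CE = {
    0x0041: (0x0FD0, 0x0020, 0x0008),  # A
    0x0061: (0x0FD0, 0x0020, 0x0002),  # a
    0x0334: (0x0000, 0x008C, 0x0002),  # primary ignorable
    0x0325: (0x0000, 0x0080, 0x0002),  # primary ignorable
    0x0332: (0x0000, 0x0021, 0x0002),  # primary ignorable
}

# Hypothetical renumbering of the ignorables' secondary weights, done
# independently of the non-ignorables and starting at 0002.
RENUMBERED_IGNORABLE_SECONDARY = {0x0021: 0x0002, 0x0080: 0x0003, 0x008C: 0x0004}

def renumber(table):
    """Apply the hypothetical renumbering to primary ignorables only."""
    out = {}
    for cp, (p, s, t) in table.items():
        if p == 0:
            s = RENUMBERED_IGNORABLE_SECONDARY[s]
        out[cp] = (p, s, t)
    return out

def sort_key(code_points, table):
    """Build a three-level key, dropping zero weights at each level."""
    ces = [table[cp] for cp in code_points]
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)
        for level in range(3)
    )

FIRST = [0x0041, 0x0334, 0x0325]
SECOND = [0x0332, 0x0061]

original_order = sort_key(FIRST, CE) < sort_key(SECOND, CE)
renumbered = renumber(CE)
renumbered_order = sort_key(FIRST, renumbered) < sort_key(SECOND, renumbered)

print(original_order, renumbered_order)  # True False
```

With the original weights the first string sorts before the second, as in
the test file. Under the hypothetical renumbering, 0020 is compared against
0002 at the secondary level and the order flips.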

I'm certain I am simply missing something in the language of the UCA. For
one, I note that the example given in the UCA for this technique renumbers
the secondary weights for the letter 'O', restricting the lower bound to the
initial lower bound of 0020; I see nothing in the language that would
prevent me from starting that lower bound lower, perhaps at 0002, yet for
some reason, this was not done.

Also, under "3.1.4 Default Values", we read:

*Both in the Default Unicode Collation Element Table and in typical
tailorings, most unaccented letters differ in the primary weights, but have
secondary weights (such as a1) equal to MIN2. The primary ignorables
will have secondary weights greater than MIN2.*

Why primary ignorables will have weights greater than MIN2 is not
specified, but perhaps this is a hint to implementors such as myself. Does
it relate to the above issue? I'm not certain.

Any insight or clarification into the above matter would be greatly
appreciated!

-- 
Jesse Hallam
University of Waterloo Junior
"For scarcely for a righteous man will one die: yet peradventure for a good
man some would even dare to die. But God commendeth his love toward us, in
that, *while we were yet sinners*, Christ died for us. " (Romans 5:7, 8)

This archive was generated by hypermail 2.1.5 : Thu Apr 17 2008 - 13:27:01 CDT