Re: UCA and Russian letter Ё

From: Leo Broukhis <leob_at_mailcom.com>
Date: Fri, 21 Dec 2012 08:57:11 -0800

On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli
<xn--mlform-iua_at_xn--mlform-iua.no> wrote:
>
> You say that the difference is primary in the beginning of a word but
> elsewhere secondary. And yes, that orthographic dictionary that you
> link to above, looks as you describe.
>
> However, in reality, the difference is secondary - if that is the right
> word - even as the first letter in a word. Wikipedia has the following
> example: едок > ёж > ездит.[1] And, for instance the word ёлка could
> also be written елка.

> [1] <http://en.wikipedia.org/wiki/%d0%81#Russian>

Wikipedia's example is sadly unsourced, unlike mine.

> Hence I would argue that the dictionary you linked to above considers
> the difference to *always* be secondary. It is just that the dictionary
> applies the sorting algorithm to a collection where the words that
> begin with the letter Ё have been separated from the words that begin
> with the letter Е.

Isn't that notionally the same as having the difference primary for
the first letter?

>> A cursory scan of the UCA doesn't reveal if that's implementable, and
>> experiments in a fairly fresh Linux Mint yield either
>> ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
>> the LANG setting (en_US works better than ru_RU).
>
> (Both examples consider the difference primary, but the last
> example is incorrect, as ёлка follows тёлка, which is wrong from
> every angle except that of the letter's code point number inside
> Unicode.)

Right. And, ironically, the [en] collation is the correct one.
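
As a rough illustration (a sketch added for clarity, not the original
experiment), plain code point order reproduces the second ordering
quoted above, because ё (U+0451) sits after я (U+044F) in Unicode.
This only shows that the result is consistent with code point
sorting; it makes no claim about how the ru_RU locale is implemented.

```python
# Sort the four sample words by raw Unicode code point.
words = ["ёлка", "тёлка", "ель", "тель"]

# Python's default string comparison is code point order.
by_code_point = sorted(words)

print(" < ".join(by_code_point))
# ель < тель < тёлка < ёлка

# ё lies outside the contiguous а..я block, after я.
print(hex(ord("я")), hex(ord("ё")))
# 0x44f 0x451
```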

>> Could someone tell if the UCA in its current form is able to support that?
>
> Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always
> distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as
> non-distinct letters except when used as the first letter. (Note that
> the last variant would only yield a correct result on collections of
> words where a first-letter Ё is guaranteed to be rendered with a Ё.
> Thus, if ёlка is written елка, then the result becomes incorrect.)

We're not talking here about *words per se* that may or may not be
rendered with a Ё, we're talking about letter sequences with Ё as a
given. The dictionary order shows that all word-initial Ёs go after
all word-initial Еs, but within a word the difference is secondary.
For a set of letter sequences using canonical spelling of words, the
collation algorithm should give their dictionary ordering, shouldn't
it?
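
To make the rule concrete, here is a minimal sketch of a sort key
that treats a word-initial Ё as a distinct letter (primary
difference) and any later Ё as a variant of Е (secondary
difference). It is plain Python rather than a UCA tailoring, it
assumes the input contains only Russian letters, and the names
ALPHABET, RANK and dictionary_key are made up for this example.

```python
# Russian alphabet with ё in its usual place right after е.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def dictionary_key(word):
    """Two-level key: initial ё is primary, non-initial ё is secondary.

    Assumes `word` consists only of Russian letters (hypothetical helper).
    """
    w = word.lower()
    # Primary level: the first letter keeps its own rank, so initial ё
    # sorts after every initial е; elsewhere ё is folded into е.
    primary = tuple(
        RANK[ch] if i == 0 else RANK["е" if ch == "ё" else ch]
        for i, ch in enumerate(w)
    )
    # Secondary level: used only to break primary ties, е before ё.
    secondary = tuple(1 if ch == "ё" else 0 for ch in w)
    return (primary, secondary)


words = ["ёлка", "тёлка", "ель", "тель", "ёж", "едок", "ездит"]
ordered = sorted(words, key=dictionary_key)
print(" < ".join(ordered))
# едок < ездит < ель < ёж < ёлка < тёлка < тель
```

Under this key тёлка precedes тель, because within a word the pair
is compared as телка vs. тель at the primary level, and все precedes
всё only through the secondary tie-break.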

Re the linguistic PS: you're right, and that proves that an
approximation to the proper collation using secondary ordering is
preferred to an approximation using primary ordering.

Leo
Received on Fri Dec 21 2012 - 10:59:26 CST