PAN-CYRILLIC ORDERING; J.CLEWS; 1997-11-19.

From: John Clews (10646er@sesame.demon.co.uk)
Date: Wed Nov 19 1997 - 08:32:40 EST


Note from John Clews:

For Pan-Cyrillic ordering, after a considerable amount of comparison
of Cyrillic sorting orders, I now agree (except for items 1-3 below)
with the list supplied by Michael Everson (was: SC22WG20 N547 on
Cyrillic ordering) (URL: http://www.indigo.ie/egt/standards/cy/n547.html).

My agreement has only the following exceptions:
1. placement of KOPPA;
2. placement of IOTIFIED LITTLE YUS;
3. the requirement to add three Cyrillic letter pairs to the
   international standards ISO/IEC 10646, ISO/IEC 14651 and ISO 9:
   Cyrillic Q, Cyrillic W and Cyrillic SHTE (Church Slavonic), all of
   them in upper and lower case forms.

Below is Michael Everson's list from his web page earlier this week,
with my usual abbreviation of CYRILLIC SMALL LETTER to Cy_ .

>>>> indicates where I think alternative placements should be;
XXXX indicates where I think these characters should not go.
>>>>>>>> indicates proposed additions to ISO/IEC 10646, 14651, ISO 9.

The justifications for these are embedded in Michael Everson's
chart below, and also in his narrative text below that. In a
separate email, I will show a merged list which will include
expectations from other scripts too.
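
As a reading aid: the entry lines in the list follow a simple pattern
(an optional >>>> or XXXX marker, the quote mark, a <Uxxxx> code point,
a symbolic name, and a % comment), so they can be picked apart
mechanically. The following is only a minimal Python sketch and is not
part of either author's material; the name parse_entry and the shape of
the returned tuple are illustrative assumptions.

```python
import re

# Pattern for one entry line of the list, for example:
#   "> <U0430> <acy8> % Cy_A"
#   ">>>> > <U0481> <koppacy8> % Cy_KOPPA"
ENTRY = re.compile(
    r"^(?P<mark>>>>>|XXXX)?\s*>\s*"   # optional placement marker, then the quote mark
    r"<U(?P<cp>[0-9A-F]{4})>\s+"      # code point written as <Uxxxx>
    r"(?P<sym>\S+)\s+%\s+"            # symbolic name, then the % comment sign
    r"(?P<name>.+)$"                  # character name, possibly abbreviated
)


def parse_entry(line):
    """Return (marker, code point, symbol, name), or None for non-entry lines.

    parse_entry is a hypothetical helper written for this sketch.
    """
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    # Expand the Cy_ abbreviation used in this message.
    name = m.group("name").replace("Cy_", "CYRILLIC SMALL LETTER ", 1)
    return (m.group("mark"), int(m.group("cp"), 16), m.group("sym"), name)
```

Lines that are not entries (rationale text, the >>>>>>>> proposals)
simply yield None, so the whole list can be fed through line by line.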

> <U0430> <acy8> % Cy_A
> <U04D1> <acybreve8> % Cy_A_BREVE
> <U04D3> <acydieresis8> % Cy_A_DIAERESIS
> <U04D5> <aecy8> % CYRILLIC SMALL LIGATURE A IE
> <U0431> <be8> % Cy_BE
> <U0432> <ve8> % Cy_VE
> <U0433> <ge8> % Cy_GHE
> <U0453> <gje8> % Cy_GJE
> <U0493> <gebar8> % Cy_GHE_STROKE
> <U0491> <geupturn8> % Cy_GHE_UPTURN
> <U0495> <gehook8> % Cy_GHE_MIDDLE HOOK
> <U0434> <de8> % Cy_DE
> <U0452> <dje8> % Cy_DJE
> <U0435> <ie8> % Cy_IE
> <U04D7> <iebreve8> % Cy_IE_BREVE
> <U0451> <io8> % Cy_IO
> <U0454> <ecy8> % Cy_UKRAINIAN IE
> <U04D9> <schwacy8> % Cy_SCHWA
> <U04DB> <schwacydieresis8> % Cy_SCHWA_DIAERESIS
> <U04BD> <cheabkhasian8> % Cy_ABKHASIAN CHE
> <U04BF> <cheabkhasiandes8> % Cy_ABKHASIAN CHE_DESCENDER
> <U0436> <zhe8> % Cy_ZHE
> <U04C2> <zhebreve8> % Cy_ZHE_BREVE
> <U04DD> <zhedieresis8> % Cy_ZHE_DIAERESIS
> <U0497> <zhertdes8> % Cy_ZHE_DESCENDER
> <U0455> <dze8> % Cy_DZE
> <U0437> <ze8> % Cy_ZE
> <U04DF> <zedieresis8> % Cy_ZE_DIAERESIS
> <U0499> <zecedilla8> % Cy_ZE_DESCENDER
> <U04E1> <ezhcy8> % Cy_ABKHASIAN DZE
> <U0438> <ii8> % Cy_I
> <U0439> <iibreve8> % Cy_SHORT I
> <U04E5> <iidieresis8> % Cy_I_DIAERESIS
> <U04E3> <iimacron8> % Cy_I_MACRON
> <U0456> <icy8> % Cy_BYELORUSSIAN-UKRAINIAN I
> <U0457> <yi8> % Cy_YI
> <U0458> <je8> % Cy_JE
> <U043A> <ka8> % Cy_KA
> <U045C> <kje8> % Cy_KJE
> <U049F> <kabar8> % Cy_KA_STROKE
> <U049D> <kavertbar8> % Cy_KA_VERTICAL STROKE
> <U049B> <kartdes8> % Cy_KA_DESCENDER
> <U04A1> <kabashkir8> % Cy_BASHKIR KA
> <U04C4> <kahook8> % Cy_KA_HOOK
> <U043B> <el8> % Cy_EL
> <U0459> <lje8> % Cy_LJE
> <U043C> <em8> % Cy_EM
> <U043D> <en8> % Cy_EN
> <U045A> <nje8> % Cy_NJE
> <U04A3> <enrtdes8> % Cy_EN_DESCENDER
> <U04A5> <engcy8> % CYRILLIC SMALL LIGATURE EN GHE
> <U04C8> <enhook8> % Cy_EN_HOOK
> <U043E> <ocy8> % Cy_O
> <U04E7> <ocydieresis8> % Cy_O_DIAERESIS
> <U04E9> <ocybar8> % Cy_BARRED O
> <U04EB> <ocybardieresis8> % Cy_BARRED O_DIAERESIS
> <U04A9> <haabkhasian8> % Cy_ABKHASIAN HA
> <U043F> <pecy8> % Cy_PE
> <U04A7> <pehook8> % Cy_PE_MIDDLE HOOK

>>>> > <U0481> <koppacy8> % Cy_KOPPA

Rationale: P, Q, R expected in Latin;
PE, KOPPA, ER expected in Hebrew, and Armenian.

>>>>>>>> [Cyrillic q - used in Kurdish]

Rationale: placed here rather than at the end of the alphabetic
sequence, in line with P Q R expectations of other scripts.

Cyrillic Q and W were in Johan van Wingen's original proposal, but
according to him were deleted at the insistence of (a) US delegate(s).
Excluding Cyrillic Q and W means that Kurdish has to be written by
mixing two scripts: ISO/IEC 10646 does not require this for any
other language.

> <U0440> <er8> % Cy_ER
> <U0441> <es8> % Cy_ES
> <U04AB> <escedilla8> % Cy_ES_DESCENDER
> <U0442> <te8> % Cy_TE
> <U04AD> <tertdes8> % Cy_TE_DESCENDER
> <U045B> <tshe8> % Cy_TSHE
> <U0443> <ucy8> % Cy_U
> <U045E> <ucybreve8> % Cy_SHORT U
> <U04F1> <ucydieresis8> % Cy_U_DIAERESIS
> <U04F3> <ucydblacute8> % Cy_U_DOUBLE ACUTE
> <U04EF> <ucymacron8> % Cy_U_MACRON
> <U04AF> <ustrt8> % Cy_STRAIGHT U
> <U04B1> <ustrtbar8> % Cy_STRAIGHT U_STROKE
> <U0479> <uk8> % Cy_UK

>>>>>>>> [Cyrillic w - used in Kurdish]

Rationale: placed here rather than at the end of the alphabetic
sequence, in line with U [V] W expectations of other scripts, much as
Abkhazian SCHWA and CHE are placed after Cyrillic E and O.

Cyrillic Q and W were in Johan van Wingen's original proposal, but
according to him were deleted at the insistence of (a) US delegate(s).
Excluding Cyrillic Q and W means that Kurdish has to be written by
mixing two scripts: ISO/IEC 10646 does not require this for any
other language.

> <U0444> <ef8> % Cy_EF
> <U0445> <kha8> % Cy_HA
> <U04B3> <khartdes8> % Cy_HA_DESCENDER
> <U04BB> <hcy8> % Cy_SHHA
> <U0461> <omegacy8> % Cy_OMEGA
> <U047F> "<omegacy8><te8>" % CYRILLIC SMALL LETTER OT
> <U047D> <omegacytitlo8> % Cy_OMEGA_TITLO
> <U047B> <omegacyround8> % Cy_ROUND OMEGA

>>>>>>>> [Cyrillic SHTE (Church Slavonic)]

Rationale: placed here due to the traditional OMEGA, SHTE, TSE order
in Church Slavonic, as here. This is a separate character, and it is
proposed that it is NOT unified with Cyrillic SHCHA. All variants of
SHCHA look like SHA with a right descender; SHTE (Church Slavonic)
always looks like SHA with a central descender.

> <U0446> <tse8> % Cy_TSE
> <U04B5> <ttse8> % CYRILLIC SMALL LIGATURE TE TSE

XXXX > <U0481> <koppacy8> % Cy_KOPPA

> <U0447> <che8> % Cy_CHE
> <U04F5> <chedieresis8> % Cy_CHE_DIAERESIS
> <U04B9> <chevertbar8> % Cy_CHE_VERTICAL STROKE
> <U04B7> <chertdes8> % Cy_CHE_DESCENDER
> <U04CC> <cheleftdes8> % Cy_KHAKASSIAN CHE
> <U045F> <dzhe8> % Cy_DZHE
> <U0448> <sha8> % Cy_SHA
> <U0449> <shcha8> % Cy_SHCHA
> <U044A> <hard8> % Cy_HARD SIGN
> <U044B> <yeri8> % Cy_YERU
> <U04F9> <yeridieresis8> % Cy_YERU_DIAERESIS
> <U044C> <soft8> % Cy_SOFT SIGN
> <U0463> <yat8> % Cy_YAT
> <U044D> <ecyrev8> % Cy_E
> <U044E> <iu8> % Cy_YU
> <U044F> <ia8> % Cy_YA

> <U0465> <eiotified8> % Cy_IOTIFIED E
> <U0467> <yuslittle8> % Cy_LITTLE YUS

XXXX > <U0469> <yuslittleiotified8> % CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS

> <U046B> <yusbig8> % Cy_BIG YUS

>>>> > <U0469> <yuslittleiotified8> % CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS

> <U046D> <yusbigiotified8> % Cy_IOTIFIED BIG YUS

Rationale: the order 0465, 0467, 046B, 0469, 046D is applied
consistently in several sources on Church Slavonic.

See also Disagreement on Iotification below.

> <U046F> <xicy8> % Cy_KSI
> <U0471> <psicy8> % Cy_PSI
> <U0473> <fita8> % Cy_FITA
> <U0475> <izhitsa8> % Cy_IZHITSA
> <U0477> <izhitsadblgrave8> % Cy_IZHITSA_DOUBLE GRAVE ACCENT
> <U04C0> <palochka8> % Cy_PALOCHKA
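
The two placement changes marked >>>> and XXXX above amount to moving
one entry within an ordered list. The following is a minimal Python
sketch under stated assumptions: it works on a short excerpt of the
list only, it leaves out the proposed additions (Cyrillic q, w and
SHTE), and move_after is a hypothetical helper, not anything defined in
N547 or ISO/IEC 14651.

```python
# Excerpt of Michael Everson's order, copied from the list above.
EVERSON_EXCERPT = [
    0x043F,  # PE
    0x04A7,  # PE_MIDDLE HOOK
    0x0440,  # ER
    0x0446,  # TSE
    0x04B5,  # LIGATURE TE TSE
    0x0481,  # KOPPA (position marked XXXX)
    0x0447,  # CHE
    0x0465,  # IOTIFIED E
    0x0467,  # LITTLE YUS
    0x0469,  # IOTIFIED LITTLE YUS (position marked XXXX)
    0x046B,  # BIG YUS
    0x046D,  # IOTIFIED BIG YUS
]


def move_after(order, item, anchor):
    """Return a copy of order with item placed directly after anchor."""
    result = [cp for cp in order if cp != item]
    result.insert(result.index(anchor) + 1, item)
    return result


# KOPPA goes after PE_MIDDLE HOOK (before ER);
# IOTIFIED LITTLE YUS goes after BIG YUS.
clews_excerpt = move_after(EVERSON_EXCERPT, 0x0481, 0x04A7)
clews_excerpt = move_after(clews_excerpt, 0x0469, 0x046B)

# A sort key built from the adjusted order.
rank = {chr(cp): i for i, cp in enumerate(clews_excerpt)}
yus_sorted = sorted("\u046b\u0469\u0467", key=rank.__getitem__)
```

With the adjusted order, the three yus letters in the last line sort as
LITTLE YUS, BIG YUS, IOTIFIED LITTLE YUS.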

Michael Everson's narrative text read:

> UCS generic collation locale -- rationale for Cyrillic ...
> CEN/ISO SC22/WG20 N547
> Date: 1997-11-13...
> SOURCE:
> MICHAEL EVERSON, EGT (IE)
> STATUS:
> EXPERT CONTRIBUTION
> ACTION:
> FOR CONSIDERATION BY SC22/WG20
> DISTRIBUTION: SC22/WG20, UTC
> ___________________________________
> Recently Johan van Wingen posted a number of queries regarding the
> ordering of Cyrillic characters in the current draft of ISO 14651 to
> the SC22/WG20 e-mail reflector... [text quoted from J.W. van Wingen]
> ___________________________________

[Michael Everson continues]

> The order given for the Cyrillic script in ISO 14651 is based on the
> same principles which are used to order the Latin script. The chief
> reason for this is that the Cyrillic script is used for a great many
> languages, each with its own unique ordering. It is impossible to
> reconcile all of these orderings, so a generic ordering is given which
> can be tailored to meet the needs of individual languages. The generic
> ordering does not favour any particular language, but is based on the
> graphic form of the character.
>
> The basic order of the Cyrillic script is taken to be that of the
> prototypical alphabet, Old Church Slavonic...

>
> Although Cyrillic letters are, by convention, considered separate at
> level 1 of the sort, nevertheless, in order to be consistent with the
> ordering of the Latin and Greek scripts in 14651, similar characters
> are ranked at level 1 as though they had accents at level 3. The
> ranking gives us a logical, predictable order, not unlike that used for
> Latin and Greek, and is in accord with what information we have about
> ordering Cyrillic in general. The order of the "accents" given for
> Cyrillic is as follows:
>
> PECULIAR, ACUTE, BREVE, DIAERESIS, DOUBLE ACUTE, MIDDLE TILDE, BAR,
> VERTICAL BAR, DESCENDER, LEFT DESCENDER, MACRON, TOPBAR, VARIANT,
> MIDDLE HOOK, YOTIFIER.

Disagreement on Iotification: YOTIFIER is not an accent. Church
Slavonic always seems to follow the same order for the iotified
characters and their non-iotified cognates, even if that order is
not internally consistent.

This only makes for one character difference in any case.
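
The ranking by "accent" quoted above can be illustrated with a small
sort key: base letter first, then the position of the modifier in the
quoted accent list. This is a minimal Python sketch, not the ISO 14651
table itself. The "NONE" entry for the unmodified letter, and the
accent labels attached to each code point, are assumptions made for the
sketch from the character names in the list.

```python
# Order of the "accents" as quoted from N547 above.
# "NONE" (the unmodified letter) is an assumption added for this sketch.
ACCENTS = [
    "NONE", "PECULIAR", "ACUTE", "BREVE", "DIAERESIS", "DOUBLE ACUTE",
    "MIDDLE TILDE", "BAR", "VERTICAL BAR", "DESCENDER", "LEFT DESCENDER",
    "MACRON", "TOPBAR", "VARIANT", "MIDDLE HOOK", "YOTIFIER",
]

# Two base letters, in their relative order in the list.
BASES = ["I", "U"]

# (code point, base, accent), deliberately shuffled. The accent labels
# are this sketch's reading of the character names, not taken from N547.
LETTERS = [
    (0x04EF, "U", "MACRON"),
    (0x0439, "I", "BREVE"),         # SHORT I
    (0x0443, "U", "NONE"),
    (0x04E3, "I", "MACRON"),
    (0x04F3, "U", "DOUBLE ACUTE"),
    (0x0438, "I", "NONE"),
    (0x04F1, "U", "DIAERESIS"),
    (0x04E5, "I", "DIAERESIS"),
    (0x045E, "U", "BREVE"),         # SHORT U
]


def rank(letter):
    """Sort key: position of the base letter, then of the accent."""
    _, base, accent = letter
    return (BASES.index(base), ACCENTS.index(accent))


ordered = [cp for cp, _, _ in sorted(LETTERS, key=rank)]
```

For these nine letters the result reproduces the sequence given in the
list: I, SHORT I, I_DIAERESIS, I_MACRON, then U, SHORT U, U_DIAERESIS,
U_DOUBLE ACUTE, U_MACRON.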

> This order is not identical to that found in Musaev, but it is fairly
> close (here BREVE precedes DIAERESIS but in Musaev DIAERESIS precedes
> BREVE), and Musaev is not intended to be normative. The order is
> isomorphic to the order of general accents already found among the
> collating symbols in ISO 14651, where BREVE precedes DIAERESIS. It need
> hardly be said that the order of accents is a fairly arbitrary thing in
> 14651....
>
> To answer some specific questions:
>
> Abxazian letters should be sorted as derived letters (following IE
> and O), not as new basic letters (added to the end of the alphabet).
> This is in accordance with Abxazian practice.

It is also in accordance with user expectations - they are "e and o
look-alikes" regardless of what "sound-alikes" they are.

>
> Johan asked: "Why are the four pre-1917 letters mixed up with the
> historic ones, and not placed at their pre-1917 position?" If he
> means the four letters used in the Russian language before 1917
> (BYELORUSSIAN-UKRAINIAN I, YAT, FITA, IZHITSA), then the answer is
> that Russian was not taken as the base, but Old Church Slavonic, as
> this is more generic and language-independent. Nevertheless, upon
> checking Faulmann 1880 I find that the relative order of I,...
> BYELORUSSIAN-UKRAINIAN I,... SOFT SIGN, YAT,... YA... FITA, IZHITSA
> is indeed the order used in Russian, so the ordering here is
> conformant with the practice of that large and important language...

Agreed: this ordering is consistent with Church Slavonic, Russian,
and most other Cyrillic-script languages in which these letters have
occurred.

> The digraphs, trigraphs, and tetragraphs given in Musaev can be
> tailored in implementations of 14651, but are outside the scope of
> the basic table.
>
> The principles used to sort Cyrillic in 14651 can be seen to be the
> same as those employed for Latin, a script also used for many
> languages. No one language is favoured over any other. There is to my
> knowledge only one possible outstanding issue:
>
> The order where DZE precedes ZE follows the historical order of
> these letters in Old Church Slavonic where ZELO precedes ZEMLJA. DZE
> is identical with ZELO and ZE is identical with ZEMLJA. When the
> characters are given their numeric values (as they often are when
> used as dates in books), ZELO is 6 (8 in Glagolitic) and ZEMLJA is 7
> (9 in Glagolitic). In Old Ukrainian ZELO precedes ZEMLJA. In Old
> Romanian SALO precedes SEMLIA. The open issue is this: I have heard,
> but can neither confirm nor deny with the sources I have to hand,
> that in the Macedonian language ZE precedes DZE.

That does seem to be the case for Macedonian in my sources too.
Just as there is a strong case for adding a further Cyrillic character
SHTE (see above), there is also a case (admittedly less strong, and
with more possibilities for confusion than with SHTE) for adding a
further character Cyrillic ZELO (Church Slavonic) to ISO/IEC 10646,
to ISO/IEC 14651, and to ISO 9. This would allow the ideal sorting of
Church Slavonic, modern Russian (etc) _and_ Macedonian. It would be
necessary for these standards to point out that these are different
characters, but sorting out the differences would be up to the
implementors.

> The default order
> of 14651 where DZE precedes ZE can be tailored for modern Macedonian
> needs just as the order of Æ must be tailored for in Danish.
> However I felt it my duty to bring attention to the point. There is
> a lot of literature using the Cyrillic script which would require
> the Old Church Slavonic ordering, and I felt that it was best (and
> less contentious) to stick to the layout of the prototypical
> Cyrillic script for ISO 14651.
>
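
The tailoring described above (a generic default order, adjusted per
language) can be sketched as an exchange of two entries in the default
order. This is a minimal Python sketch under stated assumptions: it
uses a five-letter excerpt of the list, it assumes that Macedonian does
place ZE before DZE as suggested above, and the helper names tailored
and sort_key are hypothetical. The sample strings are arbitrary letter
pairs chosen for the sketch, not attested words.

```python
# Excerpt of the default order from the list: A, ZHE, DZE, ZE, I.
DEFAULT = [0x0430, 0x0436, 0x0455, 0x0437, 0x0438]


def tailored(order, a, b):
    """Return a copy of order with the positions of a and b exchanged."""
    result = list(order)
    i, j = result.index(a), result.index(b)
    result[i], result[j] = result[j], result[i]
    return result


def sort_key(order):
    """Build a key function ranking each character by its position."""
    rank = {chr(cp): i for i, cp in enumerate(order)}
    return lambda word: [rank[ch] for ch in word]


# Assumed Macedonian tailoring: ZE before DZE.
MACEDONIAN = tailored(DEFAULT, 0x0455, 0x0437)

samples = ["\u0437\u0430", "\u0455\u0430"]  # ZE+A, DZE+A
default_sorted = sorted(samples, key=sort_key(DEFAULT))
macedonian_sorted = sorted(samples, key=sort_key(MACEDONIAN))
```

Under the default excerpt the DZE string sorts first; under the
tailored order the ZE string does, while the default table itself is
left untouched.
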
> The ... table [above] is correct according to the principles discussed.
> The table appearing in the current draft does not follow these
> principles, but that is an error, and the table here is what ought to
> be used. National Bodies [and liaison members too! -John Clews] take note.
>
> ... HTML Michael Everson, everson@indigo.ie http://www.indigo.ie/egt
> Dublin, 1997-11-13
>

-- 
Chair of ISO/TC46/SC2: Conversion of Written Languages;
Member of CEN/TC304: Character Set Technology;
Member of ISO/IEC/JTC1/SC2: Character Sets.

SESAME Computer Projects, 8 Avenue Road, Harrogate, HG2 7PG, England Email: Converse@sesame.demon.co.uk; tel: +44 (0) 1423 888 432


