Re: data for collation tests

From: Alain LaBonté (alb@riq.qc.ca)
Date: Sun Feb 02 1997 - 14:39:32 EST


At 16:47 1997-1-24 -0800, Xiu Lu wrote:
>Do you know where and how to get test data files for collation tests (in
>Japanese, German and French)? I have a program that reads a test data
>file, collates it, and then outputs the sorted result to a file. But I do
>not have proper data to test this program.
>
>***********************************************************************
>* Xiu Lu 415-937-4595 (tel) *
>* Internationalization, Server products xiulu@netscape.com *
>* Netscape Communications Corporation http://home.netscape.com *
>***********************************************************************

Sorry to have taken so long to answer; I was (and still am) swamped with
messages and to-dos. Here is the benchmark of the CAN/CSA Z243.4.1 Canadian
ordering standard. The first list is unsorted and is meant to be used as
input for first-hand testing. The second list is the prescribed result,
obtained by following the rules of the standard. It is sorted according to
the major French dictionaries, and it also contains some extra non-French
words.

Our rules also sort English correctly according to those English
dictionaries that have written, established rules. That said, Michael
Everson will tell you that English speakers learn that upper case is sorted
before lower case, and that his reference, the Concise Oxford English
Dictionary, does this in practice. I could not verify this in the complete
Oxford English Dictionary that I have at home (325,000 words), as it has no
specific rules for case. In Canada we chose to do what the English
dictionaries that write down their rules do, which also harmonizes with
German, which sorts lower case before upper case. French dictionaries have
no rule for case ordering but have precise, albeit arcane, rules for
accents, which we respect.

I had to use QP (quoted-printable) coding to make sure character bits would
not be stolen by criminally behaving servers [like the one used for this
request ): ]. This is unfortunate: this coding should not exist, and
everybody should turn to 8-bit MIME. Sorry about this.
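
If the lists reach you still in quoted-printable form, they can be decoded with Python's standard quopri module. This is a small sketch: the helper name decode_qp is hypothetical, and the ISO 8859-1 (Latin-1) charset is an assumption, since the message does not state which charset was used.

```python
import quopri


def decode_qp(raw_bytes, charset="latin-1"):
    """Decode quoted-printable bytes to text.

    The default charset is an assumption; use the one declared in the
    mail's Content-Type header when it is available.
    """
    return quopri.decodestring(raw_bytes).decode(charset)


print(decode_qp(b"l=E9s=E9"))
print(decode_qp(b"vice-pr=E9sident"))
```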

First list (unsorted):

ou
lésé
péché
vice-président
9999

haïe
coop
caennais
lèse

air@@@
côlon
bohème
gêné
lamé
pêche
LÈS
vice versa
C.A.F.
cæsium
resumé
Bohémien
co-op
pêcher
les
CÔTÉ
résumé
Ålborg
cañon
du
haie
pécher
Mc Arthur
cote
colon
l'âme
resume
élève
Canon
lame
Bohême
0000
relève
gène
casanier
élevé
COTÉ
relevé
Grossist
vice-presidents' offices
Copenhagen
côte
McArthur
Mc Mahon
Aalborg
Größe
vice-president's offices
cølibat
PÉCHÉ
COOP
@@@air
VICE-VERSA
gêne
CO-OP
révélé
révèle
çà et là
Noël
île
aïeul
Île d'Orléans
nôtre
notre
août
NOËL
@@@@@
L'Haÿ-les-Roses
CÔTE
COTE
côté
coté
aide
air
vice-president
modelé
MODÈLE
maçon
MÂCON
pèche
pêché
pechère
péchère

Second list (sorted correctly):

@@@@@
0000
9999
Aalborg
aide
aïeul
air
@@@air
air@@@
Ålborg
août
bohème
Bohême
Bohémien
caennais
cæsium
çà et là
C.A.F.
Canon
cañon
casanier
cølibat
colon
côlon
coop
co-op
COOP
CO-OP
Copenhagen
cote
COTE
côte
CÔTE
coté
COTÉ
côté
CÔTÉ
du

élève
élevé
gène
gêne
gêné
Größe
Grossist
haie
haïe
île
Île d'Orléans
lame
l'âme
lamé
les
LÈS
lèse
lésé
L'Haÿ-les-Roses
MÂCON
maçon
McArthur
Mc Arthur
Mc Mahon
MODÈLE
modelé
Noël
NOËL
notre
nôtre
ou

pèche
pêche
péché
PÉCHÉ
pêché
pécher
pêcher
pechère
péchère
relève
relevé
resume
resumé
résumé
révèle
révélé
vice-president
vice-président
vice-president's offices
vice-presidents' offices
vice versa
VICE-VERSA
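
The two lists can be used as a simple regression test: sort the first list with the collation routine under test and compare the result, position by position, with the second list. The sketch below shows one way to do that in Python. The function name first_mismatches and its parameters are hypothetical, and the default sort used in the example (plain code point order) merely stands in for a real collation key.

```python
def first_mismatches(unsorted_words, expected_words, key=None, limit=5):
    """Sort unsorted_words with `key` and report where the result differs.

    Returns up to `limit` tuples of (position, got, expected).
    An empty list means the order matches the expected list.
    """
    if len(unsorted_words) != len(expected_words):
        raise ValueError("the two lists must have the same length")

    got_words = sorted(unsorted_words, key=key)
    mismatches = []
    for position, (got, expected) in enumerate(zip(got_words, expected_words)):
        if got != expected:
            mismatches.append((position, got, expected))
            if len(mismatches) == limit:
                break
    return mismatches


# Code point order puts "coté" before "côte", unlike the expected order.
print(first_mismatches(["coté", "cote", "côte"], ["cote", "côte", "coté"]))
```

Reporting only the first few differences keeps the output readable, since one wrong rule usually displaces many entries at once.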

Best Regards.

Alain LaBonté
Québec

Project Editor, CAN/CSA Z243.4.1 (Canadian Ordering Standard for en and fr)
Project Editor, ISO/IEC 14651 (Ordering standard for UCS/UNICODE)


