RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: CE Whitehead (cewcathar@hotmail.com)
Date: Wed Jul 07 2010 - 20:57:08 CDT


     Hi.
    > Date: Wed, 7 Jul 2010 17:34:58 -0700
    > From: kenw@sybase.com
    > [snipping all the word breaking discussion, which I am not going
    > to comment on ... ]
     
    I actually had a few questions about this . . . but o.k.

     

    (Hope this 'aside' won't take much list time)

     
    > Which means that you prefer a field-by-field collation for names,
    > rather than a merged field collation.
    Yes, generally.
    > But this collation departs
    > from the default UCA ordering in other ways. To get these results
    > you would have to be treating the presence of the accents as
    > a tertiary difference (comparable to the casing differences),
    Well, not quite; for case versus accent, here is the order I would prefer:
    Disilva
    disilva
    Di'silva
    di'silva
     
    or
    Ete
    ete
    E'te'
    e'te'
     
    So case is not as important as accent.
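
As a rough illustration of the ordering above, here is a minimal Python sketch. It assumes the apostrophes in the examples stand for acute accents (E'te' meaning Été), and the key function `accent_before_case_key` is a hypothetical name, not part of UTS #10 or any library. It compares base letters first, then accents, then case (uppercase first), so an accent difference outranks a case difference.

```python
import unicodedata

def accent_before_case_key(s):
    """Hypothetical sort key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = []   # base characters, in order
    marks = []  # combining marks attached to each base character
    for ch in decomposed:
        if unicodedata.combining(ch) and base:
            marks[-1] += ch          # attach the accent to the preceding letter
        else:
            base.append(ch)
            marks.append("")
    primary = "".join(base).casefold()                         # letters only
    secondary = tuple(marks)                                   # unaccented sorts first
    tertiary = tuple(0 if ch.isupper() else 1 for ch in base)  # uppercase sorts first
    return (primary, secondary, tertiary)

# Assumed spellings for the examples: été, ete, Été, Ete
words = ["\u00e9t\u00e9", "ete", "\u00c9t\u00e9", "Ete"]
result = sorted(words, key=accent_before_case_key)
print(result)
```

This is only a sketch of the preference stated here; a real implementation would use a tailored collator rather than hand-built keys.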

    In any case, I may be confused, but the example from UTS #10 seems to treat the difference between di Silva and di Si'lva as more trivial than the difference between di Silva and Disilva, even though you are merging at word boundaries. That is, Disilva is placed after both of the others, although it differs from di Silva only in spacing and case (or is there something else going on?):
     
    > (EXAMPLE 2: sort from UAX 10 samples)
    > di Silva, Fred
    > di Si'lva, Fred
    > Disilva, Fred

    > *and* have a special rule that orders strings with spaces
    > ahead of strings with identical primary weighting but without
    > spaces.
    Yes, that's it.
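
The special rule just described might be sketched as follows. This shows the space rule in isolation only; the helper names `primary` and `spaces_first_key` are hypothetical, and the primary weighting is approximated by stripping accents, spaces, and case, which is an assumption rather than how the UCA computes weights.

```python
import unicodedata

def primary(s):
    """Approximate primary weight: letters only, no accents, spaces, or case."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace()
    ).casefold()

def spaces_first_key(s):
    """Hypothetical key: among equal primaries, strings with spaces come first."""
    has_space = any(ch.isspace() for ch in s)
    # The original string is kept as a last component only to make ties deterministic.
    return (primary(s), 0 if has_space else 1, s)

names = ["Disilva", "disilvb", "di Silva"]
ordered = sorted(names, key=spaces_first_key)
print(ordered)
```

With these inputs, "di Silva" and "Disilva" share the same approximate primary weight, so the one containing a space is placed first.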

    > . . .
    > You just may not be used to applications that would do the more
    > complex operation suggested by that second select statement,
    > rather than the easier-to-implement first select statement.
    >
    No, I suppose not. Because of my training in English or something, I think the last name gets priority over the first, so the first name is not really important if the last names differ. Accents are clearly part of the spelling, so the first-name field is not important if the last names differ in the placement of an accent.
     
    However,
    http://wiki.services.openoffice.org/wiki/Bibliographic/OOoBib_Functional_Requirements/Name_Sorting
    agrees with you here -- the accent is ignored in some cases.

    But see also:
    http://unicode.org/faq/collation.html
    "Q: How are names in a database sorted properly?"
    which shows that my solution, where a secondary or tertiary difference in one field can 'swamp a primary difference', is sometimes o.k.
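
To make the two approaches concrete, here is a small Python sketch contrasting a field-by-field key with a merged-field key. The two-level weighting in `levels` (base letters, then accents) is a simplification assumed for illustration, the function names are hypothetical, and the sample records are invented; the name with the accent is written as Dísilva on the assumption that the apostrophe notation means an acute accent.

```python
import unicodedata

def levels(s):
    """Return simplified (primary, secondary) weights for one string."""
    decomposed = unicodedata.normalize("NFD", s)
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    return primary, secondary

def field_by_field_key(record):
    """Compare the last name at every level before looking at the first name."""
    last, first = record
    return (levels(last), levels(first))

def merged_key(record):
    """Compare primaries of both fields first, then secondaries of both fields."""
    last, first = record
    (last_p, last_s), (first_p, first_s) = levels(last), levels(first)
    return ((last_p, first_p), (last_s, first_s))

# Invented sample records as (last name, first name)
records = [("D\u00edsilva", "Fred"), ("Disilva", "John")]
by_field = sorted(records, key=field_by_field_key)
merged = sorted(records, key=merged_key)
print(by_field)
print(merged)
```

In the field-by-field sort the accent on the last name decides the order before the first names are consulted, which is the "swamping" effect; in the merged sort the primary difference between Fred and John decides first.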
    > As for me, if I was trying to find all the "Fred Disilva" records
    > in my database, I would certainly prefer the second ordering
    > over the first, as it would make the results more immediately
    > usable for me.
    >
    Yes, I would too, normally.
    However, sometimes a difference in accent means a different name.
    In that case I would prefer to have separate files sorted by last name.
    > > I gather however that the second option is how search engines
    > > collate as search engines may treat hyphens as being the same
    > > as white space, and two-word and one-word variants of the otherwise
    > > same string may be equated too -- just to get more matches in hopes
    > > of getting the best one -- which is good because we make mistakes
    >
    > That is an entirely separate issue. Search engines tend to
    > suppress space and punctuation in matching search *strings*.
    > You are talking there about *matching* behavior, not *ordering*,
    > and the question really has nothing to do with word boundaries,
    > let alone distinct fields in a database.
    >
    Right. I realize this now that you mention it.
    > > -- but I still cannot accept the sort in Table 6)
    >
    > To each his own, I suppose. :-)
    >
    Maybe, yes.

     

    Best,

     

    C. E. Whitehead

    cewcathar@hotmail.com
    > --Ken
    >

                                                   


