From: D. Starner (shalesller@writeme.com)
Date: Sun Dec 05 2004 - 18:17:45 CST
"Philippe Verdy" <verdy_p@wanadoo.fr> writes:
> > Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
> > now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
> > makes it a good idea, but that's a lousy argument.
>
> If such a library does this, only based on the presence of the encoded words, without wondering
> in which language the text is written, that kind of processing text will be seriously
> inefficient or inaccurate when processing other languages than English for which you will have
> built such a library.
Many libraries have large numbers of books in English, French, German, Spanish, Italian,
and various non-Latin languages. Blanket stripping of a, an, the, and la from the
start of a title might very well be a good 90% heuristic for removing non-sorting
words. (German is the odd man out, since you can't blanket-remove a starting die
without also mangling English titles that begin with the word "die".)
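
To make the heuristic concrete, here is a rough Python sketch of the kind of blanket
stripping I mean. The function name, the word list, and the sample titles are my own
illustration, not taken from any actual library system, and a real catalogue would
presumably need a longer list and some language awareness.

```python
# Hypothetical sketch of a "90% heuristic" sort key for titles.
# LEADING_ARTICLES and sort_key are illustrative names, not from any real system.
LEADING_ARTICLES = ("a", "an", "the", "la")


def sort_key(title):
    """Return a sort key that ignores one leading article, regardless of language."""
    words = title.split(None, 1)  # split off the first word only
    if len(words) == 2 and words[0].lower() in LEADING_ARTICLES:
        return words[1].casefold()
    return title.casefold()


titles = [
    "The Hobbit",
    "La Peste",
    "A Tale of Two Cities",
    "Die Blechtrommel",  # "die" is deliberately not stripped
    "An Ideal Husband",
]
sorted_titles = sorted(titles, key=sort_key)
```

Note that "die" is left out of the list on purpose, so the German title sorts under D.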
> For plain-text (which is what Unicode deals about), even the "an", "the", "is" words (and so
> on...) are equally important as other parts of the text.
No. It all depends on what you want to do with the text.
Besides, the point is that it doesn't matter whether or not words are encoded as
codepoints; these processes can work just the same.
This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 18:19:20 CST