Re: New version of TR29:

From: John Burger (john@mitre.org)
Date: Thu Aug 15 2002 - 13:51:57 EDT

Previous message: Raymond Mercier: "Re: The mystery of Edwin U+1E9A"
In reply to: Mark Davis: "New version of TR29:"
Next in thread: Samphan Raruenrom: "Re: New version of TR29:"

My immediate reaction to this TR was that it was doomed, given how
difficult it is to tokenize text perfectly (I have written a number of
tokenizers for natural language processing, and they are never
complete). However, after reading the draft, I found myself agreeing
that it is reasonable to provide =some= guidance for the 80% solution.
So, I looked at the code for some of my tokenizers. Most of the special
cases covered there are not appropriate for the TR, but I do have the
following suggestion:

Consider adding U+0026 (ampersand) to the MidLetter class. I did a
quick scan through a few million words of New York Times data I have,
and found that most mid-word occurrences would probably not induce word
breaks, e.g.,

   Q&A
   R&R
   AT&T
   P&G
   ...

Exceptions included:

   Ben&Jerry
   How&Why

Perhaps a more conservative rule would involve only uppercase letters ...

A caveat: I am unfamiliar with analogous cases in languages other than
English.
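
To make the suggestion concrete, here is a rough sketch in Python that compares three treatments of the ampersand: always a break, treated like a MidLetter character, and the more conservative uppercase-only variant. It is only an approximation of the idea, not the TR29 algorithm: a "word" here is simply a run of letters, and the function name `tokenize` and its `amp` parameter are made up for illustration.

```python
import re

# Any Unicode letter: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

def tokenize(text, amp="break"):
    """Split text into letter runs and single non-space characters.

    amp="break"     : "&" always separates words.
    amp="midletter" : "&" between two letters does not break the word.
    amp="uppercase" : "&" joins only when both neighbors are A-Z
                      (an assumed reading of the conservative rule).
    """
    if amp == "midletter":
        join = "&"
    elif amp == "uppercase":
        join = r"(?<=[A-Z])&(?=[A-Z])"
    elif amp == "break":
        join = None
    else:
        raise ValueError(f"unknown amp mode: {amp!r}")

    word = rf"{LETTER}+"
    if join is not None:
        # A joiner only counts when more letters follow it.
        word += rf"(?:{join}{LETTER}+)*"

    # Anything that is not part of a word becomes its own token.
    return re.findall(rf"{word}|\S", text)

sample = "AT&T and Ben&Jerry"
print(tokenize(sample, "break"))      # ['AT', '&', 'T', 'and', 'Ben', '&', 'Jerry']
print(tokenize(sample, "midletter"))  # ['AT&T', 'and', 'Ben&Jerry']
print(tokenize(sample, "uppercase"))  # ['AT&T', 'and', 'Ben', '&', 'Jerry']
```

The uppercase variant keeps Q&A, R&R, AT&T and P&G whole while still splitting Ben&Jerry and How&Why, at the cost of being tied to the A-Z range.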

- John Burger
MITRE

