My immediate reaction to this TR was that it was doomed, given how
difficult it is to tokenize text perfectly (I have written a number of
tokenizers for natural language processing, and they are never
complete). However, after reading the draft, I found myself agreeing
that it is reasonable to provide =some= guidance for the 80% solution.
So, I looked at the code for some of my tokenizers. Most of the special
cases covered there are not appropriate for the TR, but I do have the
following suggestion:
Consider adding U+0026 (ampersand) to the MidLetter class. I did a
quick scan through a few million words of New York Times data I have,
and found that most mid-word occurrences probably should not induce
word breaks, e.g.,
Q&A
R&R
AT&T
P&G
...
Exceptions included:
Ben&Jerry
How&Why
Perhaps a more conservative rule would involve only uppercase letters ...
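To make the two options concrete, here is a minimal Python sketch. It is not the TR's algorithm: the apostrophe is the only baseline MidLetter character, the function name `tokenize` and the rule names "none", "any", and "upper" are invented for illustration, and "upper" is just one possible reading of the conservative rule (join on an ampersand only when the letters on both sides of it are ASCII uppercase).

```python
import re

# Hypothetical, heavily simplified word breaker (not the TR's rules).
LETTER = r"[^\W\d_]"  # any Unicode letter

def tokenize(text, amp_rule="none"):
    """Split text into words and single non-space symbols.

    amp_rule (names are illustrative assumptions):
      "none"  - '&' always breaks a word
      "any"   - '&' between any two letters joins them (MidLetter-like)
      "upper" - '&' joins only when flanked by ASCII uppercase letters
    """
    if amp_rule == "any":
        joiner = rf"['&]{LETTER}+"
    elif amp_rule == "upper":
        joiner = rf"'{LETTER}+|(?<=[A-Z])&(?=[A-Z]){LETTER}+"
    else:
        joiner = rf"'{LETTER}+"
    # A word is a run of letters, optionally continued across a joiner;
    # anything else that is not whitespace becomes its own token.
    pattern = re.compile(rf"{LETTER}+(?:{joiner})*|\S")
    return pattern.findall(text)

print(tokenize("AT&T", "none"))        # ['AT', '&', 'T']
print(tokenize("AT&T", "any"))         # ['AT&T']
print(tokenize("Ben&Jerry", "any"))    # ['Ben&Jerry']
print(tokenize("Ben&Jerry", "upper"))  # ['Ben', '&', 'Jerry']
```

Under these assumptions the "upper" variant keeps Q&A, R&R, AT&T, and P&G whole while still breaking Ben&Jerry and How&Why.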
A caveat: I am unfamiliar with analogous cases in languages other than
English.
- John Burger
MITRE