[OT] RE: FW: extracting words

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Sun Feb 11 2001 - 14:59:57 EST


On Sun, 11 Feb 2001, Mike Lischke wrote:

> > If you are willing to give up precision, then you can use heuristics.
> >
> > It's ugly but perhaps ok for a simple editor. You can improve the
> > precision with better heuristics and more data, so you get to decide
> > how much is good enough...
>
> So using white spaces for general word breaking and ideographs for CJK
> would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese. Chinese text does not use white
space anywhere.[1] What was described is that it is tolerable (though not
perfect; e.g., punctuation is not handled properly) to break *lines* in
Chinese text between Chinese characters. To break *words* properly in
Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted
after each Chinese character, but that is a hack to make word-wrapping
behave properly on Chinese-unaware software (which would otherwise treat
an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word",
but this is technically wrong from a linguistic standpoint, despite
cultural claims to the contrary, and will have implications for anything
that operates on words.
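
To make the line/word distinction concrete, here is a rough sketch in
Python. Everything in it is an assumption for illustration: the tiny
LEXICON, the function names, and the choice of greedy forward maximum
matching as the dictionary lookup strategy (one common simple approach,
not the only one, and it can pick the wrong split when a string is
ambiguous). The Han test covers only the basic CJK Unified Ideographs
block, and punctuation is not handled properly, as noted above.

```python
# Hypothetical toy lexicon; a real one would hold many thousands of entries.
LEXICON = {"中文", "文本", "不", "使用", "空格"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked;
    # extension blocks and compatibility ideographs are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def line_break_offsets(text):
    """Offsets where a *line* may break: between two adjacent Han characters.

    Punctuation is simply never treated as a break point here, which is
    cruder than real line-breaking rules.
    """
    return [
        i for i in range(1, len(text))
        if is_han(text[i - 1]) and is_han(text[i])
    ]


def segment_words(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Split text into *words* by greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches; if
    nothing matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sample = "中文文本不使用空格"
print(line_break_offsets(sample))  # line-break opportunities
print(segment_words(sample))       # dictionary-based words
print(list(sample))                # the "each character is a word" shortcut
```

The last two lines of output differ: the per-character shortcut yields
nine "words", while the dictionary lookup groups most of them into
two-character words. The line-break offsets, by contrast, fall between
every pair of characters regardless of word boundaries.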

The handling of Japanese and Korean text is different from that of Chinese
(lumping them together as "CJK" is inappropriate in this context), but I
will leave them for others to provide a better treatment. (Jungshik Shin
has already explained the Korean case.)

Thomas Chan
tc31@cornell.edu


