Re: extracting words

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Feb 11 2001 - 03:47:14 EST


On Sat, 10 Feb 2001, Edward Cherlin wrote:

> At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:

> >I'm writing a C program called Blacklist. Its purpose is to accept
> >a string (Unicode) and extract words from it, then hash the found words
> >according to a hashing algorithm and see if the word is in the blacklist
> >hashtable.
> >
> >This is all very straightforward, but the problem is extracting
> >words from this string.
> >How do I determine what a word is in Japanese or Korean or whatever other
> >language? { a space ? }
>
> No. Chinese and Japanese almost never have spaces between words, and
> they are not required in Korean.

I'm afraid this is a little misleading. In modern Korean orthography,
every word is delimited by a space (Korean Orthographic Rules, article 2 :
1988-01-19, Ministry of Education, ROK). The exception to that rule is
that particles (Josa) must follow the preceding word without a space
(ibid, article 41). There are also some minor exceptions (ibid, article
43, article 47, article 49), so you might say you're correct in that
spaces are *not required* in Korean, but the principle of delimiting
every pair of words with a space still holds.
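
For orthographically correct Korean input, then, splitting on spaces is
a reasonable first pass. Below is a minimal sketch in C, assuming the
string is UTF-8 encoded and that only ASCII space, tab, CR and LF act as
delimiters. The names next_token and count_tokens are made up for
illustration and are not part of any library. Scanning byte by byte is
safe under that assumption because no byte of a multi-byte UTF-8
sequence is ever an ASCII character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the next space-delimited
 * token at or after p and stores its length in bytes in *len.
 * Returns NULL when no token is left. Assumes UTF-8 input in which
 * only ASCII whitespace separates tokens. */
const char *next_token(const char *p, size_t *len)
{
    const char *delims = " \t\r\n";

    p += strspn(p, delims);     /* skip leading delimiters */
    *len = strcspn(p, delims);  /* bytes up to the next delimiter */
    return *len ? p : NULL;
}

/* Hypothetical helper: counts the space-delimited tokens in text. */
size_t count_tokens(const char *text)
{
    size_t n = 0, len;
    const char *tok = text;

    while ((tok = next_token(tok, &len)) != NULL) {
        n++;
        tok += len;             /* continue after this token */
    }
    return n;
}
```

Note that each token still carries any particle attached to it (for
example "Abeojiga" rather than "Abeoji"), so a blacklist lookup on the
raw tokens would probably need a further step to deal with particles.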

> Yes, we have had it for a long time; no, nobody has solved it
> entirely; and yes, this approach is wrong. Breaking a string into
> words may require a thorough understanding of the vocabulary and
> grammar of the language, and even that may not be enough.

I absolutely agree with you on this point.

> An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
> Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange
> isseoyo (Father is in the bag)?

I don't think this is a good example to support your point about the
enormous difficulties and complexities involved in extracting words
(a point I agree with), because the original question was how to extract
words from 'supposedly orthographically correct sentences'. Your example
(Abeojigabangeisseoyo) clearly violates the (modern) orthographic rule
by gluing all the words together without spaces (nobody would write
that way). One of the reasons that spaces are used to separate words
in Korean writing is to remove this kind of ambiguity (as taught
in first-grade Korean class). It would have been more appropriate
if you had come up with an example from Japanese or Chinese, where spaces
are rarely used to separate words.
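
To make the ambiguity of unspaced text concrete, here is a small sketch
in C that counts the ways a string can be split into entries of a
lexicon. Everything in it is an assumption made for illustration: the
five-entry TOY_LEXICON, the romanized lowercase input, and the function
name count_segmentations. It only shows that a dictionary alone cannot
choose between the two readings; it is not a real segmenter.

```c
#include <stddef.h>
#include <string.h>

/* Toy lexicon (illustrative only): romanized forms with their
 * particles already attached. */
static const char *const TOY_LEXICON[] = {
    "abeoji", "abeojiga", "bange", "gabange", "isseoyo"
};

/* Hypothetical helper: counts how many ways s can be written as a
 * concatenation of lexicon entries, by trying every entry that
 * matches at the start of s and recursing on the remainder. */
size_t count_segmentations(const char *s)
{
    size_t total = 0;
    size_t i;

    if (*s == '\0')
        return 1;               /* consumed the whole string: one way */

    for (i = 0; i < sizeof TOY_LEXICON / sizeof TOY_LEXICON[0]; i++) {
        size_t n = strlen(TOY_LEXICON[i]);

        if (strncmp(s, TOY_LEXICON[i], n) == 0)
            total += count_segmentations(s + n);
    }
    return total;
}
```

With this lexicon, count_segmentations("abeojigabangeisseoyo") returns 2
(abeojiga + bange + isseoyo, and abeoji + gabange + isseoyo), whereas
once the writer has supplied the spaces each token has only one reading.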

Jungshik Shin


