RE: extracting words

From: Christopher John Fynn (cfynn@druknet.net.bt)
Date: Wed Feb 14 2001 - 03:06:31 EST


 Mark Davis wrote:

> BTW, someone on this thread made this topic out to be even more complex than
> is: that Devanagari and Korean are written without spaces. While that may
> have been the case historically, I believe that the modern text does use
> spaces. Chinese, Japanese and Thai are the main languages written without
> spaces.

Several Indic languages/scripts do not use spaces (or other marker characters) between words or syllables. I don't think you can rely on spaces between words even across all the different Indic languages that are written only in the Devanagari script.

Tibetan script has a "syllable" (or morpheme) separator, the tsheg [U+0F0B], which provides a line break opportunity. In modern Dzongkha (Bhutanese), however, this character is dropped in many places where a reader can determine the boundary from grammatical rules. BTW, in traditional Tibetan orthography a space is *not* a line break opportunity.
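
To make the tsheg point concrete, here is a rough Python sketch that treats U+0F0B as the only syllable separator and line break opportunity, and deliberately ignores spaces. The function names are hypothetical, and this is an illustration under simplified assumptions, not a full line breaking implementation. It will also miss boundaries in Dzongkha text wherever the tsheg has been dropped.

```python
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG


def split_syllables(text):
    """Split text into syllables at each tsheg, dropping empty pieces."""
    return [piece for piece in text.split(TSHEG) if piece]


def break_opportunities(text):
    """Return offsets where a line may be broken: just after each tsheg.

    Spaces are intentionally not treated as break opportunities,
    following traditional Tibetan orthography. A tsheg at the very
    end of the text yields no offset, since nothing follows it.
    """
    return [
        i + 1
        for i, ch in enumerate(text)
        if ch == TSHEG and i + 1 < len(text)
    ]


if __name__ == "__main__":
    # Two syllables joined by a tsheg (written with escapes for clarity).
    sample = "\u0f56\u0f7c\u0f51" + TSHEG + "\u0f61\u0f72\u0f42"
    print(split_syllables(sample))
    print(break_opportunities(sample))
```

Anything that extracts words rather than syllables would still need a dictionary or grammatical rules on top of this, since the tsheg only marks syllable boundaries.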

- Chris

--
Chris Fynn
DDC Dzongkha Computing Project
Thimphu, Bhutan.


