From: Ngwe Tun (ngwestar@gmail.com)
Date: Sat Aug 25 2007 - 14:04:49 CDT
Hi Friends
Because of the complex nature of our languages, Burmese (Myanmar), Khmer and
Lao have no word break marker in sentences. So we have to identify each word
and insert a marker, either by an operator or automatically. It is really
hard work. Some languages also need the sentence ending marker identified.
I found a good discussion at
http://blogamundo.net/dev/2006/12/28/the-zero-width-space/ concerning ZWSP.
I have doubts about adding ZWSP to Burmese text that is stored in documents,
files and databases.
In Burmese, characters combine into a syllable, syllables combine into a
word, and words combine into a sentence. We only have a sentence ending
marker. We can identify syllables with a rule-based solution, so
syllabification does not need to be very complex. But word identification
takes much more work, and I think we need good lexical resources for it.
(Please correct me if I'm wrong.)
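A minimal sketch of the rule-based syllable idea follows. It is only an
illustration under stated assumptions: it uses the Myanmar encoding of
Unicode 5.1 and later (asat as U+103A, virama as U+1039), and it applies a
single simplified rule, namely that a consonant starts a new syllable unless
it is stacked under the previous one or is itself a final consonant. The
function name split_syllables is hypothetical, and independent vowels,
digits and punctuation are not handled.

```python
import re

# Simplified rule (an assumption, not a complete Burmese syllabifier):
# a consonant U+1000..U+1021 starts a syllable unless
#   - it is preceded by virama U+1039 (it is stacked), or
#   - it is followed by asat U+103A or virama U+1039, optionally after
#     dot below U+1037 (it is a final or the upper part of a stack).
_SYLLABLE_START = re.compile(
    "(?<!\u1039)[\u1000-\u1021](?!\u1037?[\u103a\u1039])"
)


def split_syllables(text):
    """Split text into syllables at the positions matched by the rule."""
    starts = [m.start() for m in _SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading material as its own chunk
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


# Three syllables: U+1019 U+103C U+1014 U+103A / U+1019 U+102C / U+1005 U+102C
sample = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"
syllables = split_syllables(sample)
```

Joining the returned pieces always gives back the original text, so the
split itself does not change what is stored.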
So, after the syllable breaking algorithm, we have text segmented into
syllables. Here I would like to know, first, what syllable break marker
is used in your language (Thai, Khmer, Lao or any other language). Second,
should it be added by an operator, or automatically by the shaping engine
(Uniscribe, Pango, Qt, ATT or other programs) or by the input method
(keyboards, IME or on-screen keyboard)?
ZWSP is defined in Unicode 5.0, Section 16.2, page 535:
Zero Width Space. The U+200B zero width space indicates a word boundary,
except that it has no width. Zero-width space characters are intended to
be used in languages that have no visible word spacing to represent word
breaks, such as Thai, Khmer, and Japanese. When text is justified, ZWSP
has no effect on *letter* spacing—for example, in English or Japanese usage.
We have to use ZWSP for word breaking in our language, so we need to use
ZWSP for line breaking too. Every Burmese word would be followed by a ZWSP,
added either automatically or by an operator.
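As a rough sketch of the automatic case, the code below groups syllables
into words by greedy longest match against a word list and then joins the
words with U+200B. Everything here is an assumption for illustration: the
function names segment_words and mark_words, the max_len limit, and the
one-entry lexicon are hypothetical, and longest match is only one possible
technique. A real system would need the lexical resources mentioned above.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def segment_words(syllables, lexicon, max_len=4):
    """Greedy longest match: take the longest run of syllables found in
    the lexicon; fall back to a single syllable when nothing matches."""
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def mark_words(words):
    """Join words with ZWSP so a renderer may break lines between them."""
    return ZWSP.join(words)


# Toy lexicon with one two-syllable entry (hypothetical data).
lexicon = {"\u1019\u103c\u1014\u103a\u1019\u102c"}
syllables = ["\u1019\u103c\u1014\u103a", "\u1019\u102c", "\u1005\u102c"]
marked = mark_words(segment_words(syllables, lexicon))
```

With this toy input the first two syllables are grouped into one word, and
a single ZWSP is placed before the last syllable.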
Please let me ask for one last clarification. Do we need to store ZWSP in
documents, files and databases for the purpose of word segmentation/breaking?
Or is it possible to add it automatically in some other way?
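One practical point behind this question can be sketched in code: if ZWSP
is stored, then searching and comparing text has to ignore it, because the
same word may appear with or without the marker. The helper names
strip_zwsp and same_text below are hypothetical, and this is only an
illustration of the trade-off, not a recommendation either way.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def strip_zwsp(text):
    """Return text with every ZWSP removed, e.g. before storing or indexing."""
    return text.replace(ZWSP, "")


def same_text(a, b):
    """Compare two strings while ignoring any ZWSP markers in either one."""
    return strip_zwsp(a) == strip_zwsp(b)


stored = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"
marked = "\u1019\u103c\u1014\u103a\u1019\u102c" + ZWSP + "\u1005\u102c"
matches = same_text(stored, marked)
```

If the markers are stripped before storage, they have to be regenerated by
a segmenter whenever line breaking is needed.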
Please share your experiences with word breaking and ZWSP issues in your
language.
Thanks in advance.
Ngwe Tun.
-- In Burmese, Ngwe means 1) silver 2) money 3) second award; Tun means 1) light 2) to be prominent.