issues storing ZWSP in docs, files and databases

From: Ngwe Tun (ngwestar@gmail.com)
Date: Sat Aug 25 2007 - 14:04:49 CDT


    Hi Friends

    Because of the complex nature of our languages, Burmese (Myanmar), Khmer
    and Lao have no word break marker in their sentences. So we have to
    identify each word and insert a marker, either by an operator or
    automatically. This is really hard work. Some languages also need the
    sentence-ending marker to be identified.

    Then I found a good discussion concerning ZWSP at
    http://blogamundo.net/dev/2006/12/28/the-zero-width-space/

    I have doubts about adding ZWSP to Burmese text that is stored in
    documents, files and databases.

    In Burmese, characters combine into a syllable, syllables combine into a
    word, and words combine into a sentence. We only have a sentence-ending
    marker. We can identify syllables with a rule-based solution, so
    syllabification does not need much complexity. But word identification
    takes much more work, and I think we need good lexical resources for it.
    (Please correct me if I'm wrong.)
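
    As a rough illustration of the rule-based syllabification mentioned
    above, here is a minimal Python sketch. It is not a complete algorithm.
    It assumes the present-day Unicode Myanmar encoding in which asat is
    U+103A (text encoded differently would need different rules), it only
    looks at consonants U+1000..U+1021, and the names SYLLABLE_START and
    break_syllables are made up for this example.

```python
import re

# Assumption: present-day Unicode Myanmar encoding (asat = U+103A).
CONSONANTS = "\u1000-\u1021"  # MYANMAR LETTER KA .. MYANMAR LETTER A
ASAT = "\u103A"               # "killer" sign that closes a syllable
VIRAMA = "\u1039"             # stacks the following consonant

# Simplified rule: a consonant starts a new chunk unless it is stacked
# under a preceding virama, or is itself followed by asat or virama
# (in which case it closes the previous syllable instead).
SYLLABLE_START = re.compile(
    "(?<!" + VIRAMA + ")[" + CONSONANTS + "](?![" + ASAT + VIRAMA + "])"
)


def break_syllables(text):
    """Split text into syllable-like chunks (hypothetical helper)."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading non-consonant material
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


if __name__ == "__main__":
    # MA, MEDIAL RA, NA, ASAT, MA, AA, CA, AA
    sample = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
    print(break_syllables(sample))
```

    Stacked consonants stay together in one chunk under this rule, and
    digits, punctuation and independent vowels are not handled, so a real
    implementation would need more cases.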

    So, after a syllable-breaking algorithm, we have text segmented into
    syllables. Here I would like to know which syllable break marker is used
    in your language (Thai, Khmer, Lao or any other language). Second, I
    would like to know whether it needs to be added by an operator, or
    automatically by a shaping engine (Uniscribe, Pango, Qt, ATT or other
    programs) or an input method (keyboards, IME or on-screen keyboard).

    ZWSP is defined in Unicode 5.0, Section 16.2, page 535:

    Zero Width Space. The U+200B zero width space indicates a word boundary,
    except that it has no width. Zero-width space characters are intended to
    be used in languages that have no visible word spacing to represent word
    breaks, such as Thai, Khmer, and Japanese. When text is justified, ZWSP
    has no effect on letter spacing—for example, in English or Japanese usage.
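
    To make the definition concrete, this small Python sketch shows how
    U+200B looks to a program: it has a character name and a general
    category, it is not treated as whitespace by Python, and it still
    counts as a character in the stored string. The placeholder words are
    only for illustration.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# How the character is classified in the Unicode character database.
zwsp_name = unicodedata.name(ZWSP)
zwsp_category = unicodedata.category(ZWSP)

# Mark word boundaries between three placeholder words.
words = ["word1", "word2", "word3"]
marked = ZWSP.join(words)

if __name__ == "__main__":
    print(zwsp_name, zwsp_category)
    print(ZWSP.isspace())      # str.split() with no argument will not split here
    print(len(marked))         # each ZWSP adds one character to the data
    print(marked.split(ZWSP))  # the boundaries can be recovered explicitly
```

    Because ZWSP is invisible but still present in the data, plain string
    comparison and searching will see it unless it is handled on purpose.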

    We have to use ZWSP for word breaking in our language, so we need to use
    ZWSP for line breaking too. Every Burmese word would then be followed by
    a ZWSP, whether it is added automatically or by an operator.
    Please let me have one last clarification. Do we need to store ZWSP in
    documents, files and databases for the purpose of word
    segmentation/breaking? Or is it possible to add it automatically in some
    other way?
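
    One possible shape of the "add it automatically" option is sketched
    below in Python: the text is stored without ZWSP and the marks are
    inserted only when the text is prepared for display. This is a sketch
    under assumptions, not a recommendation. The two-entry LEXICON, the
    greedy longest-match segmenter and the function names are all
    hypothetical, and a real segmenter would need a proper lexical resource.

```python
ZWSP = "\u200b"

# Hypothetical toy lexicon; a real one would be far larger.
LEXICON = {
    "\u1019\u103C\u1014\u103A\u1019\u102C",  # "Myanmar"
    "\u1005\u102C",                          # "writing"
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def segment_words(text):
    """Greedy longest match against LEXICON (toy segmenter).

    Runs of characters that match no lexicon entry are kept together
    as one unknown chunk.
    """
    words, unknown, i = [], "", 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in LEXICON:
                if unknown:
                    words.append(unknown)
                    unknown = ""
                words.append(candidate)
                i += size
                break
        else:
            unknown += text[i]
            i += 1
    if unknown:
        words.append(unknown)
    return words


def strip_zwsp(text):
    """Remove ZWSP, e.g. before storing, comparing or searching."""
    return text.replace(ZWSP, "")


def add_zwsp(text):
    """Insert ZWSP between segmented words, e.g. just before display."""
    return ZWSP.join(segment_words(strip_zwsp(text)))


if __name__ == "__main__":
    stored = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
    shown = add_zwsp(stored)
    print(len(stored), len(shown))
    print(strip_zwsp(shown) == stored)
```

    The trade-off is that text stored without ZWSP depends on the quality
    of the segmenter at display time, while text stored with ZWSP keeps
    whatever boundaries the operator or program chose.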

    Please share your experiences with word breaking and ZWSP issues in your
    language.

    Thanks in advance.

    Ngwe Tun.

    -- 
    In Burmese, Ngwe means 1) silver 2) money 3) second award; Tun means
    1) light 2) to be prominent.
    

