From: Atif Gulzar (atif.gulzar@gmail.com)
Date: Thu Jan 29 2009 - 22:30:06 CST
Hi,
I have checked and could not find any Unicode character that acts as a
word separator (a zero-width space used as a WORD separator). Such a
character is needed for languages in which the space is not used as a
word separator. The existing zero-width characters cannot address this issue.
e.g.
U+200B Zero Width Space: This character is intended for line-break
control (in Lao, lines can be broken at syllable boundaries, and Lao
uses U+200B to mark those boundaries).
U+200C Zero Width Non-Joiner: Used to prevent joining and ligatures in
cursive scripts.
U+200D Zero Width Joiner: Used in cursive scripts to request joining
forms.
U+2060 Word Joiner: A zero-width non-breaking space (marks a position
where a line must not break).
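
For anyone who wants to check the four characters above, here is a minimal
Python sketch. It only uses the standard unicodedata module; the helper name
describe() is mine and purely illustrative. It prints each character's code
point, formal name and general category.

```python
import unicodedata

# The four zero-width characters discussed above.
ZERO_WIDTH = ["\u200B", "\u200C", "\u200D", "\u2060"]

def describe(ch):
    """Return (code point, formal Unicode name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in ZERO_WIDTH:
    print(describe(ch))
```

All four are format characters (general category Cf), and none of the names
or documented uses describes a plain word boundary.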
Algorithms can be devised for word segmentation, but it is a laborious
task that has to be performed every time before any language-processing
algorithm such as spell checking, next word, or exact-word search. There
should be some character that can be inserted (once) at word boundaries
by an algorithm.
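
To illustrate the "segment once, reuse many times" idea, here is a rough
Python sketch. Everything in it is an assumption made for illustration: the
greedy longest-match segment() function is a stand-in for whatever real
segmentation algorithm is used, the toy lexicon is Latin-script only for
readability, and MARKER is set to U+200B merely as a placeholder, since no
dedicated word-separator character exists.

```python
MARKER = "\u200B"  # placeholder only; stands in for the requested character

def segment(text, lexicon):
    """Hypothetical greedy longest-match segmenter (the costly step)."""
    out, i = [], 0
    max_len = max(map(len, lexicon), default=1)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit a one-character word
        out.append(text[i:j])
        i = j
    return out

def mark(text, lexicon):
    """Run segmentation once and store the boundaries in the text itself."""
    return MARKER.join(segment(text, lexicon))

def split_words(marked):
    """Later processing recovers the words with a plain split."""
    return marked.split(MARKER)

marked = mark("thecatsat", {"the", "cat", "sat"})
print(split_words(marked))
```

Once the text is marked, a spelling checker or exact-word search only needs
split_words() and never has to run the segmenter again.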
-- Best Regards, Atif Gulzar I ◘◘◘◘ Unicode, ɹɐzlnƃ ɟıʇɐ