Chinese Word Breaking
c933103 at gmail.com
Wed Jul 22 01:46:57 CDT 2015
Pretty much so, and IMO it is actually quite unnatural to write Chinese
with word boundaries marked. Even in cases like machine translation,
people expect the translation engine to figure out on its own how
characters should be grouped into words, without any markup for word
boundaries, just as when you type a sentence into a machine translator
you would not expect it to ask you, or show you, which part is the
subject and which part is the verb, etc.
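To make the point concrete, here is a minimal sketch in Python. It first groups text by runs of Han characters, which yields whole clauses, and then splits each run with forward maximum matching against a dictionary. The word list `TOY_DICT`, the helper names `han_runs` and `max_match`, and the `max_len` limit are all made up for illustration, and the character range covers only the basic CJK Unified Ideographs block. Real segmenters use large lexicons or statistical models, so treat this as an illustration rather than how any particular engine works.

```python
import re

# Hypothetical toy dictionary, for illustration only.
TOY_DICT = {"我", "喜歡", "機器", "翻譯", "機器翻譯"}

# Basic CJK Unified Ideographs block only (an assumption made for brevity).
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def han_runs(text):
    """Group consecutive Han characters; only punctuation breaks a run."""
    return HAN_RUN.findall(text)


def max_match(run, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(run):
        for size in range(min(max_len, len(run) - i), 0, -1):
            candidate = run[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


text = "我喜歡機器翻譯，你呢？"
runs = han_runs(text)                               # clause-sized chunks
segmented = [max_match(r, TOY_DICT) for r in runs]  # word-sized chunks
print(runs)
print(segmented)
```

The first print shows the clause-level grouping you get from the script alone; the second shows the word-level grouping that needs a dictionary (or a model) on the processing side, with no markup from the writer.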
btw, you might want to look up the GB/T 13715 standard from mainland China
(PRC) or the CNS 14366 standard from Taiwan (ROC) for standards that
discuss how to handle word segmentation when processing Chinese with
On 22 July 2015 at 7:37 AM, "Richard Wordingham" <richard.wordingham at ntlworld.com> wrote:
> On Tue, 21 Jul 2015 18:10:14 +0800
> gfb hjjhjh <c933103 at gmail.com> wrote:
> > When you write text in modern Chinese, there will not be any break
> > between different words, and thus if you segment characters according
> > to the ideographic characters, what gets grouped together would
> > either be a clause or a sentence, or even a whole paragraph if you
> > are handling some older text without punctuation.
> I had another look at Chinese word breaking algorithms today and saw
> that their practical purposes were mostly indexing and machine
> translation. Consequently, I suspect that authors have little
> incentive to mark word boundaries in the texts they originate. This
> differs from the Thai situation where marking word boundaries improves
> layout and spell-checking.