On 28 December 2017 at 5:34 AM, "Karl Williamson via Unicode" <unicode_at_unicode.org> wrote:
>
> In UTS 39, it says, that optionally,
>
> "Mark Chinese strings as “mixed script” if they contain both simplified
> (S) and traditional (T) Chinese characters, using the Unihan data in the
> Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is known
> to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"?
It means the string is written in the Chinese language, as opposed to
Japanese, older Korean or Vietnamese text that uses Chinese characters, or
any other language that uses Chinese characters.
As far as I know, some Chinese dialects/variants also use Simplified and
Traditional characters together, with different etymologies, and those
probably shouldn't be considered mixed script either, although they aren't
really common and aren't mentioned in the UTS.
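For illustration, the S/T check quoted above could be sketched roughly as
follows. This is only a sketch under stated assumptions: the two small
character sets are hand-picked examples standing in for the Unihan variant
data, the function name is made up for this message, and whether the language
is Chinese has to be supplied by the caller.

```python
# Hand-picked sample pairs (assumption); a real check would load the Unihan
# variant data instead of these tiny illustrative sets.
SIMPLIFIED_SAMPLE = set("国学汉语")    # simplified forms
TRADITIONAL_SAMPLE = set("國學漢語")   # the corresponding traditional forms


def is_mixed_simplified_traditional(text, language_is_chinese):
    """Return True if text mixes S and T characters from the sample sets.

    The caller must state whether the language is known to be Chinese;
    this function makes no attempt to detect it.
    """
    if not language_is_chinese:
        return False  # the criterion does not apply to other languages
    has_s = any(ch in SIMPLIFIED_SAMPLE for ch in text)
    has_t = any(ch in TRADITIONAL_SAMPLE for ch in text)
    return has_s and has_t
```

Note that the dialect usage mentioned above would be flagged by a check like
this, which is one reason to treat the result as a hint and not a verdict.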
> Is this something algorithmically determinable, or does it come from
> information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text. That clearly
> indicates the language isn't Chinese. So in this example we can
> algorithmically rule out that it's Chinese.
Usually, when there are Japanese kana in the mix, the text is Japanese
and not Chinese. However, the reverse is not necessarily true, especially
for a single word or short phrase, older-style text, and the like, where a
string containing only Chinese characters can still be Japanese text.
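A minimal sketch of the kana test, in Python. The function name is made up
for this message, and matching on character names is only an approximation
of the Unicode Script property, which the standard unicodedata module does
not expose.

```python
import unicodedata


def contains_kana(text):
    """Return True if any character's Unicode name mentions HIRAGANA or KATAKANA.

    Approximation: this matches on character names, not on the Script
    property, so a few shared marks with such names are counted as kana too.
    """
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if "HIRAGANA" in name or "KATAKANA" in name:
            return True
    return False
```

A True result is a strong hint that the text is Japanese. A False result says
nothing either way: a kanji-only string such as 写真 can still be Japanese.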
>
> And what does Chinese really mean here?
>
The written form of the (Mandarin) Chinese language?
Received on Wed Dec 27 2017 - 16:20:35 CST