From: John Jenkins (jenkins@apple.com)
Date: Sat Oct 11 2003 - 18:19:04 CST
On 2003-10-10, at 2:48 PM, Magda Danish (Unicode) wrote:
>> My problem is to recognize, from the 32-bit value of a Unicode
>> character, whether it is a Chinese, Korean, or Japanese character.
>> How can I do this?
>>
It's basically impossible and largely meaningless.  It's the equivalent 
of asking if "a" is an English letter or a French one.  There are 
*some* characters where one can guess based on the source information 
in Unihan.txt that it's traditional Chinese, simplified Chinese, 
Japanese, Korean, or Vietnamese, but there are too many exceptions to 
make this really reliable.  (For example, one particularly nasty 
obscenity in Cantonese would probably have never been encoded for 
Cantonese, but has made it in for the sake of Korean, where one hopes 
it isn't nearly as obscene.)
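To make the "source information" guess concrete, here is a minimal sketch in Python. It assumes the tab-separated Unihan line format (code point, field name, value) and the kIRG_*Source field names; the helper name irg_sources and the sample lines are only illustrative, and the sample values have not been checked against the current database. It only reports which sources submitted a character, which, as noted above, is not a reliable indicator of language.

```python
def irg_sources(unihan_lines):
    """Map each code point to the set of IRG source tags listed for it.

    Expects lines shaped like 'U+4E00<TAB>kIRG_GSource<TAB>G0-523B'.
    Tags are the letters between 'kIRG_' and 'Source', e.g. 'G', 'J', 'KP'.
    """
    sources = {}
    for line in unihan_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a 'code point, field, value' record
        codepoint, field, _value = parts
        if field.startswith("kIRG_") and field.endswith("Source"):
            tag = field[len("kIRG_"):-len("Source")]
            sources.setdefault(int(codepoint[2:], 16), set()).add(tag)
    return sources


# Illustrative sample lines (hypothetical input, not a real database dump).
sample = [
    "# comment line",
    "U+4E00\tkIRG_GSource\tG0-523B",
    "U+4E00\tkIRG_JSource\tJ0-306C",
    "U+4E00\tkCantonese\tjat1",
]
print(irg_sources(sample))
```

A character listed under only one source hints at where it was submitted from, but most common ideographs appear under several, and the exceptions run in both directions.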
The phonetic data in Unihan.txt should not be used for this purpose.  A 
blank in the phonetic data means that nobody's supplied a reading, not 
that a reading doesn't exist.  Because updating the Unihan database is 
an ongoing process, these fields will be increasingly filled out as 
time goes on, but they should never be taken as absolutely complete.  
In particular, there are obscure characters where it is known that 
there *is* a reading, but since the character does not occur in 
standard dictionaries, we are unable to supply it (e.g., U+40DF in 
Cantonese).
A better solution is to look at the text as a whole:  if there's a fair 
amount of kana, it's probably Japanese, and if there's a fair amount of 
hangul, it's probably Korean.
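A rough sketch of that whole-text heuristic, in Python. The function name guess_cjk_language, the 5% threshold, and the particular block ranges counted are my own assumptions for illustration, not anything defined by Unicode. Text containing ideographs but no kana or hangul is deliberately left undetermined.

```python
def guess_cjk_language(text, threshold=0.05):
    """Return 'ja', 'ko', or None based on the share of kana or hangul.

    The threshold (an arbitrary illustrative value) is the minimum
    fraction of kana or hangul among all counted CJK characters.
    """
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:
            kana += 1    # Hiragana and Katakana blocks
        elif (0xAC00 <= cp <= 0xD7A3        # Hangul syllables
              or 0x1100 <= cp <= 0x11FF     # Hangul Jamo
              or 0x3130 <= cp <= 0x318F):   # Hangul Compatibility Jamo
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            han += 1     # CJK Unified Ideographs and Extension A
    total = kana + hangul + han
    if total == 0:
        return None      # no CJK text at all
    if kana >= hangul and kana / total >= threshold:
        return "ja"
    if hangul / total >= threshold:
        return "ko"
    return None          # ideographs only: cannot tell from code points


print(guess_cjk_language("これは日本語の文章です"))
print(guess_cjk_language("한국어 문장입니다"))
print(guess_cjk_language("中文文本"))
```

This is only a guess about the text as a whole; it says nothing about any individual ideograph, and short or mixed-language strings can easily fool it.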
The only proper mechanism, as for determining whether "chat" is 
spelled correctly in English or French, is to use a higher-level 
protocol.
========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/