RE: How to tell Japanese from Chinese.

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri Jun 08 2001 - 13:38:03 EDT


On Fri, 8 Jun 2001, Marco Cimarosti wrote:

> てんど wrote:
> > Doesn't this kanji
> > <bad-ascii-art>
> > |
> > -------
> > /
> > _/
> > _/
> > / |____
> > </bad ascii art>
> > NOT to be confused with hiragana "e" (oy vey),
> > usually only appear in Chinese?
>
> It seems not. Altavista brings up 721,100 web pages in Japanese containing
> "之" (U+4E4B):
>
> http://www.altavista.com/sites/search/web?q=%E4%B9%8B&kl=ja&search=Search&pg=q
>
> ... it also found 13,780 Korean pages, despite the fact that Korean is
> mostly written in Hangul:

> ... the funny thing is that it also found 56,565 pages in English:
>
> http://www.altavista.com/sites/search/web?q=%E4%B9%8B&kl=en&search=Search
>
> ... and 5 in Italian
>
> http://www.altavista.com/sites/search/web?q=%E4%B9%8B&kl=it&search=Search

  Well, Altavista doesn't seem to know anything about document
encodings, nor is its language detection reliable. (Did you check some
of the hits it came up with for English? Some of the pages at the top
are not in English but in Thai!) So, when asked to find pages with the
octet sequence E4 B9 8B (the UTF-8 representation of U+4E4B), it blindly
looks for that octet sequence regardless of the encoding each document
uses. This would have worked more or less if the majority of web pages
were written in UTF-8. However, that's not the case, even though we hope
it will be in the near future. As a result, you ended up with numerous
false hits.
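
To make the failure mode concrete, here is a minimal Python sketch of an
encoding-unaware search next to an encoding-aware one. The three sample
"pages" and the function names are invented for illustration; nothing
here is claimed to be how Altavista actually works.

```python
# Sketch only: the sample "pages" below are made up for illustration.
TARGET = "\u4e4b"                 # the character U+4E4B
NEEDLE = TARGET.encode("utf-8")   # its UTF-8 octets: E4 B9 8B


def blind_match(raw_page: bytes) -> bool:
    """Encoding-unaware search: look for the UTF-8 octets anywhere."""
    return NEEDLE in raw_page


def aware_match(raw_page: bytes, encoding: str) -> bool:
    """Decode with the page's real encoding, then look for the character."""
    return TARGET in raw_page.decode(encoding, errors="replace")


# name -> (raw bytes, actual encoding) for three hypothetical pages
PAGES = {
    # Really contains U+4E4B, stored as UTF-8.
    "utf8_page": (("x" + TARGET + "y").encode("utf-8"), "utf-8"),
    # Windows-1252 text whose bytes happen to be E4 B9 8B.
    "cp1252_page": ("\u00e4\u00b9\u2039".encode("cp1252"), "cp1252"),
    # Really contains U+4E4B, stored as EUC-KR.
    "euckr_page": (TARGET.encode("euc_kr"), "euc_kr"),
}

for name, (raw, enc) in PAGES.items():
    print(name, blind_match(raw), aware_match(raw, enc))
```

The blind search reports a hit on the Windows-1252 page, which does not
contain U+4E4B, and misses the EUC-KR page, which does.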

  What you have to do instead is search for the octet sequence that
represents U+4E4B in each encoding widely used for the language.
For instance, to find Korean pages with U+4E4B, you have to use

  http://www.altavista.com/sites/search/web?q=%F1%FD&kl=ko&search=Search&pg=q

where 'F1 FD' is the EUC-KR representation of U+4E4B. For Japanese,
you have to run two or three searches (EUC-JP, Shift_JIS, and possibly
ISO-2022-JP).
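
The per-encoding octet sequences can be derived rather than looked up by
hand. The following Python sketch, using only the standard codecs and
urllib.parse.quote, prints the octets and the percent-encoded form to
put in the q= parameter. The helper name query_octets and the choice of
encodings are assumptions for illustration; ISO-2022-JP is left out
because it wraps the character in escape sequences, so its raw octets
are less useful as a search key.

```python
from urllib.parse import quote

CHAR = "\u4e4b"  # U+4E4B

# Python codec names for encodings mentioned in this message.
ENCODINGS = ["utf-8", "euc_kr", "euc_jp", "shift_jis"]


def query_octets(ch: str, encoding: str) -> tuple[str, str]:
    """Return (hex octets, percent-encoded form) of ch in the given encoding."""
    raw = ch.encode(encoding)
    hex_octets = " ".join(f"{b:02X}" for b in raw)
    # quote() accepts bytes; safe="" forces every octet to be escaped.
    return hex_octets, quote(raw, safe="")


for enc in ENCODINGS:
    hex_octets, escaped = query_octets(CHAR, enc)
    print(f"{enc:10} {hex_octets:10} q={escaped}")
```

For UTF-8 and EUC-KR this reproduces the values quoted in this thread
(E4 B9 8B and F1 FD respectively).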

  How many did I get? Only 6 (instead of 13k) :-). Moreover, none of
them is in Korean. (I don't know why Altavista thought they were
Korean. Some of them even have a meta tag specifying the MIME charset
ISO-8859-1 or Windows-1252!) Of course, this does NOT mean that there
are no Korean pages with U+4E4B. There should be some (e.g. pages
annotating Korean/Chinese classics with Hangul, or pages with widely
quoted proverbs, maxims, and phrases from Chinese/Korean classics, which
can be written in both Hangul and Chinese characters).

  Jungshik Shin


