RE: How to tell Japanese from Chinese.

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Jun 11 2001 - 05:11:49 EDT


Jungshik Shin wrote:
> On Fri, 8 Jun 2001, Marco Cimarosti wrote:
> > > Doesn't this kanji
> > > [...]
> > > usually only appear in Chinese?
> > It seems not. Altavista brings up 721,100 web pages in
> > Japanese [...]
> On Fri, 8 Jun 2001, Marco Cimarosti wrote:
> > ... it also found 13,780 Korean pages, [...]
> > ... the funny thing is that it also found 56,565 pages in English:
> > ... and 5 in Italian [...]
> > http://www.altavista.com/sites/search/web?q=%E4%B9%8B&kl=it&se
>
> Well, Altavista doesn't seem to know anything about the document
> encoding. Neither is its ability to detect language of web pages
> reliable.
> (did you check some of hits it came up with for English? Some of pages at
> the top are not in English but in Thai !!) [...]

I hadn't noticed that the encoding detection was also so faulty. Most of the
pages I opened actually contained the Chinese ideographs I was looking for.

But, yes, I noticed how "accurate" Altavista's language detection is. In
fact, I included the 5 "Italian" results mainly for fun: the first two are
actually pages in Chinese talking about Italian wine and opera music. The
third one is a bilingual diplomatic document (the same text in Chinese and
Italian, so there is no reason to say that it is more "Italian" than
"Chinese"). The fourth page is actually a Japanese page about Italian drug
regulations (which is excusable, as most of the text is actually Italian).

Only the last one is in Italian, containing an unintentional Chinese
quotation; the sender ("the pirate") says: "The web that you said only
contains this: [Chinese quotation] so probably something doesn't work". A
nice example of how encoded text can be transmitted intact even by computers
that cannot decode it.
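
To make that last point concrete, here is a minimal Python sketch. It assumes
that the query in the Altavista URL quoted above (q=%E4%B9%8B) is
percent-encoded UTF-8, and that the intermediate system is 8-bit clean and
treats every byte as a Latin-1 character. The variable names are only
illustrative, and the encoding actually used in the Italian message is not
known.

```python
from urllib.parse import unquote_to_bytes

# The query value from the quoted URL, taken as raw bytes (E4 B9 8B).
raw = unquote_to_bytes("%E4%B9%8B")

# Assumption: a system unaware of UTF-8 reads each byte as a Latin-1
# character. What it displays is meaningless, but no byte is lost.
misread = raw.decode("latin-1")

# Passing the text on writes the very same bytes back out.
forwarded = misread.encode("latin-1")

# A receiver that knows the real encoding gets the ideograph back.
recovered = forwarded.decode("utf-8")

print(len(misread), "characters seen by the intermediate system")
print(f"U+{ord(recovered):04X} recovered by the final reader")
```

The round trip works only because Latin-1 maps every byte value to exactly
one character. A system that replaces or strips bytes it does not understand
would not pass the quotation through intact.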

_ Marco


