From: Edward H Trager (ehtrager@umich.edu)
Date: Thu Feb 13 2003 - 15:20:34 EST
On Fri, 14 Feb 2003, Paul Hastings wrote:
> > So I think Zhang Weiwu is suggesting a heuristic algorithm for
> > discriminating a unicode text which is already known, or assumed to be, in
> > Chinese.
>
> well the site will deliver chinese content w/doublechecking browser locale,
> etc. so yes, most likely chinese users.
>
> > to encounter at least one "ge" u+500B or u+4E2A? One "wei" u+70BA or
> > u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take long to figure
> > this out.
>
> might for me ;-)
>
> > Marco Cimarosti has questioned, why do you need to classify text as being
> > simplified or traditional?
>
> if i understand their needs correctly, it's to implement a search system with
> search phrases of either "type" of chinese--content would be in both types.
>
> > So, basically all you would be doing is providing a convenience for your
> > readers, making it easier on their eyes to read your web documents in
> > either traditional or simplified according to their preference. I know
> > that something like that would help me -- sometimes I forget the
> > traditional version of a character, and sometimes I forget the simplified
> > version. It would be very cool if I could just press a button on a web
> > site to switch the display between the two ;-) .
>
> from what i understand this isn't something they've considered but sounds
> pretty cool.
>
In order to implement the search system, they would need to implement the
routine I described to swap the simplified <--> traditional characters in
the search expression, at least for the most common characters in use today.
The same routine could be used for the display of the documents returned.
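
A minimal sketch of such a swap routine in Python, for illustration only.
The mapping table is an assumption: it holds just the three pairs mentioned
in this thread (ge, wei, shuo), whereas a real table would need to cover at
least the most common characters. The function names are hypothetical. It
also assumes a one-to-one mapping, which is a simplification, since some
simplified characters correspond to more than one traditional character.

```python
# Hypothetical mapping table: only the three pairs mentioned in this thread.
TRAD_TO_SIMP = {
    "\u500B": "\u4E2A",  # ge
    "\u70BA": "\u4E3A",  # wei
    "\u8AAC": "\u8BF4",  # shuo
}
# Reverse table, valid only because this toy mapping is one-to-one.
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}


def to_simplified(text):
    """Replace each known traditional character with its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def to_traditional(text):
    """Replace each known simplified character with its traditional form."""
    return "".join(SIMP_TO_TRAD.get(ch, ch) for ch in text)


def query_variants(query):
    """Return the distinct forms of a search expression, original first."""
    variants = [query, to_simplified(query), to_traditional(query)]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(variants))
```

A search front end could pass every string returned by query_variants() to
the index, and the same two conversion functions could be applied to the
documents returned before display.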
Now the relevant question is: How does Google do it?
Well, I tried the search term "xiao shuo" (= a novel) (u+5C0F u+8BF4 in
simplified Chinese, u+5C0F u+8AAC in traditional characters) in Google.
If I use Google's "Language Tools" to select a search domain of either
"simplified" or "traditional" Chinese, I get a different set of results
for the "simplified" vs. the "traditional" pages. Not surprisingly, I get
a lot more "simplified" results (the Mainland is bigger than Taiwan). If
I select "All languages", then I get a *really* big set of results from
Google, because now all of the Japanese pages with "u+5C0F u+8AAC" get
thrown into the mix. This is clearly not what I want! What I would
really want is to be able to type in *either* "u+5C0F u+8AAC" *or*
"u+5C0F u+8BF4", have an option to specify "Chinese" pages (so as to avoid
the Japanese pages that I don't want to see in the result set), and get
the combined set of simplified and traditional pages back from Google.
An additional filter/option should then be available to filter "only"
simplified or traditional pages.
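
The "only simplified" or "only traditional" filter could be sketched along
the lines of the marker-character heuristic discussed earlier in this
thread. This is a rough Python illustration, not how Google or anyone else
actually does it: the marker sets contain only the three pairs named above,
the function names are hypothetical, and the text is assumed to be already
known to be Chinese.

```python
# Assumed marker sets: only the characters mentioned in this thread.
TRADITIONAL_MARKERS = {"\u500B", "\u70BA", "\u8AAC"}  # ge, wei, shuo
SIMPLIFIED_MARKERS = {"\u4E2A", "\u4E3A", "\u8BF4"}   # ge, wei, shuo


def classify(text):
    """Guess the script of Chinese text by counting marker characters."""
    trad = sum(ch in TRADITIONAL_MARKERS for ch in text)
    simp = sum(ch in SIMPLIFIED_MARKERS for ch in text)
    if trad > simp:
        return "traditional"
    if simp > trad:
        return "simplified"
    return "unknown"  # no markers found, or an even mix


def filter_pages(pages, script=None):
    """Keep all pages when script is None, else only pages of that script."""
    if script is None:
        return list(pages)
    return [page for page in pages if classify(page) == script]
```

Calling filter_pages(results) returns the combined set, and passing
script="simplified" or script="traditional" narrows it to one kind.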
So, based on my relatively simple test with Google, it seems that search
engines still leave something to be desired, and Paul Hastings's
people/clients/whoever-they-are may have an idea well worth pursuing.
It would also be worthwhile pursuing this issue with the Google folks.