From: Ed Trager (ed.trager@gmail.com)
Date: Thu Sep 20 2007 - 09:36:10 CDT
On 9/19/07, Kenneth Whistler <kenw@sybase.com> wrote:
> Note for example, that if you are mixing together language
> data from different sources, you may have to keep track
> of and mark orthographic differences in that data.
> To do comparative searching in such a corpus, you will
> need to be able to do "orthographic folding" -- i.e. be
> able to take one chunk of data in orthography A and
> convert it into orthography B before comparing. Unless
> you are really, really sure of what you are doing, it
> is better to leave the original material as it is,
> and build the orthographic conversions into the application.
As an experiment which illustrates Ken Whistler's point, I tried
searching for "网路" (="network" in simplified (简体) Chinese) in the
PanImages prototype: there were no search results, so I added this
term, matching it to the English word "network". I then searched for
"網路" (="network" in traditional (正體) Chinese) and again there were no
search results. This result suggests that PanImages does not yet
handle the very prevalent case of orthographic folding for Chinese.
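To make the idea concrete, here is a minimal sketch of folding at search time, in the spirit of Ken's advice to leave the stored data alone and put the conversion in the application. The three-character table and the function names (fold, build_index, search) are my own assumptions for illustration and say nothing about how PanImages is actually built. A real table would be far larger, and traditional/simplified conversion is not strictly one-to-one.

```python
# Hypothetical, hand-made mapping for illustration only. A real table would be
# far larger, and traditional <-> simplified conversion is not always 1:1.
TRAD_TO_SIMP = {
    "網": "网",
    "體": "体",
    "簡": "简",
}


def fold(text: str) -> str:
    """Fold traditional characters to simplified forms, for comparison only."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def build_index(entries):
    """Map each folded headword to the original (headword, gloss) pairs."""
    index = {}
    for headword, gloss in entries:
        index.setdefault(fold(headword), []).append((headword, gloss))
    return index


def search(index, query):
    """Fold the query the same way as the index keys before looking it up."""
    return index.get(fold(query), [])


# The stored entry keeps its original orthography; only the keys are folded.
entries = [("网路", "network")]
index = build_index(entries)

print(search(index, "網路"))  # traditional query
print(search(index, "网路"))  # simplified query
```

With folding applied on both the index side and the query side, either spelling should find the same stored entry, and the entry itself is never rewritten.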
Although I did add both the simplified and traditional forms of "网路"
to the PanImages database, I was not very satisfied when I looked at
some of the non-English matches for "network" to which "网路" now
appeared to be attached.
A simple example: the English word "network" is matched with the French
"construire un réseau", so my "网路" entry, which is only a noun and not
a verbal form in Chinese, may now be incorrectly associated with verbal
definitions such as "construire un réseau" in the PanImages database.
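One possible guard against this kind of mismatch is to record a part of speech with each entry and only link terms through the English pivot word when the parts of speech agree. The sketch below is purely illustrative: the Entry record, the part-of-speech tags, the linked_terms function, and the extra French noun "réseau" are all my own assumptions and are not taken from the PanImages database.

```python
from collections import namedtuple

# Hypothetical record layout: language, term, part of speech, English pivot.
Entry = namedtuple("Entry", "lang term pos pivot")

# Illustrative data only. The part-of-speech tags and the "réseau" noun entry
# are assumptions added for this example.
entries = [
    Entry("zh", "网路", "noun", "network"),
    Entry("fr", "réseau", "noun", "network"),
    Entry("fr", "construire un réseau", "verb", "network"),
    Entry("fr", "parcourir", "verb", "network"),
]


def linked_terms(entries, source, require_same_pos=True):
    """Return terms that share the source's pivot word.

    When require_same_pos is True, terms whose part of speech differs
    from the source entry are skipped.
    """
    return [
        e.term
        for e in entries
        if e is not source
        and e.pivot == source.pivot
        and (not require_same_pos or e.pos == source.pos)
    ]


zh_network = entries[0]
print(linked_terms(entries, zh_network))                          # nouns only
print(linked_terms(entries, zh_network, require_same_pos=False))  # everything
```

Matching on the pivot word alone links "网路" to the two verbal French entries as well, while requiring the same part of speech keeps only the noun.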
In summary, my brief review of the PanImages prototype suggests that
there is much work remaining to be done. I am not sure whether the
creators of PanImages have completely grasped the problem domain they
are working in. Orthographic folding is just one problem for which
the solution can be non-trivial in many cases (Chinese is a case in
point).
Beyond this, there are numerous differences in the meanings and usage
domains of words in different languages. Perhaps the "PanImages"
project team should initially focus their efforts on just nouns so
that people can easily find pictures that represent other people's
ideas of "network" but perhaps not people's ideas of "social
networking" or "parcourir" (The English term "network" also matches to
the French verb "parcourir" in the PanImages database). It would be
quite lame if the Chinese word "网路" becomes associated with images
from Flickr and Google that represent "social networking", "construire
un réseau", or "parcourir".
- Ed
unifont.org