Re: Normalization in panlingual application

From: Ed Trager (ed.trager@gmail.com)
Date: Thu Sep 20 2007 - 09:36:10 CDT


    On 9/19/07, Kenneth Whistler <kenw@sybase.com> wrote:

    > Note for example, that if you are mixing together language
    > data from different sources, you may have to keep track
    > of and mark orthographic differences in that data.
    > To do comparative searching in such a corpus, you will
    > need to be able to do "orthographic folding" -- i.e. be
    > able to take one chunk of data in orthography A and
    > convert it into orthography B before comparing. Unless
    > you are really, really sure of what you are doing, it
    > is better to leave the original material as it is,
    > and build the orthographic conversions into the application.

    As an experiment which illustrates Ken Whistler's point, I tried
    searching for "网路" (="network" in simplified (简体) Chinese) in the
    PanImages prototype: there were no search results, so I added this
    term, matching it to the English word "network". I then searched for
    "網路" (="network" in traditional (正體) Chinese) and again there were no
    search results. This result suggests that PanImages does not yet
    handle the very prevalent case of orthographic folding for Chinese.
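
To make the idea concrete, here is a minimal sketch of orthographic folding at comparison time, along the lines Ken Whistler describes. It is not how PanImages works. The names `TRAD_TO_SIMP`, `fold`, `entries`, and `search` are hypothetical, and the three-character table is a toy. Real traditional/simplified conversion is not one-to-one and needs a much larger, context-aware resource.

```python
# Hypothetical, deliberately tiny traditional-to-simplified table.
# Assumption: a per-character mapping is enough for this illustration;
# real conversion is context-dependent and far larger.
TRAD_TO_SIMP = {"網": "网", "體": "体", "簡": "简"}


def fold(text: str) -> str:
    """Return a comparison key for text; the input itself is not modified."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# Stored entries stay exactly as contributed (orthography A or B).
entries = {"网路": "network"}

# Only the lookup index uses the folded form.
index = {fold(term): gloss for term, gloss in entries.items()}


def search(query: str):
    """Fold the query the same way as the index keys before comparing."""
    return index.get(fold(query))


print(search("网路"))  # simplified query
print(search("網路"))  # traditional query, folded before lookup
```

Both queries reach the same entry because the folding lives in the application, while the stored term is left in its original orthography.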

    Although I did add both the simplified and traditional forms of "网路"
    to the PanImages database, I was not very satisfied when I looked at
    some of the non-English matches for "network" to which "网路" now
    appeared to be attached.

    A simple example: the English word "network" is matched with the French
    "construire un réseau". My "网路" entry, which is only a noun and not a
    verbal form in Chinese, may therefore now be incorrectly associated with
    verbal definitions such as "construire un réseau" in the PanImages
    database.

    In summary, my brief review of the PanImages prototype suggests that
    there is much work remaining to be done. I am not sure whether the
    creators of PanImages have completely grasped the problem domain they
    are working in. Orthographic folding is just one problem for which
    the solution can be non-trivial in many cases (Chinese is a case in
    point).

    Beyond this, there are numerous differences in the meanings and usage
    domains of words in different languages. Perhaps the "PanImages"
    project team should initially focus their efforts on just nouns, so
    that people can easily find pictures that represent other people's
    ideas of "network" but perhaps not people's ideas of "social
    networking" or "parcourir" (the English term "network" also matches
    the French verb "parcourir" in the PanImages database). It would be
    quite lame if the Chinese word "网路" becomes associated with images
    from Flickr and Google that represent "social networking", "construire
    un réseau", or "parcourir".
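
One way to picture the noun-only restriction is to tag each sense with a part of speech and link a new entry only to senses with the same tag. The sketch below assumes that design. `Sense`, `NETWORK_SENSES`, and `compatible_matches` are hypothetical names, the data is illustrative rather than taken from the PanImages database, and the French noun "réseau" is added here only as an example.

```python
from typing import List, NamedTuple


class Sense(NamedTuple):
    language: str
    term: str
    pos: str  # part of speech, e.g. "noun" or "verb"


# Hypothetical senses grouped under the English word "network".
NETWORK_SENSES = [
    Sense("fr", "réseau", "noun"),
    Sense("fr", "construire un réseau", "verb"),
    Sense("fr", "parcourir", "verb"),
]


def compatible_matches(new_entry: Sense, candidates: List[Sense]) -> List[Sense]:
    """Keep only candidates whose part of speech matches the new entry."""
    return [c for c in candidates if c.pos == new_entry.pos]


new_entry = Sense("zh", "网路", "noun")
for match in compatible_matches(new_entry, NETWORK_SENSES):
    print(match.term)
```

With this filter the Chinese noun is linked to the noun sense only, and the verbal senses are left out.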

    - Ed
      unifont.org


