Normalization in panlingual application

From: Jonathan Pool (pool@utilika.org)
Date: Wed Sep 19 2007 - 15:40:32 CDT

    The University of Washington Turing Center is developing Web applications that
    deal with lexical data in, and interact with users via, an unlimited set of
    languages. The first public prototype of one of these applications is
    PanImages, available for testing at

    http://www.panimages.org

    (press release at http://uwnews.org/article.asp?articleID=36524).

    Initial data have been compiled from over 350 machine-readable bilingual and
    multilingual dictionaries, and additional data are contributed by users.

    This work wouldn't be practical without near-universal adoption of the Unicode
    standard, but it still forces a choice among the Unicode normalization forms,
    even though the developers can't be familiar with more than a few of the
    affected languages.
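
    For readers less familiar with the four normalization forms, here is a
    minimal sketch (Python, standard-library unicodedata only; the sample
    string is illustrative, not drawn from PanImages data) of how the forms
    diverge on a string containing a compatibility ligature and a combining
    mark:

        import unicodedata

        # LATIN SMALL LIGATURE FI + "che" + COMBINING ACUTE ACCENT
        sample = "\ufb01che\u0301"

        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            result = unicodedata.normalize(form, sample)
            print(form, [unicodedata.name(ch) for ch in result])

    NFC and NFD leave the ligature intact and differ only in whether the
    accent is composed; the K forms additionally replace the ligature with
    plain "f" + "i".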

    In my work on another of these applications, I'm tentatively planning to
    normalize all input to NFKD. I'm concerned, though, that (1) some valid
    distinctions might thereby be erased, (2) some invalid distinctions might
    survive, and (3) some user agents might misrender decomposed strings.
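
    As a concrete illustration of concern (1), here is a small sketch
    (Python again; the pairs are illustrative guesses at distinctions that
    could matter in some language's lexical data, not documented PanImages
    cases) of strings that are distinct until NFKD collapses them:

        import unicodedata

        pairs = [
            ("2\u00b2", "22"),                 # "2 squared" vs. "twenty-two"
            ("\u2168", "IX"),                  # ROMAN NUMERAL NINE vs. I + X
            ("\u0132sselmeer", "IJsselmeer"),  # Dutch IJ ligature vs. I + J
        ]

        for a, b in pairs:
            collide = unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)
            print(f"{a!r} vs. {b!r}: collide under NFKD = {collide}")

    Running a candidate word list through a check like this is one cheap way
    to see which distinctions a given form would erase before committing to
    it.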

    Any thoughts about the best approach to normalization for PanImages and other
    applications using the same database would be welcome.


