From: Jonathan Pool (pool@utilika.org)
Date: Wed Sep 19 2007 - 15:40:32 CDT
The University of Washington Turing Center is developing Web applications that
handle lexical data in, and interact with users in, an unlimited set of languages. The
first public prototype of one of these applications is PanImages, available
for testing at
(press release at http://uwnews.org/article.asp?articleID=36524).
Initial data have been compiled from over 350 machine-readable bilingual and
multilingual dictionaries, and additional data are contributed by users.
This work wouldn't be practical without near-universal adoption of the Unicode
standard, but it still forces a choice among the Unicode normalization forms,
even though the developers can't be familiar with more than a few of the
affected languages.
In my work on another of these applications, I'm tentatively planning to
normalize all input to NFKD. I'm concerned, though, that (1) some valid
distinctions might thereby be erased, (2) some invalid distinctions may
survive, and (3) some user agents may misrender decomposed strings.
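For concreteness, here is a minimal sketch, using Python's standard unicodedata
module, of how the four normalization forms treat a few sample strings; the
samples are my own illustrations, not anything drawn from the PanImages data:

import unicodedata

# Illustrative samples (my own, not from the PanImages data):
samples = {
    "fi ligature":         "\ufb01le",   # "ﬁle"  -- NFKC/NFKD fold U+FB01 to "fi"
    "superscript two":     "m\u00b2",    # "m²"   -- NFKC/NFKD fold U+00B2 to "2"
    "precomposed e-acute": "caf\u00e9",  # "café" -- NFD/NFKD decompose U+00E9
    "Latin vs Cyrillic a": "a \u0430",   # look-alikes that no form unifies
}

for label, s in samples.items():
    print(label)
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        t = unicodedata.normalize(form, s)
        print("  %-4s %r" % (form, t))

Running this shows, for instance, that NFKD folds the superscript in "m²" to a
plain "2" and the ligature in "ﬁle" to "fi" (concern 1), leaves the Latin and
Cyrillic look-alike letters distinct (concern 2), and turns "café" into "cafe"
followed by a combining acute accent, which some user agents render poorly
(concern 3).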
Any thoughts about the best approach to normalization for PanImages and other
applications using the same database would be welcome.