Re: Hanzi trad-simp folding and z-variants from Stephan Stiller on 2013-06-07 (Unicode Mail List Archive)

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 07 Jun 2013 13:00:42 -0700

Hi John,

This is one of those questions that I've been wondering about as well
... my guess would be "yes, that should work (and dealing with z-variants
is something you'll likely need to do anyway)", but there *must* be
some published algorithm out there that specifically addresses the issue
of differentiable and recoverable folding for indexing.
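
To make "recoverable folding" concrete, here is a minimal sketch and not
a published algorithm: an index keyed by the folded form that still keeps
each original surface form, so a query can match across variants or be
restricted to the exact form. The three-entry FOLD table and the names
fold and FoldedIndex are made up for illustration; a real table would
have to be built from Unihan data such as kSimplifiedVariant and
kZVariant, restricted to the unambiguous mappings.

```python
from collections import defaultdict

# Hypothetical, hand-picked traditional -> simplified pairs for illustration;
# a real table would be derived from Unihan data (e.g. kSimplifiedVariant).
FOLD = {"學": "学", "國": "国", "語": "语"}


def fold(text):
    """Map each character to its folded form; unknown characters pass through."""
    return "".join(FOLD.get(ch, ch) for ch in text)


class FoldedIndex:
    """Index keyed by folded form that keeps every original surface form."""

    def __init__(self):
        # folded term -> {original term -> set of document ids}
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, term):
        self._postings[fold(term)][term].add(doc_id)

    def search(self, query, exact=False):
        """Return doc ids matching the folded query, or only the exact form."""
        forms = self._postings.get(fold(query), {})
        if exact:
            return set(forms.get(query, set()))
        return set().union(*forms.values()) if forms else set()


index = FoldedIndex()
index.add(1, "國語")  # traditional form
index.add(2, "国语")  # simplified form
```

With this layout, index.search("国语") matches both documents, while
index.search("國語", exact=True) matches only the one that actually used
the traditional form, so nothing is lost by folding at index time.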

This comes up in NLP all the time for case folding. My impression is
that the folks there just fold everything into lowercase and later apply
a so-called truecasing algorithm (aka truecaser). To someone like me
this just seems like totally the wrong approach, but I'm open to being
convinced otherwise by the right empirical arguments.
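
For readers who haven't seen one, a toy version of the fold-then-truecase
idea might look like the sketch below. It is only a unigram baseline
under my own assumptions (the function names and the tiny training
sentence are invented here), not what any particular NLP toolkit does:
it restores, for each lowercased token, whichever casing was seen most
often in training.

```python
from collections import Counter, defaultdict


def train_truecaser(tokens):
    """Count how often each surface form occurs for every lowercased token."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[tok.lower()][tok] += 1
    return counts


def truecase(tokens, counts):
    """Replace each token by the most frequent casing seen in training."""
    out = []
    for tok in tokens:
        seen = counts.get(tok.lower())
        # Tokens never seen in training are left exactly as they came in.
        out.append(seen.most_common(1)[0][0] if seen else tok)
    return out


counts = train_truecaser("Unicode is a standard and Unicode is big".split())
restored = truecase("unicode is big".split(), counts)
```

Note that this is a guess at the original casing rather than a recovery
of it: once the text has been folded, the information is gone and can
only be estimated from statistics.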

If you find some information on data structures and algorithms tailored
to this problem in the area of indexing/querying, let me know.

Stephan

On 6/6/2013 12:54 PM, John D. Burger wrote:
> Hi there -
>
> I'm working on an information retrieval application for a collection of Chinese documents, which appear to use a mix of traditional and simplified characters. My intuition is that it makes sense to do traditional to simplified folding for indexing and query processing (when the mapping is unambiguous), but I'd be interested in opinions about this.
>
> Second, I just noticed the kZVariant field in the Unihan.zip file. It seems to me that it makes sense to fold these together as well, correct?
>
> Thanks for any information you care to provide.
>
> - John Burger
> MITRE
>
Received on Fri Jun 07 2013 - 15:06:16 CDT
