Subject: IDS in Unihan

From: Mark Davis

Date: 2015-02-03

I would strongly support the addition of IDS to Unihan (http://www.unicode.org/L2/L2015/15065-ids-links.pdf). Among other things, it would be useful in developing data for UTS #39.

One of the key problems in developing confusability data is the N x N problem: in principle, every character has to be compared against every other character, which is especially onerous for Chinese. With the use of IDS, one can generate "rough" similarity metrics between characters. This does not directly generate confusability data, but it does allow the generation of small sets of characters that are potential candidates for confusion with one another. It is then feasible to have these small sets reviewed by human vetters, because the N x N comparison problem is drastically reduced.
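As an illustration only, here is a minimal Python sketch of how IDS data could be used to produce candidate pairs without comparing every character against every other. The function names, the Jaccard overlap measure, the 0.5 threshold, and the five-character sample map are all assumptions made for this example; none of them comes from UTS #39 or from the proposal itself.

```python
from collections import defaultdict
from itertools import combinations

# Ideographic Description Characters U+2FF0..U+2FFB (the layout operators).
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Tiny hand-picked sample for illustration; real input would be the full IDS data.
SAMPLE = {"明": "⿰日月", "朋": "⿰月月", "昌": "⿱日日", "林": "⿰木木", "好": "⿰女子"}

def components(ids: str) -> frozenset:
    """Leaf components of an IDS string, ignoring the layout operators."""
    return frozenset(ch for ch in ids if ch not in IDC)

def candidate_pairs(ids_map: dict, threshold: float = 0.5) -> set:
    """Return pairs of characters whose component sets overlap enough (Jaccard)."""
    comps = {ch: components(ids) for ch, ids in ids_map.items()}
    # Inverted index: component -> characters containing it.
    index = defaultdict(set)
    for ch, cs in comps.items():
        for c in cs:
            index[c].add(ch)
    # Only characters that share at least one component are ever compared.
    to_compare = set()
    for bucket in index.values():
        to_compare.update(combinations(sorted(bucket), 2))
    result = set()
    for a, b in to_compare:
        jaccard = len(comps[a] & comps[b]) / len(comps[a] | comps[b])
        if jaccard >= threshold:
            result.add((a, b))
    return result

pairs = candidate_pairs(SAMPLE)
```

With this sample, only 明/朋 and 明/昌 are proposed for review; 林 and 好 share no component with anything else and are never compared. The output is a list of candidates for human vetting, not confusability data.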

So having a reliable, publicly accessible source of IDS mappings for the Unihan characters would be very useful in that regard. (The rendering sophistication of systems like CDL would not be required, and CDL is also not publicly accessible.)

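As far as the similarity metrics go, those can be based on the similarity of components, but can also use normalization techniques, such as treating the following two sequences as identical for the purpose of comparison (both describe the same 2 x 2 arrangement, with x and z in the top row and y and w in the bottom row):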

⿱⿰xz⿰yw
⿰⿱xy⿱zw
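
One possible way to implement this kind of normalization, sketched in Python: parse each IDS into a tree, then rewrite every "left-right of two top-bottom pairs" into "top-bottom of two left-right pairs". The names `parse` and `normalize`, the tuple representation, and the choice of ⿱ as the outermost operator in the canonical form are assumptions made for this example. The sketch handles only the operators U+2FF0..U+2FFB.

```python
LR, TB = "\u2FF0", "\u2FF1"            # ⿰ (left-right), ⿱ (top-bottom)
TERNARY = {"\u2FF2", "\u2FF3"}         # ⿲ and ⿳ take three operands
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

def parse(ids):
    """Parse a prefix-notation IDS into nested tuples: (operator, child, ...)."""
    def node(i):
        ch = ids[i]
        arity = 3 if ch in TERNARY else 2 if ch in BINARY else 0
        if arity == 0:
            return ch, i + 1           # a leaf component
        children = []
        i += 1
        for _ in range(arity):
            child, i = node(i)
            children.append(child)
        return (ch, *children), i
    tree, end = node(0)
    if end != len(ids):
        raise ValueError("trailing characters in IDS")
    return tree

def normalize(tree):
    """Rewrite ⿰⿱ab⿱cd as ⿱⿰ac⿰bd everywhere in the tree."""
    if isinstance(tree, str):
        return tree
    op, *kids = tree
    kids = [normalize(k) for k in kids]
    if op == LR and all(isinstance(k, tuple) and k[0] == TB for k in kids):
        (_, a, b), (_, c, d) = kids
        # The new rows may themselves be 2 x 2 grids, so normalize them too.
        return (TB, normalize((LR, a, c)), normalize((LR, b, d)))
    return (op, *kids)

same = normalize(parse("⿱⿰xz⿰yw")) == normalize(parse("⿰⿱xy⿱zw"))
```

Here `same` is `True`, so the two sequences above compare equal after normalization, while sequences such as ⿰xy and ⿱xy remain distinct.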