L2/15-070
Subject: IDS in Unihan
From: Mark Davis
Date: 2015-02-03

One of the key problems in developing confusability data is the N x N
problem: in principle, every character has to be compared against every
other character, which is especially onerous for Chinese. With the use of
IDS, one can generate "rough" similarity metrics between characters. This
does not directly generate confusability data, but it does allow the
generation of small sets of characters that are candidates for confusion
with one another. It is then feasible to have these small sets reviewed by
human vetters, because the N x N comparison problem is drastically reduced.

A reliable, publicly accessible source of IDS mappings for the Unihan
characters would therefore be very useful in that regard. (The rendering
sophistication of systems like CDL would not be required, and CDL is also
not publicly accessible.)
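
As a rough illustration of how IDS mappings could shrink the N x N problem, the following Python sketch indexes characters by their IDS components and scores only those pairs that share at least one component. It is a minimal sketch under stated assumptions, not a proposed implementation: the SAMPLE_IDS table is a tiny hand-entered sample rather than real Unihan data, the function names are hypothetical, and Jaccard overlap of component sets is only one possible "rough" metric.

```python
from collections import defaultdict
from itertools import combinations

# Ideographic Description Characters U+2FF0..U+2FFB (layout operators).
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Assumption: a tiny hand-entered sample, standing in for a real IDS source.
SAMPLE_IDS = {
    "明": "⿰日月",
    "朋": "⿰月月",
    "林": "⿰木木",
    "森": "⿱木⿰木木",
    "休": "⿰亻木",
    "体": "⿰亻本",
}


def components(ids):
    """Return the set of components in an IDS, ignoring layout operators."""
    return frozenset(ch for ch in ids if ch not in IDCS)


def jaccard(a, b):
    """Rough similarity: shared components over all components."""
    return len(a & b) / len(a | b)


def candidate_pairs(ids_map, threshold=0.5):
    """Score only pairs that share a component, not all N x N pairs."""
    # Inverted index: component -> characters containing it.
    index = defaultdict(set)
    for char, ids in ids_map.items():
        for comp in components(ids):
            index[comp].add(char)

    # Collect each pair that co-occurs in at least one bucket.
    seen = set()
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            seen.add((a, b))

    # Keep the pairs whose rough similarity reaches the threshold.
    result = []
    for a, b in sorted(seen):
        score = jaccard(components(ids_map[a]), components(ids_map[b]))
        if score >= threshold:
            result.append((a, b, score))
    return result


if __name__ == "__main__":
    for a, b, score in candidate_pairs(SAMPLE_IDS):
        print(a, b, round(score, 2))
```

On this sample, only 5 of the 15 possible pairs share a component, and only those 5 are ever scored. The surviving pairs would then go to human vetters; the threshold and the metric itself are tuning choices, not part of IDS.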

As for the similarity metrics, they can be based on the similarity of
components, and can also use normalization techniques, such as treating the
following as identical for the purpose of comparison: