From: Neil Harris (neil@tonal.clara.co.uk)
Date: Thu Oct 05 2006 - 18:51:29 CST
Jefsey_Morfin wrote:
> There is a confusion between the need (IDN) and a solution (IETF
> IDNA). Regulating the need will not correct the lacks of the solution.
> The solution must be fool proof. How? In having all the confusive
> strings being converted into the same ACE ("xn--" ASCII equivalence).
>
> Is it possible? Yes. A grapheme is a graphic concept which can be
> mathematically documented. The problem is that Unicode assigns numbers
> to these concepts in a polyonomous manner. So we need a
> Unicode/Grapheme table. Either in comparing the characters' mathematic
> descriptions through their integrals (graphemes). Or in capitalizing
> on experience. To obtain a table of characters graphic families (
> another way to list graphemes).
>
> A "super punycode" version will use this table to transcode in the
> same ASCII sequence all the characters of the same family. This will
> remove none of the possibilities of the current solution, but it will
> prevent two different ACE from being seen in the same way. Because
> there will be only one possible ACE possible. This will not reduce the
> possibility of that ACE to fully support all the existing confusive
> labels. The disadvantages of not using that "super punycode" function
> will probably make it used quickly. The drawback is that some existing
> names may be confused with other names. This is why the need is urgent
> (there is a limited number of IDNs and not many confusive ones
> [confusive labels are at higher levels]). If this was a real
> difficulty, the proposed solution is to use another prefix than
> "xn--" (this would help addressing another type of problem).
>
> jfc
>
>
I've actually written code to try to work out homograph resemblances.
It's harder than you might think, and brings up a huge range of problems
related to visual perception.
Graphemes are actually rather hard to define, if anything harder to
define than characters. Consider the huge differences in letterforms
between fonts that are in widespread use, and then consider comparing
them across hundreds of font variants in dozens of writing systems --
and that's before you even start to consider Chinese, where
confusables can occur for cultural, not visual, reasons. Douglas
Hofstadter devoted a lot of time to thinking about and demonstrating the
possibilities of this in his book "Fluid Concepts and Creative Analogies".
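To make the table-driven idea from the quoted proposal concrete, here is a
minimal sketch, not the proposal's actual design. The FAMILY table and the
helper names (skeleton, to_ace, family_ace) are hypothetical; the table holds
a handful of hand-picked Cyrillic and Greek lookalikes rather than any
standard data set, and the nameprep step that real IDNA applies before
Punycode is skipped entirely.

```python
# Hypothetical "graphic family" table: each character maps to one
# representative of its visual family. Hand-picked for illustration only.
FAMILY = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(label: str) -> str:
    """Replace every character by its family representative."""
    return "".join(FAMILY.get(ch, ch) for ch in label.lower())


def to_ace(label: str) -> str:
    """Simplified ACE form: Punycode plus the "xn--" prefix (no nameprep)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def family_ace(label: str) -> str:
    """ACE form computed after collapsing lookalikes, as the proposal suggests."""
    return to_ace(skeleton(label))


latin = "paypal"
mixed = "p\u0430yp\u0430l"  # both "a"s are Cyrillic

print(to_ace(latin) == to_ace(mixed))          # False: two distinct ACE strings
print(family_ace(latin) == family_ace(mixed))  # True: one ACE per family
```

The encoding step is the easy part. Everything difficult is hidden in FAMILY,
which has to record a single yes/no resemblance judgment per character pair
regardless of font, script or reader.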
However, there's a bigger issue: even if your approach were to be the One
True Way for doing IDN, I wouldn't hold your breath waiting for it to
happen. The IDN process has already taken more than five years, there
are over twenty live IDN-enabled domains in operation [1], and the big
three browsers, and increasingly other software, can now all support
the current IDN implementation. Changing it now would make turning a
supertanker around look easy by comparison. [2]
-- Neil
[1] See
http://www.mozilla.org/projects/security/tld-idn-policy-list.html for a
partial list of IDN-enabled domains
[2] not to mention the other exciting technical issues that would be
involved in ever changing the ACE prefix