From: George W Gerrity (g.gerrity@gwg-associates.com.au)
Date: Tue Feb 22 2005 - 08:10:49 CST
On 22 Feb 2005, at 16:18, Erik van der Poel wrote:
> OK, I think this latest flurry of emails is starting to form a picture
> of What We Must Do (re: nameprep, IDN spoofing and the registries):
>
> Basically, the work to install filters at the registries, and the work
> to write the next version of the nameprep spec can proceed in
> parallel, pretty much independently.
Yep.
> As George points out, the registries are going to have to start
> filtering IDN lookalikes, otherwise they will eventually face lawsuits
> from the "big boys" (as George so delightfully puts it). The ccTLDs
> will have a relatively easy task, while the gTLDs like .com will have
> the difficult task of deciding which subset of Unicode to allow.
I think I suggested that the ccTLDs should decide their own encodings
for the TLD tag, but I was wrong to separate that from the general
problem of what acceptable foldings/encodings should be applied to all
TLDs. In any case, the more difficult problems occur underneath.
> They will also have to go through their database, looking for
> lookalikes, and deleting them or transferring them to new owners,
> probably paying their previous owners back. The registrars might have
> to be involved in the money transaction too. What a mess. I don't envy
> the gTLDs. Maybe the Unicode Consortium could help them out by
> providing homograph tables.
Yes, but there shouldn't be too many problems yet, which is why the
sooner one gets going, the better.
> One possible approach for the gTLDs is to halt IDN registration now.
> Then they can work on their filters, starting with a small subset of
> Unicode. After reopening IDN registration, they can grow the subset if
> there is enough demand for characters outside the initial subset.
I don't think that it needs to be halted now. Just give them a quick
filter in Perl to sieve out names whose code points don't all come from
one script. The few that get caught can either be delayed until better
filters come along, or they can be handled on a once-off basis
according to guidelines that one can put into place pretty quickly.
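To make it concrete, a first cut at such a sieve might look like the
sketch below. This is only a sketch: the script list is illustrative,
and the decision to let digits and hyphens through alongside any
script is my own assumption, not anything standardised.

    #!/usr/bin/perl
    # Quick single-script sieve: pass a candidate label only if all
    # of its code points come from one script (digits and the hyphen
    # are allowed alongside any script).
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':utf8';

    my @scripts = qw(Latin Cyrillic Greek Arabic Hebrew Han);

    sub single_script_ok {
        my ($label) = @_;
        for my $script (@scripts) {
            return 1 if $label =~ /^(?:\p{Script=$script}|[0-9\-])+$/;
        }
        return 0;    # mixed script: hold for manual handling
    }

    for my $name ("example", "россия", "XML-россия") {
        printf "%s: %s\n", $name,
            single_script_ok($name) ? "pass" : "hold";
    }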
> If the gTLDs are going to do some serious subsetting, then they will
> probably also need to provide software to the registrars that will map
> users' characters into the subset. E.g. converting a user's local
> charset to the subset of Unicode.
At this point, I took off two hours to download and read the relevant
RFCs dealing with IDNA, as I haven't been following the TLD
standardisation process that carefully until now. Having done my
homework, I still fail to understand why it is up to IDNA to provide
mappings between local charsets and Unicode. These mappings are already
available, but are not one-to-one: some local codes have no equivalent
in Unicode, and vice-versa.
> Then again, this might be an area where registrars could compete with
> each other, to provide the most friendly software to the end-user
> (registrant).
Any mapping is at the registration level, including subsetting to
preclude spoofing. We imagine that mapping is many-to-one to yield a
“canonical name”, which is the one registered. A typical example of
such a mapping is the case-folding of ASCII names to all lower-case
(the “canonical mapping” result, which is what the DNS and security
certificates use). Basically, one submits a
proposed name for registration, and it is either accepted or refused.
If refused, two reasons can be given: a) the canonical form of the name
is already taken; or b) the name is not well-formed according to the
restrictions applied by subsetting algorithms designed to minimise
spoofing.
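As a toy illustration of that accept/refuse step: in the sketch below,
NFKC plus lower-casing stands in for whatever canonical mapping a
registry actually adopts, and %registry and is_well_formed() are
hypothetical stand-ins for the real database and filters.

    # Sketch of the registration decision: fold the proposed name to
    # its canonical form, then accept or refuse it.
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFKC);

    my %registry;    # canonical name => registrant

    sub is_well_formed {
        my ($canon) = @_;
        # Placeholder for the subsetting filters discussed below.
        return $canon =~ /^[\p{Alnum}\-]+$/;
    }

    sub register {
        my ($proposed, $registrant) = @_;
        my $canon = lc NFKC($proposed);    # the canonical mapping
        return "refused: not well-formed"
            unless is_well_formed($canon);
        return "refused: '$canon' already taken"
            if exists $registry{$canon};
        $registry{$canon} = $registrant;
        return "accepted as '$canon'";
    }

    print register("Example", "alice"), "\n";  # accepted as 'example'
    print register("EXAMPLE", "bob"), "\n";    # refused: already taken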
It is not apparent to me where the question of user interface design is
applicable, as all this happens “under the hood”, so to speak.
> On the other side, we have the nameprep spec, and the work required to
> rev it. As John Klensin points out in another email, nameprep will
> eventually have to be updated to include new Unicode characters.
> Nameprep specifies Unicode 3.2, but Unicode itself is already at
> 4.0.1, and may be even further along by the time we finish discussing
> and drafting nameprep bis (new version). Call it nameprep2.
Nameprep has nothing to do with the type of filtering we are
discussing. The registration process proceeds as follows:
                   1
 <Local_Charset> ----> <Unicode> --> <nameprep_vx> -->
     <Subsetting_Filters> --> <Punycode>
Point 1 may fail if there is no mapping: not too likely for names
people will want to use, but may fail for, say, Japanese formal names
with character variants that are still not completely encoded.
<nameprep_vx> may, besides mapping to the specified Unicode
normalisation, perform case-folding where appropriate. The algorithm
performing this operation really should never fail.
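In Perl terms, these first stages might look like the sketch below.
Encode and Unicode::Normalize are core modules; NFKC plus lower-casing
is only a rough stand-in for the real nameprep profile of RFC 3491,
which also applies prohibited-code-point checks.

    # Stage 1 and the <nameprep_vx> box: local charset -> Unicode ->
    # normalisation and case-folding.
    use strict;
    use warnings;
    use Encode qw(decode);
    use Unicode::Normalize qw(NFKC);

    sub to_unicode {
        my ($octets, $charset) = @_;
        # Point 1: dies if some octet has no Unicode mapping.
        return decode($charset, $octets, Encode::FB_CROAK);
    }

    sub nameprep_vx {
        my ($name) = @_;
        return lc NFKC($name);    # should never fail
    }

    my $name = nameprep_vx(to_unicode("Caf\xE9", "iso-8859-1"));
    print "$name\n";    # "café"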
The <Subsetting_Filters> are exactly those filters designed to reduce
the possible namespace to a tractable size, in which there will be no
names that are possible spoofs of other names. Some of them may be the
same for all TLDs, while others will be specific to a given TLD. The
ones that are the same for all TLDs can migrate into <nameprep_vx> at a
later date, if necessary.
Only those names getting through these filters will automatically be
considered as candidates for registration. Alternatively, the
<Subsetting_Filters> may have two outputs: those that get through the
strict rules, and those that get through a coarse sieve but not the
finer ones. Those that get through the coarse sieve would need to be
turned over to an expert in orthography for further study before
allowing registration.
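A two-output sieve could be as simple as the sketch below; the
specific rules are placeholders, and the point is only the three-way
result.

    # Two-output sieve: 'strict' passes go straight through,
    # 'review' passes cleared only the coarse sieve and go to an
    # orthography expert, and the rest are rejected.
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':utf8';

    sub sieve {
        my ($label) = @_;
        # Coarse sieve: letters, digits, and the hyphen only.
        return 'reject' unless $label =~ /^[\p{Alnum}\-]+$/;
        # Finer sieve: all code points from a single script.
        for my $script (qw(Latin Cyrillic Greek Han)) {
            return 'strict'
                if $label =~ /^(?:\p{Script=$script}|[0-9\-])+$/;
        }
        return 'review';    # mixed script: needs expert study
    }

    print "$_: ", sieve($_), "\n"
        for ("example", "XML-россия", "not/valid");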
> Now, one item that is clearly on nameprep2's table is the new version
> of Unicode. Another item that could be considered is the banning of
> slash '/' homographs and others.
<nameprep_vx> is not the place to deal with homographs, except perhaps
for the obvious, such as mapping full- or half-width numerals,
Cyrillic, Greek, and Latin characters to the equivalent homographs in
the normal script areas for them.
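(For what it's worth, the NFKC normalisation that nameprep already
specifies does the width folding: in Perl, NFKC("ＡＢＣ１２３") comes
out as "ABC123".)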
> This type of spoofing was recently discussed on the IDN list. Certain
> Unicode blocks, like the math characters,
The Math characters, yes.
> might also be banned instead of mapped as they are now.
Or we could wait and put them in the <Subsetting_Filters> for the
moment.
> And I'm sure we would discuss mapping or banning the homographs, such
> as Cyrillic small 'a'.
No. For the moment, this sort of homograph needs to be included in the
<Subsetting_Filters> area. Otherwise, we will be precluding any sort of
mixed-script names. The problem is not just that of identifying
homographs, but of also determining what is a homograph of what, and
where to distinguish. Thus, we want to ban Cyrillic homographs in the
“XML” portion of the mixed name “XML-россия”, and Latin homographs in
the Cyrillic part of the name. We also want to ban homographs from
other scripts, such as Greek or Coptic from either part. We can't do
that with an all-encompassing algorithm: it needs to be tailored to the
particular TLD.
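For instance, a per-TLD rule might split a label into script runs and
check each run against that script's confusable set. In the sketch
below, the %confusable sets are tiny stand-ins for a real homograph
table, of the kind the Unicode Consortium might supply.

    # Split a label into Latin/Cyrillic runs and flag any run made
    # up entirely of characters confusable with the other script.
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':utf8';

    my %confusable = (
        Cyrillic => qr/[аеорсух]/,   # look like Latin a e o p c y x
        Latin    => qr/[aceopxy]/,   # look like Cyrillic а с е о р х у
    );

    sub check_runs {
        my ($label) = @_;
        for my $run
            ($label =~ /(\p{Script=Latin}+|\p{Script=Cyrillic}+)/g) {
            my $script =
                $run =~ /\p{Script=Latin}/ ? 'Latin' : 'Cyrillic';
            if ($run =~ /^(?:$confusable{$script})+$/) {
                print "suspect $script run '$run' in '$label'\n";
            }
        }
    }

    check_runs("XML-россия");  # clean: neither run is all lookalikes
    check_runs("pаypаl");      # flags the Cyrillic 'а' runs (and the
                               # all-lookalike Latin runs around them)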
> A lot of this is likely to be controversial, and some people might
> suggest that we leave the subsetting to the registries, since they
> have to do it anyway. So, instead of shrinking the character set,
> nameprep2 might just grow it (for the new version of Unicode). I don't
> know. We'll see.
The controversial bits belong in the <Subsetting_Filters> component,
and probably will be local. Those parts that are going to be global
belong (ultimately) in <nameprep_vx>.
> I'm not sure whether we would need a new ACE prefix if we are only
> adding characters (and not removing any). I'm too tired right now to
> think about it.
Why would one be needed? The problem of backward compatibility won't
occur for additions, and where new rules subset the name space for all
TLDs, the pruning of previously legitimate names will have to occur
anyway. BTW, it might
make sense to add a first registration date to a name, like in
copyright, so that names that are pruned are those registered after the
original one of the lookalike set.
George