From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Mar 23 2005 - 12:08:25 CST
Various groups are actively working on the security issues involved in
internationalized domain names (IDNs). The actual scope of the work is larger, since it
may affect core standards used for other types of identifiers, such as
networked filenames, and so on. Several of the possible approaches to the
security issues limit the allowable characters in one way or another. There is
a chart as part of the current draft of TR36 that shows a breakdown of
currently-allowed characters:
http://unicode.org/reports/tr36/draft/idn-chars.html . (Remember that IDNs
are currently restricted to Unicode 3.2 characters.)
Some issues that I'd like broader feedback on:
A. Currently, compatibility decomposables are mapped to their NFKC form. So
if you type in a half-width katakana character, it will map to the ordinary
(full-width) katakana character. There is a proposal to simply forbid
compatibility decomposables instead of mapping them. Is this acceptable
(e.g., in Japan)?
B. There is a proposal to restrict the characters to "LDH" characters
(letters, digits, and hyphen). The closest thing we have in Unicode to that
is the XID_Continue property, so the above chart separates characters out on
that basis. The question is, are there any characters classed there under
"Non-ID" that really should be allowed? (Example: U+05F4 ( ״ ) GERSHAYIM?)
B1. Should all of the characters permitted in words in
http://www.unicode.org/reports/tr29/tr29-8.html qualify?
C. Characters with no uppercase in bicameral scripts may be suspect, and
could be disallowed or flagged. Which of these really need to be allowed? (Example:
U+04C0 ( Ӏ ) PALOCHKA?)
D. The main focus is on characters in modern use. Is there any data that
would let us separate out non-modern-use characters, at least for flagging?
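For anyone who wants to experiment with questions A through C, here is a
rough Python sketch of the kinds of checks involved. It is only an
illustration under several assumptions, not what TR36 or IDNA specifies.
All function names are hypothetical. The results reflect whatever Unicode
version your Python's unicodedata module ships with, not Unicode 3.2.
str.isidentifier() is used as a stand-in for XID_Continue. The case check
only looks at lowercase letters (Ll) that have no uppercase mapping, which
is narrower than the question in C.

```python
import unicodedata


def map_compat(label: str) -> str:
    """Question A, current behaviour: fold the label to its NFKC form."""
    return unicodedata.normalize("NFKC", label)


def has_compat_decomposable(label: str) -> bool:
    """Question A, proposed alternative: detect characters to forbid.

    unicodedata.decomposition() prefixes compatibility mappings with a
    tag such as "<narrow>" or "<wide>"; canonical mappings carry no tag.
    """
    return any(unicodedata.decomposition(ch).startswith("<") for ch in label)


def is_xid_continue_like(ch: str) -> bool:
    """Question B: approximate XID_Continue via Python's identifier rules.

    Assumption: "a" + ch is a valid Python identifier exactly when ch is
    acceptable as an identifier-continue character.
    """
    return ("a" + ch).isidentifier()


def is_lowercase_without_uppercase(ch: str) -> bool:
    """Question C (narrow form): a lowercase letter with no uppercase mapping."""
    return unicodedata.category(ch) == "Ll" and ch.upper() == ch


if __name__ == "__main__":
    halfwidth_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA
    print("NFKC of U+FF76:", [f"U+{ord(c):04X}" for c in map_compat(halfwidth_ka)])
    print("would be forbidden:", has_compat_decomposable(halfwidth_ka))

    for ch in ("\u05F4", "-", "9"):  # gershayim, hyphen, digit
        print(f"U+{ord(ch):04X} identifier-continue-like:", is_xid_continue_like(ch))

    kra = "\u0138"  # LATIN SMALL LETTER KRA
    print("U+0138 lowercase without uppercase:", is_lowercase_without_uppercase(kra))
```

Note that the hyphen fails the identifier check, so a purely
XID_Continue-based rule would have to add it back to cover "LDH".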
Mark