Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Jan 28 2005 - 12:53:05 CST

Next message: Jon Hanna: "RE: Subj: Scotland"

Previous message: Magda Danish (Unicode): "FW: Subj: Scotland"
In reply to: Simon Josefsson: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Next in thread: Simon Josefsson: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Reply: Simon Josefsson: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

Simon Josefsson wrote:
> There is an online interface to one such implementation at
> <http://josefsson.org/idn.php>, although I would argue that it is
> correct, and not broken, at least until StringPrep/IDN is updated to
> handle this issue.

I would like to offer two online demos using ICU:

IDNA Demo: http://oss.software.ibm.com/cgi-bin/icu/idnademo
Normalization Browser: http://oss.software.ibm.com/cgi-bin/icu/nbrowser

The interesting thing here is that ICU's IDN implementation was modified, in response to the earlier
discussion on this list, to use the broken NFKC implementation, while ICU's normalization API
provides the fixed implementation according to the corrigendum (and the sample code).

Try
http://oss.software.ibm.com/cgi-bin/icu/idnademo?t=%5Cu1100%5Cu0300%5Cu1161
vs.
http://oss.software.ibm.com/cgi-bin/icu/nbrowser?t=&s=1100+0300+1161&uv=0

How does this work? Well, I added a hidden flag in an internal header that selects between the
two... There was already a way to select between Unicode 3.2 normalization (for StringPrep/IDNA) and
the current-Unicode normalization (Unicode 4.0.1 in ICU 3.2).

Note: Unfortunately, the URLs above will change within about a month.
They should become available at http://ibm.com/software/globalization/icu/chartsdemostools.jsp

> It would be interesting to find out what percentage of the problem
> sequences are unstable under NFKC.

This might be difficult: There are infinitely many such sequences, since there can be more than
one combining mark between the wrongly composing characters. Any comparison would be like asking
how many even numbers there are compared to all integers.

I propose that
1. Domain name registrars test new registrations for problematic
    domain names and reject them, ASAP.
    For example, ICU's internal flag could be used to normalize
    a string twice and check for differences.
2. Domain names that have already been registered be checked
    for problematic strings.

Proposal 1 ensures that the problem does not grow, as far as domain names are concerned.
I predict that proposal 2 will produce an empty set.
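
A minimal sketch of what such a registrar-side check could look like, in Python rather than ICU. It does not use ICU's internal flag. Instead it scans the NFKC form of a string for the Hangul shape of the problem (a leading consonant, or an LV syllable, followed by one or more combining marks and then a vowel or trailing consonant that would otherwise compose). The function name is hypothetical, only the Hangul case is covered (the non-Hangul problem sequences are not listed here), and it assumes that Python's unicodedata module follows the corrected algorithm.

```python
import unicodedata

# Hangul ranges that take part in algorithmic composition.
L_FIRST, L_LAST = 0x1100, 0x1112   # leading consonants (L)
V_FIRST, V_LAST = 0x1161, 0x1175   # vowels (V)
T_FIRST, T_LAST = 0x11A8, 0x11C2   # trailing consonants (T)
S_FIRST, S_LAST, T_COUNT = 0xAC00, 0xD7A3, 28  # precomposed syllables


def _is_lv_syllable(cp):
    """True for a precomposed syllable with no trailing consonant."""
    return S_FIRST <= cp <= S_LAST and (cp - S_FIRST) % T_COUNT == 0


def has_hangul_problem_sequence(text):
    """Hypothetical helper: flag L + marks + V, or LV + marks + T.

    Assumption: unicodedata.normalize implements the corrected
    algorithm, so the intervening marks block the composition.
    """
    cps = [ord(c) for c in unicodedata.normalize("NFKC", text)]
    for i, first in enumerate(cps):
        if L_FIRST <= first <= L_LAST:
            lo, hi = V_FIRST, V_LAST
        elif _is_lv_syllable(first):
            lo, hi = T_FIRST, T_LAST
        else:
            continue
        # Skip the run of combining marks (non-zero combining class).
        j = i + 1
        while j < len(cps) and unicodedata.combining(chr(cps[j])) != 0:
            j += 1
        # At least one mark, then a character that would compose with 'first'.
        if j > i + 1 and j < len(cps) and lo <= cps[j] <= hi:
            return True
    return False
```

A registrar could call has_hangul_problem_sequence() on each label and reject any label for which it returns True, for example the sequence U+1100 U+0300 U+1161 from the demo URLs above. Because the loop skips any number of combining marks, it also covers the case of more than one mark between the two composing characters.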

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

This archive was generated by hypermail 2.1.5 : Fri Jan 28 2005 - 12:59:18 CST