Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Jan 28 2005 - 12:53:05 CST


    Simon Josefsson wrote:
    > There is an online interface to one such implementation at
    > <http://josefsson.org/idn.php>, although I would argue that it is
    > correct, and not broken, at least until StringPrep/IDN is updated to
    > handle this issue.

    I would like to offer two online demos using ICU:

    IDNA Demo: http://oss.software.ibm.com/cgi-bin/icu/idnademo
    Normalization Browser: http://oss.software.ibm.com/cgi-bin/icu/nbrowser

    The interesting thing here is that ICU's IDN implementation was modified, in response to the earlier
    discussion on this list, to use the broken NFKC implementation, while ICU's normalization API
    provides the fixed implementation according to the corrigendum (and the sample code).

    Try
    http://oss.software.ibm.com/cgi-bin/icu/idnademo?t=%5Cu1100%5Cu0300%5Cu1161
    vs.
    http://oss.software.ibm.com/cgi-bin/icu/nbrowser?t=&s=1100+0300+1161&uv=0

    How does this work? Well, I added a hidden flag in an internal header that selects between the
    two... There was already a way to select between Unicode 3.2 normalization (for StringPrep/IDNA) and
    the current-Unicode normalization (Unicode 4.0.1 in ICU 3.2).

    Note: Unfortunately, the URLs above will change within about a month.
    They should become available at http://ibm.com/software/globalization/icu/chartsdemostools.jsp

    > It would be interesting to find out what percentage of the problem
    > sequences are unstable under NFKC.

    This might be difficult: There are infinitely many such sequences, since there can be more than
    one combining mark between the wrongly composing characters. The comparison would be like asking
    how many even numbers there are compared to all integers.
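
    To illustrate why the set is unbounded, here is a minimal sketch in Python (not ICU). The
    helper name problem_family is hypothetical, and Python's unicodedata module, which uses its
    own bundled Unicode version rather than Unicode 3.2, stands in for the corrected normalization.
    Each generated string is U+1100, then some number of U+0300 marks, then U+1161. A corrected
    implementation leaves every one of them uncomposed, however many marks are inserted.

```python
import unicodedata


def problem_family(count, mark="\u0300"):
    """Yield U+1100, then 1..count copies of `mark`, then U+1161.

    Hypothetical helper: each string has the same shape as the example
    sequence 1100 0300 1161, only with more combining marks in between.
    """
    for n in range(1, count + 1):
        yield "\u1100" + mark * n + "\u1161"


if __name__ == "__main__":
    for s in problem_family(3):
        fixed = unicodedata.normalize("NFKC", s)
        code_points = " ".join(f"{ord(c):04X}" for c in s)
        # Report whether the corrected normalization left the string as-is.
        print(code_points, "unchanged" if fixed == s else "changed")
```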

    I propose that
    1. Domain name registrars test new registrations for problematic
        domain names and reject them, ASAP.
        For example, ICU's internal flag could be used to normalize
        a string twice and check for differences.
    2. Domain names that have already been registered be checked
        for problematic strings.
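
    The check in point 1 could be approximated without access to ICU's internal flag. The
    following is a minimal sketch in Python, under stated assumptions: unicodedata.normalize
    stands in for the corrected algorithm (it uses Python's bundled Unicode data, not the
    Unicode 3.2 data that StringPrep requires), and the function name find_problem_sequences is
    hypothetical. Instead of normalizing twice with two implementations, it looks in the
    normalized text for the shape starter + combining mark(s) + starter where the two starters
    would compose if the marks between them were ignored. This is a heuristic, not a complete
    test.

```python
import unicodedata


def find_problem_sequences(text, form="NFKC"):
    """Return (start, end) index pairs into the normalized text.

    Each pair marks a starter, one or more combining marks, and another
    starter, where the two starters alone would compose into one character.
    """
    normalized = unicodedata.normalize(form, text)
    hits = []
    last_starter = None   # index of the most recent character with ccc == 0
    marks_seen = False    # any combining mark since that starter?
    for i, ch in enumerate(normalized):
        if unicodedata.combining(ch) != 0:
            marks_seen = True
            continue
        if last_starter is not None and marks_seen:
            pair = normalized[last_starter] + ch
            # The pair composes if NFC turns it into a single code point.
            if len(unicodedata.normalize("NFC", pair)) == 1:
                hits.append((last_starter, i))
        last_starter = i
        marks_seen = False
    return hits


if __name__ == "__main__":
    label = "\u1100\u0300\u1161"
    if find_problem_sequences(label):
        print("reject: contains a problematic sequence")
    else:
        print("accept")
```

    A registrar-side filter along these lines would reject a label whenever the returned list is
    non-empty. Labels without combining marks between composable starters pass through untouched.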

    Number 1 ensures that the problem does not grow, as far as domain names are concerned.
    I predict that number 2 will produce an empty set.

    Best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

