NFC in HTML5 (was: RE: Slots for Cyrillic Accented Vowels)

From: Phillips, Addison (addison@lab126.com)
Date: Thu May 26 2011 - 11:31:09 CDT


    >
    > On Wed, 25 May 2011, Mark Davis wrote:
    >
    > > But that needs to be distinguished from saying that there is something
    > > wrong with 20B9 or with NFC.
    >
    > I did not write that there is something wrong with NFC.
    >
    > I complained that HTML5 or validator http://validator.w3.org/
    > *requires* NFC.
    > This might be a bug in the validator and not actually a requirement of HTML5.
    >
    > Validate my test page
    > http://www.user.uni-hannover.de/nhtcapri/temp/yerushalayim.html
    >

    I believe the W3C I18N WG does not think it is a good idea for HTML5 to require NFC, and I'm not aware of any normative language in the HTML5 spec that requires it. This page [1] suggests that the normalizer was added to the validator in response to Charmod-Norm (which does not actually require NFC). If someone has a pointer to an NFC requirement in HTML5, I would appreciate it if you could forward it to www-international@w3.org (or to me privately if you prefer).
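
    For readers who want to see what an "is this content in NFC?" check amounts to, here is a minimal sketch in Python. It is an illustration only, not the validator's actual implementation; the function name is_nfc is hypothetical, and the only library call used is the standard unicodedata.normalize.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by normalization to Form C.

    Hypothetical helper for illustration; not taken from any validator.
    """
    return unicodedata.normalize("NFC", text) == text


composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(is_nfc(composed))    # the precomposed form is already in NFC
print(is_nfc(decomposed))  # the decomposed form is not
```

    A tool that rejects the second string is enforcing NFC; a tool that merely warns about it is closer to the "should" language described below.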

    It should be noted that the W3C I18N WG for many years promoted the idea of "early uniform normalization" (EUN) using Form C. However, by the time Charmod was published back in 2005, the WG had decided that EUN was impossible to support. The requirements in the Charmod-Norm working draft were relaxed and the position of the WG was that content authors "should" use NFC when possible and that specifications (like HTML5) would therefore have to deal with non-normalized content ("late normalization"). The current HTML5 prototype validator would thus not be consistent with Charmod-Norm.

    Most recently (as in: last week [2]) the I18N WG reopened its work on normalization, since we are getting requests from HTML, CSS, and elsewhere dealing with the problems inherent in late normalization. The current posture of the WG (subject to change) is:

    - content authors are advised that they should use a consistent form for content or risk problems with matching, etc.
    - content should use NFC where reasonable, especially for identifiers, namespaces, variables, etc. that are not visible to users. Content should not be automatically normalized to NFC by tools except as a by-product of normal operations (transcoding, etc.)
    - comparison and string matching functions in specifications (such as CSS Selectors) should compare strings canonically (that is, two strings are considered equal iff their canonical decompositions are equal; either Form C or Form D can be used to achieve this)
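
    The canonical comparison in the last item can be sketched as follows. This is a minimal Python illustration under stated assumptions, not text from any specification: the function name canonically_equal is hypothetical, and it compares the Form D results, which is one of the two options mentioned above.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings by their canonical decompositions (NFD).

    Hypothetical helper for illustration. Normalizing both sides to NFC
    instead would give the same answer for canonical equivalence.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Precomposed vs. decomposed e-acute: different code points, same text.
print(canonically_equal("\u00e9", "e\u0301"))

# Combining marks of different combining classes typed in different orders
# (dot below, then acute vs. acute, then dot below) are also equivalent.
print(canonically_equal("a\u0323\u0301", "a\u0301\u0323"))

# A plain "e" is not canonically equivalent to e-acute.
print(canonically_equal("e", "\u00e9"))
```

    Note that the inputs are only normalized for the comparison; the stored content is left as the author wrote it, which is the point of "late normalization".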

    This update (which does not actually mark a significant departure from the intent of Charmod-Norm) will take shape on a working group wiki page over the coming weeks. Contributions, comments, and reviews are invited on www-international@w3.org and/or our public WG list. I haven't posted the first draft of the wiki page yet, so I can't point to it, but I will do so as this effort evolves.

    Regards,

    Addison

    Addison Phillips
    Globalization Architect (Lab126)
    Chair (W3C I18N WG)

    Internationalization is not a feature.
    It is an architecture.

    [1] http://blog.whatwg.org/charmod-norm-checking
    [2] http://www.w3.org/2011/05/25-i18n-minutes.html
         http://www.w3.org/2011/05/18-i18n-minutes.html


