RE: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Phillips, Addison <addison_at_lab126.com>
Date: Fri, 1 Feb 2013 08:17:15 -0800

Hi Roger,

(This is a personal response, with chair hat off)

It is very useful to read the big yellow box at the start of that document, which says:

--
This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or replace the recommendations found here with very different recommendations in the near future. Other than this note, this Working Draft is identical to the draft of 2005-10-27.
--
Although the W3C will continue to recommend generating and exchanging data in form NFC where appropriate and for consistency's sake, Early Uniform Normalization cannot be assumed. There are well documented cases of, for example, keyboards that generate de-normalized sequences, filesystems that use other forms, or data that is used to generate content that is not normalized. This content enters the Web in a denormalized state.
Some of the history is documented here: http://www.w3.org/International/wiki/NormalizationProposal
Note well that the above wiki page is *also* not indicative of the current state of the Internationalization Working Group's thinking (it was a proposal discussed last year). The current consensus is that early uniform normalization is not required for the generation of content, that "late normalization" (when comparing strings) is also not required, and that both of these cases are ingrained in the fabric of Web technologies in a way that makes it difficult to change them. Thus, content authors and users are cautioned to use *consistent* character sequences in their content, with NFC being generally recommended as one way to ensure this.
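
To make "late normalization" concrete, here is a minimal Python sketch of comparing two strings only after bringing both to NFC. It is an illustration under my own assumptions, not something taken from the W3C drafts: the helper name `nfc_equal` and the sample strings are made up for the example, while `unicodedata.normalize` is the standard-library call.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"      # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends in 'e' followed by U+0301 COMBINING ACUTE ACCENT

raw_equal = composed == decomposed             # plain code point comparison
late_equal = nfc_equal(composed, decomposed)   # comparison after normalization
```

The two strings render identically, but `raw_equal` is false while `late_equal` is true. Since a comparison step like this is not required anywhere, keeping the character sequences consistent at the source is the safer habit.
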
In point of fact, for most languages in most scripts, content tends to be in form NFC. But you can't count on it. And far from being dead, other normalization forms like NFD are useful for various kinds of processing. 
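
As one hedged example of the kind of processing where NFD helps: decomposing first makes it easy to separate base letters from combining marks, for instance for loose, accent-insensitive matching. The sketch below assumes Python; `strip_marks` is a hypothetical helper, and whether dropping marks is acceptable at all depends on the language and the application.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop characters with a nonzero
    canonical combining class (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

folded = strip_marks("r\u00e9sum\u00e9")   # input uses precomposed U+00E9
```

Applied to NFC text directly, the filter would find nothing to remove, because each accented letter is a single precomposed code point there. The decomposition step is what exposes the marks.
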

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On
> Behalf Of Costello, Roger L.
> Sent: Friday, February 01, 2013 6:07 AM
> To: unicode_at_unicode.org
> Subject: Text in composed normalized form is king, right? Does anyone
> generate text in decomposed normalized form?
> 
> Hi Folks,
> 
> The W3C recommends [1] text sent out over the Internet be in Normalized
> Form C (NFC):
> 
>     This document therefore chooses NFC as the
>     base for Web-related early normalization.
> 
> So why would one ever generate text in decomposed form (NFD)?
> 
> Do any programming languages output text in NFD? Does Java? Python? C#?
> Perl? JavaScript?
> 
> Do any tools produce text in NFD?
> 
> Should I assume that any text my applications receive will always be
> normalized to NFC form?
> 
> Is NFD dead?
> 
> /Roger
> 
> [1] http://www.w3.org/TR/charmod-norm/#sec-ChoiceNFC
> 
Received on Fri Feb 01 2013 - 12:47:07 CST
