Hi Roger,
The situation is complex. Few applications and web services bother with
normalisation, so what you get, i.e. NFC, NFD or something else, often
depends on which language you are using and which input framework you are using.
Some keyboard layouts will produce NFC output.
Some keyboard layouts will produce neither NFC nor NFD.
Some keyboard layouts will produce NFD.
Some keyboard layouts may produce NFD if the typist enters the characters
in the right order, if the language uses multiple combining diacritics and
some of those combining diacritics do not interact typographically.
You need very specific input frameworks supporting constraints and
reordering to guarantee either NFC or NFD for some languages.
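To make the ordering point concrete, here is a minimal Python sketch using the standard unicodedata module. The helper names is_canonically_ordered and reorder_marks are my own, purely for illustration. The second function performs only the mark-reordering step, not decomposition, so on its own it is not a full NFD implementation.

```python
import unicodedata

def is_canonically_ordered(text):
    """True if each run of combining marks has non-decreasing combining class."""
    previous = 0
    for ch in text:
        ccc = unicodedata.combining(ch)  # 0 for base characters (starters)
        if ccc != 0 and ccc < previous:
            return False
        previous = ccc
    return True

def reorder_marks(text):
    """Stable-sort each run of combining marks by canonical combining class.

    Illustrative only: this is the reordering step an input framework could
    apply; it does not decompose precomposed characters.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# e + circumflex (class 230) + dot below (class 220): typed "above mark first"
typed = "e\u0302\u0323"
fixed = reorder_marks(typed)
```

Here the circumflex and the dot below do not interact typographically, so a typist can enter them in either order and see the same glyph, but only the order with the dot below first is canonically ordered. Marks of the same class, such as two accents above, are left in the order typed.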
And for some languages, different keyboard layouts will produce different
output. For example, some Vietnamese input tools produce NFC, while others
produce neither NFC nor NFD.
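As a rough illustration of the Vietnamese case, the following Python sketch checks which forms a string already satisfies. The function name classify and the four sample encodings of the letter e with circumflex and dot below are my own choices for the example, not the output of any particular input tool.

```python
import unicodedata

def classify(text):
    """Return the set of forms (NFC, NFD) that text is already normalised to."""
    return {
        form
        for form in ("NFC", "NFD")
        if unicodedata.normalize(form, text) == text
    }

precomposed = "\u1ec7"         # single code point
decomposed = "e\u0323\u0302"   # e + dot below + circumflex
mixed = "\u00ea\u0323"         # e-circumflex + combining dot below
typed_order = "e\u0302\u0323"  # e + circumflex + dot below

for sample in (precomposed, decomposed, mixed, typed_order):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in sample)
    print(code_points, sorted(classify(sample)))
```

All four sequences display as the same letter. The first is NFC only, the second is NFD only, and the last two satisfy neither form, which is the kind of output a layout without constraints or reordering can produce.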
Library data is also problematic. Some ILMS will output NFC, but this is
not the norm. Usually they will leave the data in its internal format. For
MARC21, the character repertoire taken as a whole will produce data that is
neither NFC nor NFD, but if you look at subsets of data by language, a lot
of the data is effectively NFD. But not all.
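If you want to see what a given export actually contains, one approach is to tally field values by form. This is only a sketch in Python: form_label and tally_forms are hypothetical helper names, and the sample strings are made up for the example rather than taken from real MARC21 records.

```python
import unicodedata
from collections import Counter

def form_label(text):
    """Label text as 'both', 'NFC', 'NFD' or 'neither'."""
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    if is_nfc and is_nfd:
        return "both"  # e.g. plain ASCII
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"

def tally_forms(fields):
    """Count how many field values fall under each label."""
    return Counter(form_label(value) for value in fields)

# Invented sample values, not real catalogue data.
sample = [
    "Hanoi",                       # ASCII only
    "Ha\u0300 N\u00f4\u0323i",     # mixed precomposed and combining marks
    "Ha\u0300 No\u0323\u0302i",    # fully decomposed, canonical order
    "H\u00e0 N\u1ed9i",            # fully precomposed
]
counts = tally_forms(sample)
```

Running such a tally per language subset, rather than over the whole file, is one way to see the "effectively NFD, but not all" pattern in your own data.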
Andrew
On Feb 2, 2013 1:19 AM, "Costello, Roger L." <costello_at_mitre.org> wrote:
> Hi Folks,
>
> The W3C recommends [1] text sent out over the Internet be in Normalized
> Form C (NFC):
>
> This document therefore chooses NFC as the
> base for Web-related early normalization.
>
> So why would one ever generate text in decomposed form (NFD)?
>
> Do any programming languages output text in NFD? Does Java? Python? C#?
> Perl? JavaScript?
>
> Do any tools produce text in NFD?
>
> Should I assume that any text my applications receive will always be
> normalized to NFC form?
>
> Is NFD dead?
>
> /Roger
>
> [1] http://www.w3.org/TR/charmod-norm/#sec-ChoiceNFC
>
>
>
Received on Sat Feb 02 2013 - 01:17:27 CST