From: Peter Constable (petercon@microsoft.com)
Date: Wed May 25 2011 - 02:12:54 CDT
Argghh! I wrote “Uniscribe normalization”, which is not a well defined concept. I meant “Unicode normalization”.
Peter
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Peter Constable
Sent: Tuesday, May 24, 2011 5:27 PM
To: Unicode Discussion
Subject: RE: Slots for Cyrillic Accented Vowels
Uniscribe normalization is reasonably robust for Latin, Greek and Cyrillic. But it’s simply a fact that NFC normalization can have undesirable effects on various other scripts. In particular, the canonical ordering algorithm used in Unicode normalization can be a problem for various scripts. For example, in Biblical Hebrew, marks will get re-ordered into a sequence that is decidedly not what makes sense for users; the set of general classes (>= 200) and fixed-position classes (< 200) used for Hebrew leads to that result. There are issues for other scripts as well.
These are issues inherent to normalization itself, regardless of the software in use. In those cases, Roozbeh’s point applies: emitting NFC “into the wild” can be as much of a problem as emitting NFD.
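The Hebrew reordering described above can be reproduced with Python's standard unicodedata module. This is only a sketch: the typed order BET, DAGESH, PATAH is an assumed example, not one taken from the thread, and the helper name describe is hypothetical.

```python
import unicodedata

# Assumed typing order, for illustration only: BET, DAGESH, PATAH.
typed = "\u05D1\u05BC\u05B7"

def describe(text):
    """Return (character name, canonical combining class) for each character."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]

nfc = unicodedata.normalize("NFC", typed)

# Canonical ordering sorts adjacent combining marks by combining class,
# so the lower-class vowel point is moved ahead of the dagesh.
for label, text in (("typed", typed), ("NFC", nfc)):
    print(label, describe(text))

print("order changed by NFC:", nfc != typed)
```

The two marks have different fixed-position combining classes, so normalization is free to reorder them even though no composition takes place.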
The only places where Unicode normalization is totally safe are those places for which it was created: not transforming data that will get persisted or transmitted to other users and processes, but in internal processing for comparing strings for the kinds of equivalences that Unicode normalization defines.
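As a rough sketch of that "internal comparison only" usage, the following normalizes temporary copies for the comparison and leaves the stored text untouched. The function name canonically_equal and the sample strings are assumptions made for illustration.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence.

    Only temporary normalized copies are compared; neither input is modified.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Stored data stays in whatever form the user or upstream process produced.
stored = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
query = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE (precomposed)

print(stored == query)                   # code point comparison
print(canonically_equal(stored, query))  # canonical-equivalence comparison
print(len(stored))                       # the stored string is still two code points
```

The point of the design is that normalization happens at comparison time rather than at storage or output time, so the original sequence is what gets persisted or transmitted.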
Peter
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Roozbeh Pournader
Sent: Tuesday, May 24, 2011 4:28 PM
To: Phillips, Addison
Cc: Christoph Päper; Unicode Discussion
Subject: RE: Slots for Cyrillic Accented Vowels
On Mon, 2011-05-23 at 08:17 -0700, Phillips, Addison wrote:
[...] you generally should not emit NFD "into the wild"
In the real world, of course, you should actually not emit NFC either. A famous case that comes back to bite me again and again is that some XP-era Microsoft applications don't render canonically equivalent strings the same way, so if you normalize something, you lose its preferred display and semantics. For example, the sequence <ARABIC LETTER SEEN, ARABIC SHADDA, ARABIC FATHA>, which is a very normal and rather common sequence in Arabic, becomes <SEEN, FATHA, SHADDA> if one actually normalizes it (to either NFC or NFD), and that reordered sequence is displayed wrongly by Windows XP's Uniscribe, in both Notepad and Word 2003.
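The reordering of that Arabic sequence can be checked with Python's standard unicodedata module. This is a minimal sketch; the variable and helper names are illustrative, and it shows only the code point order after normalization, not how any renderer displays it.

```python
import unicodedata

SEEN, SHADDA, FATHA = "\u0633", "\u0651", "\u064E"
original = SEEN + SHADDA + FATHA

def names(text):
    """Return the Unicode character name of each code point."""
    return [unicodedata.name(ch) for ch in text]

nfc = unicodedata.normalize("NFC", original)
nfd = unicodedata.normalize("NFD", original)

print("original:", names(original))
print("NFC:     ", names(nfc))
print("NFD:     ", names(nfd))

# SHADDA has a higher canonical combining class than FATHA,
# so canonical ordering places FATHA first in both forms.
print("classes: ", unicodedata.combining(SHADDA), unicodedata.combining(FATHA))
```

Both normalization forms give the same result here because none of these characters has a canonical decomposition or composition; only the canonical ordering step applies.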
Roozbeh