From: Eric Muller (emuller@adobe.com)
Date: Sat Mar 17 2007 - 10:20:20 CST
Daniel Ehrenberg wrote:
> I'm just wondering, are there any other programming languages that
> handle Unicode by storing strings in a consistently normalized form?
I don't know of any, but you should realize that this comes at a
functional cost.
Consider writing a text editor and consider the Windows Vietnamese
keyboard. Because of the layout of this keyboard, data entered with it
is not in a normalized form; for example, ễ is entered by hitting two
keystrokes, the first generating U+00EA ê LATIN SMALL LETTER E WITH
CIRCUMFLEX, the second generating U+0303 ◌̃ COMBINING TILDE. Your
approach means that the stored text is either <U+1EC5 ễ LATIN SMALL
LETTER E WITH CIRCUMFLEX AND TILDE> (if you choose NFC) or <U+0065 e
LATIN SMALL LETTER E, U+0302 ◌̂ COMBINING CIRCUMFLEX ACCENT, U+0303 ◌̃
COMBINING TILDE> (if you choose NFD). In either case, the number of
characters seen by the editor and the number of keystrokes do not match.
If you want to build your editor so that <any key, delete> is a no-op,
then you need to compensate for this mismatch, and in fact you need to
have detailed knowledge of the keyboard in your editor. This sounds a
bit much to me.
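To make the mismatch concrete, here is a minimal sketch in Python using the standard unicodedata module. The variable names are only illustrative, and the string `typed` is an assumption standing in for what the two keystrokes described above would produce.

```python
import unicodedata

# What the two keystrokes deliver: U+00EA followed by U+0303.
typed = "\u00ea\u0303"

# What a runtime that always normalizes would store instead.
nfc = unicodedata.normalize("NFC", typed)  # single code point U+1EC5
nfd = unicodedata.normalize("NFD", typed)  # U+0065, U+0302, U+0303

for label, text in (("typed", typed), ("NFC", nfc), ("NFD", nfd)):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {len(text)} code point(s): {code_points}")

# Two keystrokes went in, but neither normalized form holds two code
# points, so "delete one character" cannot simply undo one keystroke.
```

Neither stored form has the same length as the typed sequence (2 code points typed, 1 under NFC, 3 under NFD), which is the bookkeeping the editor would have to compensate for.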
Another area where normalization is painful is when you intend to support
character sets other than Unicode and achieve that by using the
round-trip capabilities of Unicode (the tenth design principle:
"Accurate convertibility is guaranteed between the Unicode Standard and
other widely accepted standards"). These round-trip capabilities are
guaranteed only if data is not normalized on the way. The most obvious
case is the CJK compatibility ideographs, which have been encoded
precisely for the purpose of round-tripping, yet disappear if
normalization is applied.
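As a small illustration of the compatibility ideographs disappearing, the following Python sketch normalizes one of them with the standard unicodedata module. The choice of U+F900 is mine, picked only as an example, and the names `compat` and `results` are illustrative.

```python
import unicodedata

# U+F900 is a CJK compatibility ideograph; its canonical decomposition
# is the unified ideograph U+8C48.
compat = "\uf900"

results = {
    form: unicodedata.normalize(form, compat)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in text))

# Every normalization form replaces U+F900 with U+8C48, so the
# distinction that a source character set relied on is gone, and a
# conversion back to that character set can no longer recover it.
```

Since even NFC replaces the compatibility ideograph, there is no normalization form that would keep this kind of data round-trippable.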
Personally, my rule of thumb (when building software) is to not
normalize until explicitly asked by the user, or unless I know that the
resulting data will have limited uses with which normalization does not
interfere. The lower in the food chain my software is (and a general
purpose programming language runtime is about as low as one can get) the
more I follow this rule.
Eric.