RE: Normalization forms

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon May 13 2002 - 18:07:31 EDT


Hi Lars,

Some information below...

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Lars Marius Garshol
> Sent: Monday, May 13, 2002 1:38 PM
> To: unicode@unicode.org
> Subject: Normalization forms
>
>
>
> I have been reading the Unicode Normalization UTR and have a couple of
> questions regarding it:
>
> - will string comparison methods based on NFC and NFD always give the
> same results?

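The same results compared to what? Write {C} and {c} for two strings in
Form C, and {D} and {d} for the same two strings in Form D. If you mean:

if {C} == {c} then {D} == {d}, then the answer is yes.

If you mean:

if {C} == {c} then {C} == {d}, then the answer is no, not in general. The
two forms are not interchangeable: a string in Form C is not necessarily
identical, code point for code point, to the same string in Form D.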
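
A minimal sketch of the two cases in Python, using the standard
unicodedata module. The sample strings and the helper name equal_under
are illustrative assumptions, not anything from the UTR:

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def equal_under(form, a, b):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Both sides normalized to the same form: NFC and NFD agree.
same_nfc = equal_under("NFC", composed, decomposed)
same_nfd = equal_under("NFD", composed, decomposed)

# One side in Form C, the other in Form D: a binary comparison fails
# even though the strings are canonically equivalent.
mixed = (unicodedata.normalize("NFC", composed)
         == unicodedata.normalize("NFD", decomposed))
```

Here same_nfc and same_nfd are both True, while mixed is False.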

>
> - is it correct that methods based on NFKC and NFKD will give
> different results from ones based on NFC/NFD?

Yes. Emphatically. For example:

U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) remains U+FF21 in Form C and
does not equal U+0041 (LATIN CAPITAL LETTER A).

but:

U+FF21 in Form KC becomes U+0041, so the two compare equal.
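
The same example as a short Python sketch (the variable names are only
for illustration):

```python
import unicodedata

fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Form C keeps the compatibility character as it is.
nfc = unicodedata.normalize("NFC", fullwidth_a)

# Form KC folds it into the ordinary Latin capital A.
nfkc = unicodedata.normalize("NFKC", fullwidth_a)

equal_in_c = nfc == "A"
equal_in_kc = nfkc == "A"
```

equal_in_c is False and equal_in_kc is True, so a comparison based on
the K forms gives a different answer from one based on Form C or D.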

>
> - if NFC and NFD give the same results, why are both specified? Why
> would an implementation choose one over the other?

Again, the question is what you mean by "results". The composed form is
a different sequence of code points from the decomposed one. The composed
form is generally more compatible with what naive rendering software
expects. The decomposed form, by comparison, makes certain kinds of
processing more efficient (for example, certain kinds of collation
processing).
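
As a rough illustration only (this is not a collation algorithm, and
the sample word is an assumption), the sketch below shows that the two
forms differ in length and that the decomposed form exposes base letters
and combining marks as separate code points:

```python
import unicodedata

word = "r\u00e9sum\u00e9"  # "resume" with two precomposed e-acute letters

nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# In Form D each accent is its own code point, so a process that only
# cares about base letters can simply skip the combining marks.
base_letters = "".join(ch for ch in nfd if not unicodedata.combining(ch))

lengths = (len(nfc), len(nfd))
```

The Form C string has 6 code points and the Form D string has 8;
base_letters comes out as plain "resume".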

>
> - NFKC/NFKD seem to lose significant information; in what contexts
> are they intended to be used?

They have a number of useful contexts. Namespaces are one. Generally
speaking, the vast majority of characters unified by the compatibility forms
differ only in rendering (half-width forms, superscripts and subscripts, and
the like), and such differences cause trouble in restricted namespaces
(programming identifiers, domain names, and so on). In addition, it is often
possible to extract more meaning from data input fields by applying the K
forms.

For example, in some of the webMethods tools GUIs, strings that do not parse
successfully as numbers on the first pass are normalized to Form KC (except
for super/subscripts) in order to improve parsing success.

>
> --
> Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >
>
>
>


