Re: Unicode Search Engines

From: Mark Davis (mark@macchiato.com)
Date: Wed Feb 20 2002 - 22:50:00 EST


> > Documents not in UTF-* are normalized by definition, unless it is
> > *impossible* to convert them to normalized Unicode (typically
> > because they contain characters not yet present in Unicode).

I think this goes too far. By definition, a normalizing character
encoding converter (of a particular type: NFC or NFD) always produces
normalized Unicode (of the respective type).

Many character encodings are, if converted 1:1 to Unicode,
automatically NFC. However, many other character encodings are, if
converted 1:1 to Unicode, not automatically NFC. For example, if the
legacy encoding has any combining marks, then they have to be
correctly ordered when converted into Unicode if the result is to be
normalized. Other encodings may require splitting or combining as
well. A 1:1 character converter will *not* produce the right result.
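
To make this concrete, here is a minimal sketch in Python using the ISCII example quoted further down in this thread. The table `ISCII_TO_UNICODE` and the helpers `convert_one_to_one` and `is_nfc` are hypothetical names introduced only for illustration. The table covers just the three bytes under discussion and is not a real ISCII codec.

```python
import unicodedata

# Hypothetical 1:1 mapping table, limited to the three ISCII bytes
# mentioned in this thread. It is an assumption for illustration only.
ISCII_TO_UNICODE = {
    0xCF: "\u0930",  # DEVANAGARI LETTER RA
    0xD0: "\u0931",  # DEVANAGARI LETTER RRA
    0xE9: "\u093C",  # DEVANAGARI SIGN NUKTA
}

def convert_one_to_one(data: bytes) -> str:
    """Map each legacy byte to one code point, with no normalization."""
    return "".join(ISCII_TO_UNICODE[b] for b in data)

def is_nfc(text: str) -> bool:
    """Return True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

ra_nukta = convert_one_to_one(bytes([0xCF, 0xE9]))  # RA + NUKTA
rra = convert_one_to_one(bytes([0xD0]))             # precomposed RRA

print(is_nfc(ra_nukta))  # False: NFC composes U+0930 U+093C into U+0931
print(is_nfc(rra))       # True

# Ordering matters as well: an acute accent (combining class 230) placed
# before a dot below (combining class 220) is not in canonical order.
print(is_nfc("a\u0301\u0323"))  # False
```

Both inputs to the converter are valid legacy data, yet only one of them comes out in NFC, so the result of a 1:1 mapping depends on what the source document happens to contain.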

Simply saying that a document is "normalized by definition" if it is
*possible* to convert it to normalized Unicode would ignore reality,
since converters may not *actually* convert it to normalized Unicode.
The Character Model would need an additional requirement: any XML
parser that converts an XML document from a legacy character set into
Unicode is not conformant unless it is (actually) normalizing.
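
A normalizing converter in the sense above can be sketched as a 1:1 mapping followed by an explicit normalization pass. This is only an illustration under stated assumptions: `ISCII_TO_UNICODE` and `convert_normalizing` are hypothetical names, the table covers only the bytes discussed in this thread, and a real converter might normalize incrementally rather than in a second pass.

```python
import unicodedata

# Hypothetical 1:1 mapping table, limited to the three ISCII bytes
# mentioned in this thread. It is an assumption for illustration only.
ISCII_TO_UNICODE = {
    0xCF: "\u0930",  # DEVANAGARI LETTER RA
    0xD0: "\u0931",  # DEVANAGARI LETTER RRA
    0xE9: "\u093C",  # DEVANAGARI SIGN NUKTA
}

def convert_normalizing(data: bytes, form: str = "NFC") -> str:
    """Convert legacy bytes 1:1, then normalize to the requested form."""
    raw = "".join(ISCII_TO_UNICODE[b] for b in data)
    return unicodedata.normalize(form, raw)

# RA + NUKTA and the precomposed RRA now yield the same Unicode text.
print(convert_normalizing(bytes([0xCF, 0xE9])) == "\u0931")  # True
print(convert_normalizing(bytes([0xD0])) == "\u0931")        # True

# An NFD converter decomposes instead, again giving one consistent result.
print(convert_normalizing(bytes([0xD0]), "NFD") == "\u0930\u093C")  # True
```

With the normalization step in place, the output is normalized regardless of which of the two equivalent legacy spellings the document used.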

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: "'John Cowan'" <cowan@mercury.ccil.org>
Cc: "'Stefan Probst'" <stefan.probst@opticom.v-nam.net>; "Doug Ewell"
<dewell@adelphia.net>; "Unicode List" <unicode@unicode.org>
Sent: Wednesday, February 20, 2002 11:04
Subject: RE: Unicode Search Engines

> John Cowan wrote:
> > Documents not in UTF-* are normalized by definition, unless it is
> > *impossible* to convert them to normalized Unicode (typically
> > because they contain characters not yet present in Unicode).
>
> Is that true for all encodings?
>
> E.g., ISCII 0xCF + 0xE9 (LETTER RA + SIGN NUKTA) corresponds to
> Unicode U+0930 + U+093C (DEVANAGARI LETTER RA + DEVANAGARI SIGN
> NUKTA), which is not NFC: it should be U+0931 (DEVANAGARI LETTER RRA).
>
> What should the recipient do when it receives such an ISCII sequence?
> Refuse it because it is not normalized (ISCII itself also contains
> 0xD0, LETTER RRA), or "fix" it while converting it to Unicode?
>
> _ Marco
>
>


