Re: Unicode Search Engines

From: Mark Davis (mark@macchiato.com)
Date: Thu Feb 21 2002 - 11:54:54 EST


I had misremembered the CharModel on this point when I wrote: "One
would have to have the additional requirement in the Character Model,
that any XML parser that converts an XML document from a legacy
character set into Unicode is not conformant unless it is (actually)
normalizing." The Character Model already contains that stipulation.

However, the following condition is vacuous:

> unless i) a normalizing transcoder cannot exist for that encoding

- It is always possible for a normalizing transcoder to exist, since
one can always be built by composing an ordinary transcoder with a
normalizer (see the sketch below).
- And it is always possible to transcode from any other character set
into Unicode, using PUA code points in the unusual cases.
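
A minimal sketch of that composition, in modern Java (the class and
method names are mine; java.text.Normalizer is the JDK normalizer):

    import java.nio.charset.Charset;
    import java.text.Normalizer;

    // A "normalizing transcoder" is just an ordinary decoder with a
    // normalization step appended to it.
    public final class NormalizingTranscoder {
        public static String transcode(byte[] legacyBytes, Charset legacy) {
            String raw = new String(legacyBytes, legacy);          // plain transcoding
            return Normalizer.normalize(raw, Normalizer.Form.NFC); // normalization step
        }
    }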

So taking John's original statement:

> > Documents not in UTF-* are normalized by definition, unless it is
> > *impossible* to convert them to normalized Unicode (typically
> > because they contain characters not yet present in Unicode).

According to the CharModel, it should be simplified to:

"Documents not in UTF-* are normalized by definition."

The point I am concerned about, however, is that all of this seems to
"define away" the real issue: there are transcoders out in the world
that are not normalizing, and parsers that use them will not produce
the right results unless they normalize the text themselves.
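
Concretely, a parser that cannot trust its transcoder needs a guard
like the following (again a sketch in modern Java; ensureNfc is a
name of my own invention):

    import java.text.Normalizer;

    public final class ParserGuard {
        // Repair the output of a possibly non-normalizing transcoder.
        public static String ensureNfc(String transcoded) {
            return Normalizer.isNormalized(transcoded, Normalizer.Form.NFC)
                ? transcoded
                : Normalizer.normalize(transcoded, Normalizer.Form.NFC);
        }
    }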

Mark
—————

Γνῶθι σαυτόν — Θαλῆς ["Know thyself", Thales]
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "François Yergeau" <francois@yergeau.com>
To: "Unicode List" <unicode@unicode.org>
Cc: "w3c-i18n-ig" <w3c-i18n-ig@w3.org>
Sent: Thursday, February 21, 2002 07:51
Subject: Re: Unicode Search Engines

> Mark Davis wrote:
>
> > Simply saying that a document is "normalized by definition" if it is
> > *possible* to convert it to Unicode would ignore reality, since
> > converters may not *actually* convert it to normalized Unicode.
>
>
> And consequently that is not what the Character Model says. It says
> that legacy data is normalized if it is possible to convert it to
> *normalized* Unicode: "unless i) a normalizing transcoder cannot exist
> for that encoding".
>
> > One would have to have the additional requirement in the Character
> > Model, that any XML parser that converts an XML document from a
> > legacy character set into Unicode is not conformant unless it is
> > (actually) normalizing.
>
>
> This is what the Character Model actually says: "[I] Implementations
> which transcode text data from a legacy encoding to a Unicode encoding
> form MUST use a normalizing transcoder."
>
>
> Marco Cimarosti wrote:
> >>E.g., ISCII 0xCF + 0xE9 (LETTER RA + SIGN NUKTA) corresponds to
> >>Unicode U+0930 + U+093C (DEVANAGARI LETTER RA + DEVANAGARI SIGN
> >>NUKTA), which is not NFC: it should be U+0931 (DEVANAGARI LETTER RRA).
> >>
> >>What should the recipient do when it receives such an ISCII
> >>sequence? Refuse it because it is not normalized (ISCII itself also
> >>contains 0xD0, LETTER RRA), or "fix" it while converting it to Unicode?
>
>
> Fix it.
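
To make the "fix" concrete: NFC itself composes the pair. A quick
check in modern Java, using the code points from Marco's message (the
class name is mine):

    import java.text.Normalizer;

    public final class IsciiFixDemo {
        public static void main(String[] args) {
            // A non-normalizing ISCII transcoder would emit the
            // decomposed pair; NFC composes it to the single letter.
            String decomposed = "\u0930\u093C"; // RA + NUKTA
            String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc.equals("\u0931")); // true: LETTER RRA
        }
    }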
>
> --
> François Yergeau