Re: Unicode Search Engines

From: Mark Davis (mark@macchiato.com)
Date: Wed Jan 30 2002 - 10:42:28 EST


yes, thanks.

marq
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: <Misha.Wolf@reuters.com>
To: "Mark Davis" <mark@macchiato.com>
Cc: <unicode@unicode.org>
Sent: Wednesday, January 30, 2002 07:48
Subject: Re: Unicode Search Engines

>
> On 30/01/2002 15:30:06 Mark Davis wrote:
> > It is not a 'fatal flaw'. NFD makes to pretensions to represent the
>
> I imagine that "to" -> "no".
>
> Misha
>
> > most 'natural' ordering for any given language. Out of all the
> > possible canonically equivalent sequences, it is simply a specific,
> > well-defined, unique representation that is fully decomposed.
> >
> > The issue of canonical equivalence itself is that the circumflex
> > and dot-below can come in any order and have precisely the same
> > appearance, *and* that we could not predict the 'natural' order for
> > any given language.
> >
> > Mark
> >
> > ----- Original Message -----
> > From: <DougEwell2@cs.com>
> > To: <unicode@unicode.org>
> > Cc: <stefan.probst@opticom.v-nam.net>
> > Sent: Tuesday, January 29, 2002 22:51
> > Subject: Re: Unicode Search Engines
> >
> >
> > > In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> > > stefan.probst@opticom.v-nam.net writes:
> > >
> > > > I would like to add:
> > > > How do they handle normalization?
> > > > In Vietnam, many characters can be represented in several
> > > > different ways:
> > > > (1) fully precomposed (NFC)
> > > > (2) base character and modifier precomposed, tonal mark combining
> > > > (3) base character, then modifier, then tonal mark
> > > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > > Do the search engines do any normalization, before indexing a page?
> > > > Are queries normalized before running the search?
> > >
> > > I'm not sure what sort of normalization might be performed by
> > > search engines, but I want to examine the Vietnamese decomposition
> > > aspect for a moment.
> > >
> > > If you have a Vietnamese vowel with both modifier and tone mark,
> > > say LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can
> > > represent this in Unicode in at least three ways:
> > >
> > > (1) fully precomposed (NFC) -- that is, U+1EA4
> > > (2) base character and modifier precomposed, tonal mark combining --
> > >     that is, U+00C2 U+0301
> > > (3) base character, then modifier, then tonal mark -- that is,
> > >     U+0041 U+0302 U+0301
> > >
> > > So far, so good. But then we have:
> > >
> > > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > >
> > > If "sorting" the diacritical marks in NFD results in rearranging
> > > the two diacritical marks -- in this case, U+0041 U+0301 U+0302 --
> > > then in terms of Vietnamese orthography, the NFD form may not
> > > really be a legitimate way of representing the Vietnamese letter.
> > >
> > > For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
> > > BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot
> > > below) has been added. It is not a dotted-below A to which a
> > > circumflex has been added. Yet because of the canonical combining
> > > classes of the two diacriticals (230 for COMBINING CIRCUMFLEX
> > > ACCENT, 220 for COMBINING DOT BELOW), the latter is how the
> > > character will be decomposed.
> > >
> > > In theory, there is actually a case 5: base character and tonal
> > > mark precomposed, modifier combining. In terms of Vietnamese
> > > orthography, this is just as illegitimate as case 4 (NFD), but most
> > > software that processes Vietnamese text will probably never
> > > encounter it. But it will have to handle the NFD case.
> > >
> > > If I were on some other mailing lists I could think of, I would
> > > claim that this is a fatal flaw in the design of Unicode
> > > Normalization Form D. It's not, but it is a sticky problem that
> > > needs to be dealt with when dealing with Vietnamese text.
> > >
> > > -Doug Ewell
> > > Fullerton, California
> > >
> > >
> >
> >
>
>
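
The decompositions discussed in the quoted messages can be checked with
a short sketch. It assumes Python 3 and its standard unicodedata module;
the helper names codepoints() and search_key() are hypothetical, made up
for this illustration, and choosing NFC as the common form for indexed
text and queries is only one possible choice, not something the thread
prescribes.

```python
import unicodedata


def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def search_key(text):
    """Hypothetical helper: map canonically equivalent text to one form."""
    return unicodedata.normalize("NFC", text)


# The three spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
# listed in the quoted message.
spellings = ["\u1EA4", "\u00C2\u0301", "\u0041\u0302\u0301"]
for s in spellings:
    nfd = unicodedata.normalize("NFD", s)
    print(f"{codepoints(s):<22} key={codepoints(search_key(s))}"
          f"  NFD={codepoints(nfd)}")

# Combining classes of the marks involved. Circumflex and acute share a
# class, so canonical ordering leaves their relative order alone.
# Circumflex and dot below differ, so the lower class is sorted first.
for mark in ("\u0302", "\u0301", "\u0323"):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# U+1EAC, the circumflex-and-dot-below case from the quoted message.
nfd_1eac = unicodedata.normalize("NFD", "\u1EAC")
print(codepoints(nfd_1eac))

# Typing circumflex first and dot below second is canonically equivalent,
# so both spellings end up with the same key.
typed = "\u0041\u0302\u0323"
print(search_key(typed) == search_key("\u1EAC"))
```

If a search engine applied something like search_key() both when
indexing a page and when receiving a query, the spellings above would
match each other regardless of which one the author or the user typed.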


