RE: Unicode Search Engines

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Feb 20 2002 - 14:04:32 EST


John Cowan wrote:
> Documents not in UTF-* are normalized by definition, unless it is
> *impossible* to convert them to normalized Unicode (typically
> because they contain characters not yet present in Unicode).

Is that true for all encodings?

E.g., ISCII 0xCF + 0xE9 (LETTER RA + SIGN NUKTA) corresponds to Unicode
U+0930 + U+093C (DEVANAGARI LETTER RA + DEVANAGARI SIGN NUKTA), which is not
NFC: it should be U+0931 (DEVANAGARI LETTER RRA).

What should the recipient do when it receives such an ISCII sequence? Refuse
it because it is not normalized (ISCII itself also contains 0xD0, LETTER
RRA), or "fix" it while converting it to Unicode?
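
To make the two options concrete, here is a rough sketch in Python of what
"fixing" during conversion might look like. The mapping table is hypothetical
and covers only the three ISCII bytes mentioned above (Python's standard
library has no ISCII codec as far as I know), and the function name
iscii_to_unicode is made up for illustration. The only real API used is
unicodedata.normalize.

```python
import unicodedata

# Hypothetical partial ISCII-to-Unicode table: only the three bytes
# discussed in this message. A real converter needs the full table.
ISCII_TO_UNICODE = {
    0xCF: "\u0930",  # LETTER RA
    0xD0: "\u0931",  # LETTER RRA
    0xE9: "\u093C",  # SIGN NUKTA
}

def iscii_to_unicode(data: bytes, normalize: bool = True) -> str:
    """Map ISCII bytes one-to-one, then optionally compose to NFC."""
    text = "".join(ISCII_TO_UNICODE[b] for b in data)
    return unicodedata.normalize("NFC", text) if normalize else text

# Naive one-to-one mapping of RA + NUKTA: two code points, not NFC.
raw = iscii_to_unicode(b"\xCF\xE9", normalize=False)

# Same input, "fixed" while converting: a single precomposed code point.
fixed = iscii_to_unicode(b"\xCF\xE9")

print([f"U+{ord(c):04X}" for c in raw])
print([f"U+{ord(c):04X}" for c in fixed])

# A recipient that prefers to refuse could instead test whether the
# naive conversion is already in NFC.
print(unicodedata.normalize("NFC", raw) == raw)
```

With normalization enabled, 0xCF + 0xE9 and 0xD0 come out as the same Unicode
string, so the converter loses the distinction between the two ISCII
spellings. Whether that is acceptable is exactly the question.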

_ Marco


