Re: Reviewing IETF documents

From: DougEwell2@cs.com
Date: Mon Apr 16 2001 - 14:18:30 EDT


In a message dated 2001-04-16 9:19:36 Pacific Daylight Time,
Mike_Ayers@bmc.com writes:

> Is there an existing set of recommendations for dealing with this
> problem (multiple legal compositions) in search and search-like
> applications? Specifically, if there are multiple legal ways to represent
> a character, how should the character be stored, should search text be
> preprocessed, etc.? Pointers, anyone?

The UTF-8 Corrigendum that went into effect with (or shortly before?) Unicode
3.0.1 clarified that only one UTF-8 sequence -- the shortest one -- is
acceptable for any given Unicode character. This is now part of Unicode 3.1,
so check Unicode Standard Annex #27 at
http://www.unicode.org/unicode/reports/tr27/ .
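
As a rough illustration of the shortest-form rule (a sketch only, not text
from the annex): the slash U+002F has the one-byte form 0x2F, and the
two-byte and three-byte "overlong" forms below are no longer acceptable.
The helper name is_well_formed_utf8 is made up for this example, and it
relies on Python's strict UTF-8 decoder, which rejects every kind of
ill-formed input, not only overlong sequences.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    Hypothetical helper for illustration. Python's strict decoder refuses
    non-shortest ("overlong") sequences, along with any other ill-formed
    byte sequence.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+002F SOLIDUS in its shortest (and only acceptable) form.
SLASH_SHORTEST = b"\x2f"
# The same code point padded out to two and three bytes (overlong forms).
SLASH_OVERLONG_2 = b"\xc0\xaf"
SLASH_OVERLONG_3 = b"\xe0\x80\xaf"

if __name__ == "__main__":
    for label, raw in [
        ("shortest", SLASH_SHORTEST),
        ("overlong, 2 bytes", SLASH_OVERLONG_2),
        ("overlong, 3 bytes", SLASH_OVERLONG_3),
    ]:
        print(label, raw.hex(), is_well_formed_utf8(raw))
```

Note that this only settles the encoding question. It says nothing about
precomposed versus combining-sequence representations of a character.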

Otherwise, this sounds like it falls into the domain of normalization forms,
and for that you can check Unicode Standard Annex #15 at
http://www.unicode.org/unicode/reports/tr15/ .
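
For the search case, one plausible approach (an assumption on my part, not
a recommendation quoted from the annex) is to normalize both the stored
text and the search text to the same form before comparing. The sketch
below uses Python's standard unicodedata.normalize; the function names
normalize_for_search and contains are invented for the example, and NFC is
an arbitrary choice of form.

```python
import unicodedata


def normalize_for_search(text: str, form: str = "NFC") -> str:
    """Bring text into a single normalization form (NFC by default).

    Hypothetical helper: the choice of NFC here is an assumption, and
    NFD, NFKC or NFKD may suit a given application better.
    """
    return unicodedata.normalize(form, text)


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores differences in composition."""
    return normalize_for_search(needle) in normalize_for_search(haystack)


# Two legal representations of the same text:
PRECOMPOSED = "caf\u00e9"    # e-acute as the single code point U+00E9
DECOMPOSED = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

if __name__ == "__main__":
    # A naive comparison treats the two spellings as different strings.
    print(PRECOMPOSED == DECOMPOSED)
    # After normalizing both sides, the search finds the match.
    print(contains("un " + DECOMPOSED + " noir", PRECOMPOSED))
```

Normalizing once at storage time and once per query keeps the comparison
itself a plain string match.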

-Doug Ewell
 Fullerton, California


