Decomposable characters with marks (or other combining characters)...

From: Aleksandar Susnjar (shule@planet-intra.com)
Date: Fri Aug 06 1999 - 15:53:48 EDT


We are developing a web server product with multilingual (international) support, where every document can contain multiple languages. All client-server communication is being ported to UTF-8 as we move towards internationalization. I've read lots of documents about i18n and Unicode, but some things still bother me. I cannot find the answers anywhere, and digging them out myself would be too time-consuming. I hope that you can give me those answers...

Our product has a document indexing and search feature similar to, e.g., www.altavista.com. It indexes all documents and stores references to all tokens in them. Before we began i18n, it simply lowercased all words before indexing and before the actual search, so that the word "Word" is found even when a user looks for "WORD", "word" or "WoRd". With the introduction of Unicode, the lowercasing must be replaced with something that we call 'normalization' or 'regularization'. This would include decomposing all decomposable characters, lowercasing all characters that have a lowercase form, and mapping all other (e.g. CJKV) characters into (possibly) other, but equivalent, characters.
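To make the idea concrete, here is a minimal sketch of such a regularization step. It assumes Python's standard unicodedata module and assumes that compatibility decomposition (NFKD) is the kind of decomposition meant; the function name regularize is hypothetical, and the CJKV mapping step is left out because no table for it is known here.

```python
import unicodedata


def regularize(token: str) -> str:
    """Decompose and lowercase a token before indexing or searching.

    Hypothetical helper: NFKD is an assumption about which
    decomposition the indexer should use.
    """
    # NFKD applies canonical and compatibility decompositions recursively
    # and puts combining marks into canonical order.
    decomposed = unicodedata.normalize("NFKD", token)
    # Lowercase everything that has a lowercase form; other characters
    # (digits, ideographs, marks) pass through unchanged.
    lowered = decomposed.lower()
    # A second pass is harmless and guards against lowercase mappings
    # that yield characters which are themselves decomposable.
    return unicodedata.normalize("NFKD", lowered)


# The same key is produced for every casing of the same word.
print(regularize("WoRd") == regularize("WORD") == regularize("word"))
```

The same function would be applied both when indexing a document and when processing a query, so that both sides are compared in the same form.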

Decomposing characters is easy. So is lowercasing them (if they have a lowercase version). Characters that are upper- or titlecase and have no lowercase version can stay as they are; they will not cause any problems for our search engine. What is ambiguous to us is what we should do in the following cases:

1. A decomposable character (one that decomposes into several letter characters) is followed by one or more marks (combining characters). Where do the marks apply? What if the same mark is specified multiple times?

Example:

[01F1] [030C]
should be decomposed (during regularization) to:

a) [0044] [005A] [030C]
b) [0044] [030C] [005A]
c) [0044] [030C] [005A] [030C]
d) something else... what?

Translated:
[DZ] [Combining Caron]
 
a) [D] [Z] [Combining Caron]
b) [D] [Combining Caron] [Z]
c) [D] [Combining Caron] [Z] [Combining Caron]
d) something else... what?
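
One way to see what an existing normalizer does with this sequence is the short check below. It is only a sketch, assuming Python's unicodedata module and NFKD as the decomposition; other normalization choices may behave differently.

```python
import unicodedata

# [DZ] [Combining Caron]
text = "\u01F1\u030C"

# Compatibility decomposition of the whole sequence.
decomposed = unicodedata.normalize("NFKD", text)

# Show the result as four-digit hex code points, as in the lists above.
code_points = [f"{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['0044', '005A', '030C']
```

Under this assumption the result matches option a): the mark is neither moved nor duplicated, so it ends up following the last character of the expansion.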

-------------------------------------------------------------------

2. A decomposable character (one that decomposes into a character and a mark) is followed by another (or even the same?) mark or marks (combining characters). What if the marks are the same? Should we 'collapse' the two identical marks into one?

Example:

[01C4] [030C]

where:

[01C4] -> [0044] [017D]
&
[017D] -> [005A] [030C]

should be decomposed (during regularization) to:

a) [0044] [005A] [030C] [030C]
b) [0044] [005A] [030C]
c) [0044] [030C] [005A] [030C]
d) [0044] [005A]
e) something else... what?

Translation:

[DZ with Caron] [Combining Caron]

where:

[DZ with Caron] -> [D] [Z with Caron]
&
[Z with Caron] -> [Z] [Combining Caron]

should be decomposed (during regularization) to:

a) [D] [Z] [Combining Caron] [Combining Caron]
b) [D] [Z] [Combining Caron]
c) [D] [Combining Caron] [Z] [Combining Caron]
d) [D] [Z]
e) something else... what?
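
The same kind of check can be run for this case. Again this is a sketch under the assumption that NFKD from Python's unicodedata module is the decomposition in question.

```python
import unicodedata

# [DZ with Caron] [Combining Caron]
text = "\u01C4\u030C"

# NFKD expands U+01C4 to D + Z-with-caron, then Z-with-caron to Z + caron,
# and leaves the explicitly typed caron in place after it.
decomposed = unicodedata.normalize("NFKD", text)

code_points = [f"{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['0044', '005A', '030C', '030C']
```

With this normalizer the two identical marks are kept, which matches option a). Collapsing them into one would be an extra, application-specific step on top of the decomposition.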

-------------------------------------------------------------------

2. A decomposable character that decomposes into multiple other characters of the same kind (e.g. 0xFDFA, "ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM", which decomposes into several words!) is followed by one or more combining/mark characters.
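
For illustration, the sketch below decomposes that ligature followed by a mark. The choice of U+064E (ARABIC FATHA) as the trailing mark is an arbitrary assumption made for the example, as is the use of NFKD from Python's unicodedata module.

```python
import unicodedata

# U+FDFA followed by one combining mark (U+064E, chosen only as an example).
ligature_plus_mark = "\uFDFA\u064E"

decomposed = unicodedata.normalize("NFKD", ligature_plus_mark)

# The expansion contains spaces, so it splits into several tokens.
words = decomposed.split()
print(len(decomposed), "code points,", len(words), "space-separated words")

# The mark is still the final code point, i.e. it now follows the last
# letter of the last word of the expansion.
print("mark is last:", decomposed[-1] == "\u064E")
```

So, at least with this normalizer, the mark ends up attached to the end of the whole expansion rather than being distributed over the words.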

-------------------------------------------------------------------

3. Surrogate characters - are they (some of them) decomposable? Can they be regularized? Where can we find tables?

-------------------------------------------------------------------

4. Are there such regularization/equivalence tables for CJKV ideographs? If yes, where? Are there equivalent CJKV ideographs? Some books say that there are! They mention that there can be as many as 20 different characters with the same meaning and pronunciation! Is this true?

-------------------------------------------------------------------

5. Even though Unicode 2.1 and Unicode 3.0 do not use surrogates, will they allocate the space used for surrogates for something else or is this space reserved for this purpose?

-------------------------------------------------------------------

6. Should surrogate characters be encoded in UTF-8 (for web use in Netscape 4+ and IE4+) as two UTF-8 sequences (one for each half of the surrogate pair, which results in six bytes), or using an extended 4-byte or 5-byte UTF-8 encoding of some UCS-4 space? How are surrogates mapped in UCS-4? Is UCS-4 == Unicode for non-surrogate characters?
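
To make the two alternatives concrete, here is a minimal sketch of the standard surrogate-pair arithmetic and of both byte layouts. The function name combine_surrogates is hypothetical, and the six-byte form is produced only for comparison (via Python's "surrogatepass" error handler), not as a recommendation.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the single value it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each half carries 10 bits; the pair addresses values from 0x10000 up.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = 0xD800, 0xDC00
scalar = combine_surrogates(high, low)

# Alternative 1: encode the combined value directly (four bytes).
four_byte = chr(scalar).encode("utf-8")

# Alternative 2: encode each surrogate half separately (three bytes each).
six_byte = b"".join(
    chr(half).encode("utf-8", "surrogatepass") for half in (high, low)
)

print(f"U+{scalar:04X}", four_byte.hex(), six_byte.hex())
# U+10000 f0908080 eda080edb080
```

The four-byte form is what results from encoding the combined value as a single character; the six-byte form treats the two halves as if they were independent characters.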

-------------------------------------------------------------------

I know that this is a lot of questions, but I need help desperately! If somebody can help, please do so!

Best regards,
Aleksandar Susnjar

Planet-Intra
1181 Oullette Avenue
Windsor, Ontario N8W4B3
Canada

Phone: +1-519-252-8109 x225


