From: Doug Ewell (doug@ewellic.org)
Date: Tue Nov 18 2008 - 20:41:45 CST
<abysta at yandex dot ru> wrote:
> If I need a multi-character letter “s with acute”, I have to choose 
> between 015B and 0073+0301. Wouldn’t it be better not to have to 
> choose?
In some ways, it probably would have been better.  It certainly would 
have made things simpler to understand and to explain.
However, at the time Unicode was conceived, it would have been 
impossible to persuade vendors and developers to make the switch from 
existing 8-bit character sets, such as those in the ISO 8859 family, 
unless most (if not all) of the mappings from these character sets to 
Unicode were 1-to-1.
At the same time, the Unicode pioneers realized that the set of 
letters-with-diacritics was more or less open-ended, and that encoding 
them all as precomposed characters would be somewhere between extremely 
time-consuming and downright impossible.  For this reason and others, 
the combining characters were also added.
When you choose between <015B> and <0073 0301>, you are essentially 
choosing a normalization form, and at that point, the rest of your 
decision process is fairly straightforward -- keep all your text in the 
same normalization form.  This means you would not want to use both 
<015B> and <0073 0301> in the same text.
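To make the "same normalization form" point concrete, here is a minimal 
sketch in Python using the standard unicodedata module.  The helper name 
to_form and the variable names are only illustrative, not part of any 
specification.

```python
import unicodedata

# The two spellings of "s with acute" discussed above.
precomposed = "\u015B"    # <015B>
decomposed = "s\u0301"    # <0073 0301>

def to_form(text, form="NFC"):
    """Return text converted to the given normalization form."""
    return unicodedata.normalize(form, text)

# Compared code point by code point, the two spellings differ.
raw_equal = precomposed == decomposed

# Once both are in the same form (either one), they compare equal.
nfc_equal = to_form(precomposed, "NFC") == to_form(decomposed, "NFC")
nfd_equal = to_form(precomposed, "NFD") == to_form(decomposed, "NFD")
```

Which form you pick matters less than applying it consistently, for 
example at the point where text enters your system.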
Sometimes there are external influences that steer you toward one form 
or another.  For example, the specifications for some protocols strongly 
recommend that you use Normalization Form C, in which you would use 
<015B> rather than <0073 0301>, but in which you would be obligated to 
use <04E9 0304> since there is no precomposed equivalent.
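A short Python sketch of that NFC behavior, again using the standard 
unicodedata module (the helper name nfc_code_points is hypothetical): 
the first sequence composes to a single code point, while the second 
stays as base letter plus combining macron.

```python
import unicodedata

def nfc_code_points(text):
    """Normalize text to NFC and list its code points as hex strings."""
    normalized = unicodedata.normalize("NFC", text)
    return [f"{ord(ch):04X}" for ch in normalized]

# <0073 0301> has a precomposed equivalent, so NFC composes it.
s_acute = nfc_code_points("s\u0301")

# <04E9 0304> has no precomposed equivalent, so NFC leaves it alone.
barred_o_macron = nfc_code_points("\u04E9\u0304")
```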
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages