From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Sep 27 2007 - 07:27:47 CDT
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of Dmitry Turin
> Sent: Thursday, September 27, 2007 08:11
> To: unicode@unicode.org
> Subject: Re[2]: marks
>
> William,
>
> WJP> can be done quite nicely
> WJP> using markup, e.g.: <font case="upper">foo</font> or whatever.
> (1) Your version is a markup language (like HTML) instead of simple text.
> I wrote about the usual case.
> (2) My proposal not only economize mark-place in table of encoding
> (what is important itself), but also simplifies comparison
> of various variants of spelling (all letters are lower-case,
> first letter is upper-case, all letters are upper-case),
> because comparison is reduced to comparison in one variant
> of spelling (all letters are lower-case).
For (2), your option is not needed. Case mappings, case folding and
collation are already standardized in the Unicode standard itself. Nothing
is wasted in the Unicode standard by the encoding of capitals.
You also seem to assume that capitals have the same semantics as small
letters (maybe this is true in Russian, but it does not apply to the many
languages that have strict rules about the use of capitals, and where
capitalization can even change the meaning). If you ignore capitals in such
languages, you'll find matches that are unrelated (take Italian for example:
"uno" is not a synonym of "UNO"). Even in proper names, your assumption that
only one leading capital is needed is WRONG: there may be NO capital on the
first letter (for example with prefixes), and/or a required capital in the
middle of a proper name, with NO separator or space between those parts of
the name.
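To make this concrete, here is a minimal Python sketch. The sample names and
the helper rebuild_from_lowercase are only illustrative assumptions, not part
of any proposal: the helper stores a name in lowercase and restores a single
leading capital, which is the scheme criticized above.

```python
# Illustrative sample names (an assumption for this sketch): some carry a
# lowercase prefix or a capital in the middle of the name.
names = ["Rossi", "McDonald", "d'Alembert", "van Gogh"]


def rebuild_from_lowercase(name: str) -> str:
    """Lowercase the name, then restore one leading capital only."""
    lowered = name.lower()
    return lowered[:1].upper() + lowered[1:]


# Names whose original capitalization is NOT recovered by that scheme.
lossy = [n for n in names if rebuild_from_lowercase(n) != n]
print(lossy)

# Ignoring case also merges words that a language may keep distinct:
# after case folding, "uno" and "UNO" are the same string.
print("uno".casefold() == "UNO".casefold())
```

Only "Rossi" survives the round trip here; the other three names come back
with the wrong capitalization.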
Really, your suggestion will just complicate things. Capitals have long been
considered separate letters, and have always been encoded separately (except
possibly in the early period of the telegraph, with very reduced alphabets
in which ONLY capitals could be used; this forced all lowercase letters to
be capitalized and made texts difficult to read, and there was not even
support for other needed distinctions such as accents).
Your suggestion just looks as if you wanted to return to the age of the
telegraph. In that case, you don't need Unicode at all, and not even 7-bit
ASCII: use the 6-bit or 5-bit Baudot-like encodings! And then try to
transport meaningful texts for many languages... You'll lose much more.
Stop your suggestions here, and consider the layered approach that
simplifies all these problems: Unicode has encoded a certain number of
characters only to offer roundtrip compatibility with widely used legacy
encodings (they would not be accepted if they were requested today without
prior use in other standards), but all the rest is encoded according to
principles, sets of rules and usage algorithms that make it work without
needing to encode too many characters.
Consider also the encoded capitals: how many will you find? Not so many.
They are a very small part of the assigned Unicode code points, and their
number does not grow much, because capitals only exist in bicameral
alphabetic scripts, and there are not so many scripts with this feature:
mainly Latin, Greek and Cyrillic, plus a few others such as Armenian.
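Anyone can check the order of magnitude with a short Python sketch. It
assumes the standard unicodedata module, so the exact figures depend on the
Unicode version bundled with the interpreter, and it counts only the general
category Lu (uppercase letter) against code points that are neither
unassigned, surrogate nor private-use.

```python
import sys
import unicodedata

uppercase = 0
assigned = 0
for cp in range(sys.maxunicode + 1):
    category = unicodedata.category(chr(cp))
    # Skip unassigned (Cn), surrogate (Cs) and private-use (Co) code points.
    if category in ("Cn", "Cs", "Co"):
        continue
    assigned += 1
    if category == "Lu":
        uppercase += 1

share = uppercase / assigned
print(f"{uppercase} uppercase letters among {assigned} assigned "
      f"code points ({share:.2%})")
```

The counts vary between Unicode versions, but the share of uppercase letters
stays at a few percent at most.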
Case mappings already work perfectly with those scripts, as does collation.
Case-insensitive searches pose no difficulty: the algorithms are extremely
simple and fast to implement. Handling letter variants is much more
difficult (letters with accents or other diacritics, and contractions in
collation, such as digraphs in some languages).
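As a rough illustration of the difference in difficulty, here is a Python
sketch. The function names are assumptions made for this example; casefold
and unicodedata.normalize are standard Python. Removing combining marks after
NFD decomposition is only a crude stand-in for real collation, and it does
not handle language-specific contractions such as digraphs.

```python
import unicodedata


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless match: compare the case-folded forms of both strings."""
    return a.casefold() == b.casefold()


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (accents, diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_ignoring_case_and_accents(a: str, b: str) -> bool:
    """Cruder match that also ignores accents; not a real collation."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


# Case alone is easy, in Latin, Greek and Cyrillic alike.
print(same_ignoring_case("Unicode", "UNICODE"))
print(same_ignoring_case("ΑΘΗΝΑ", "αθηνα"))
print(same_ignoring_case("МОСКВА", "москва"))

# Accents need more work: a precomposed letter and its decomposed
# spelling differ until both are normalized.
print(same_ignoring_case("\u00e9", "e\u0301"))
print(same_ignoring_case_and_accents("R\u00e9sum\u00e9", "resume"))
```

The case-only comparison is a one-liner, while the accent handling already
needs normalization and still ignores everything that depends on the
language.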
Your suggestion does not solve any problem that is not already solved; it
just adds more complexity (because it does not work with roundtrip
compatibility, and there are many other reasons why your solution is even
more complicated than what is already encoded now). You have completely
forgotten the goals of the Unicode standard.