From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 28 2007 - 16:01:51 CDT
Otto Stolz wrote:
> Dmitry Turin wrote:
> > but also simplifies comparison of various variants of spelling
> > (all letters are lower-case, first letter is upper-case, all
> > letters are upper-case), because comparison is reduced to
> > comparison in one variant of spelling (all letters are lower-case).
>
> This is plainly wrong. For, e. g., a case-invariant comparison,
> your proposition requires removal of your “marks”, whilst the
> Unicode way requires case folding. Both are commensurably cheap
> operations, on contemporary computers.
+1 for this argument: the proposal does not simplify anything, given that
processing (even if it looks simple, an assumption that effective linguistic
rules contradict) is still needed for case-insensitive searches!
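For illustration, here is a minimal sketch of what that case-insensitive
processing looks like with plain case folding (Python 3; I assume its
str.casefold() implements Unicode full case folding, so take the exact
outputs as my assumption rather than a normative statement; the helper name
is mine):

    # Case-insensitive comparison via Unicode case folding.
    # Full case folding makes "Weiß" and "WEISS" compare equal,
    # something that plain lowercasing would miss.
    import unicodedata

    def equal_ignoring_case(a: str, b: str) -> bool:
        # Normalize first, then fold: both are single cheap passes
        # over the string on contemporary computers.
        na = unicodedata.normalize("NFC", a)
        nb = unicodedata.normalize("NFC", b)
        return na.casefold() == nb.casefold()

    print(equal_ignoring_case("Weiß", "WEISS"))    # True
    print("Weiß".lower() == "WEISS".lower())       # False: lower() is not folding

The cost is the same with or without the proposed "marks"; nothing is
simplified.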
> Believe me, computer users are quite a conservative lot:
> they want their data to be readable, editable, and processable,
> for decades, if not for centuries.
+1 for this argument too. That's what I meant when I spoke about the goals of
Unicode (and of the ISO 10646 standard): preserve round-trip compatibility
with the past existing standards (i.e. encodings), terminate their
proliferation, which compromised the interoperability of systems (they
constantly needed to be updated to support and interpret more encodings), and
create a framework in which no newer encoding would ever need to be created to
remain interoperable, even for characters and scripts that are still not
encoded (so that Unicode-based implementations will continue to work
reasonably, with lots of immediately supported features, for characters and
scripts encoded in the future, as well as for scripts and languages still
unknown to existing software writers).
> You have also written:
> > "Widespread error is equating of designation of a letters (_coding_) and
> their graphic images (_font_). It’s absolutely different things".
>
> That error is definitely not widespread among the addressees of your
> remark;
> rather, they are used to the notions of “character” vs. “glyph”.
> However, most of them will agree that a capital A, a small a, a capital
> Αλφα, a small αλφα, a capital Аз, and a small аз are six different
> letters.
>
> But this has nothing to do with the encoding of those letters.
> It was a deliberate decision, based on a history of about 30 years of
> character encoding (before Unicode, as we know it), to assign six
> different
> code positions to those six characters, and not three or even only one.
Another thing to note: although the Greek Alpha looks like the Latin or
Cyrillic A, it behaves differently in association with combining characters,
and Greek offers several conventions for the placement of these characters
(see for example the special case of the iota subscript, notably in relation
to uppercase mapping, which depends on the Greek convention in use: historic
or modern).
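As a quick illustration of that iota-subscript special case (Python 3 again;
I believe its upper() and title() apply the full mappings from
SpecialCasing.txt, but verify the exact outputs yourself):

    # The iota subscript (ypogegrammeni) shows that case mapping is not a
    # trivial one-letter-for-one-letter substitution.
    alpha_ypo = "\u1FB3"      # ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI

    print(alpha_ypo.upper())  # 'ΑΙ' (U+0391 U+0399): expands to two capitals
    print(alpha_ypo.title())  # 'ᾼ'  (U+1FBC): titlecase keeps a single letter
                              #      with prosgegrammeni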
So before proposing something else, Dmitry has to prove that his proposal
would support AT LEAST all the special case mappings that Unicode already
supports, and that it offers superior capabilities to handle even more
critical cases. Our argument is that it is not even needed, given that the
existing algorithms are already widely implemented and do work, and that
Dmitry has not demonstrated anything regarding interoperability (which Unicode
has smartly and very conveniently preserved).
> † Armenian, Cyrillic, (Georgian), Greek, Latin; where Georgian
> has not a fully developed case system,
> cf. <http://www.unicode.org/versions/Unicode5.0.0/ch07.pdf>.
In fact, Georgian is not bicameral at all in its modern script. It used to be
bicameral, but with two separate alphabets, and Unicode now considers these as
two distinct scripts, where the modern script is unicameral; the extremely
rare use of a secondary alphabet taken from the historical script to make it
bicameral again is also ambiguous (due to the swapped meaning of some
historical letter forms).
I forgot Armenian as a bicameral script. This does not change things a lot.
Even Armenians do not always use their capitals: in many cases they are only a
stylistic option, used to write texts in all-caps style for titling or
monumental inscriptions. And the two sets of Armenian letters do not match
exactly: some capitals are missing, so the corresponding lowercase letters
need a complex mapping rule, just like the German Eszett, or the historical
Latin long s, which was used in initial or medial positions but not in final
position (the final form carried a distinction when it was used instead of the
usual long s in the middle of a compound word, or to show the difference
between a prefix and a longer radical), or similar distinctions in Greek for
letters in final form.
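To make this concrete, a small hedged sketch (Python 3, relying on what I
understand to be the Unicode full case mappings) of small letters that have no
one-to-one capital:

    # Small letters without a one-to-one capital: the mapping expands or merges.
    print("ß".upper())        # 'SS': the German Eszett has no traditional capital
    print("\u0587".upper())   # 'ԵՒ': the Armenian ech-yiwn ligature expands to
                              #       two capital letters
    print("\u017F".upper())   # 'S' : the historical Latin long s maps to plain S...
    print("S".lower())        # 's' : ...and lowercasing cannot bring it back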
In general, within the five bicameral scripts, the capitalisation of text
removes some semantic differences that cannot always be inferred back
correctly by converting the text back to small letters. There are more small
letters than capitals, simply because in those scripts the small letters are
the modern forms most widely used for normal text, so newer distinctive letter
forms have been added to the set of small letters without necessarily being
added to the historic set of capitals, which is now less often used except in
limited cases such as initials or titling.
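The German pair Masse/Maße is a simple example of the information loss I mean
(a hedged Python 3 sketch; other languages have analogous pairs):

    # Capitalisation is not reversible: two distinct words collapse onto one form.
    print("Masse".upper())    # 'MASSE'  ("mass")
    print("Maße".upper())     # 'MASSE'  ("measures"): the same uppercase form
    print("MASSE".lower())    # 'masse'  : converting back cannot recover
                              #            which of the two words it was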
Writing titles in capitals only is not always correct in all languages,
because it removes these letter differences (in addition to losing the
minuscule/capital distinction for proper names), and that's a good reason why
an alternate "capital-like" style was added later for writing minuscule
letters in titling, i.e. "small capitals", which are NOT capitals but an
alternate glyphic representation of linguistic minuscule letters, distinct
from capitals, which should remain reserved for limited cases.
It's true that we can often see texts written in "all capitals" style, but
this is not a good practice (it seems to work reliably in English, for
example, but not in other languages; and in any case it is difficult to read,
looks like SHOUTING, and makes accents and diacritics hard to distinguish, so
it should also be limited to short parts of texts).
For these reasons (and many others), case conversions should be used with
care: they are not recommended, and should absolutely be avoided when storing
texts, as they are lossy even if you implement them correctly to minimize the
semantic losses according to a reference language (assuming you effectively
know in which language the text is written, something that is not always
indicated and that you cannot easily infer).
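The language dependence is easy to demonstrate: a language-blind conversion
(which is all that Python's built-in lower() performs, as far as I know)
already damages Turkish, precisely because it does not know the language:

    # Default Unicode case conversion knows nothing about the text's language.
    # Turkish has two distinct pairs I/ı and İ/i, so a language-blind lower()
    # produces the wrong small letters:
    print("DİYARBAKIR".lower())   # 'di̇yarbakir', but Turkish expects 'diyarbakır':
                                  #   İ becomes 'i' + U+0307 (combining dot above)
                                  #   I becomes dotted 'i' instead of dotless 'ı'
    # A correct conversion needs a locale-tailored library (ICU, for example),
    # i.e. it needs to know the language of the text before converting.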
In other words, capital letters are NOT simply equivalent to lower case
letters. They are NOT stylistic glyph variants of the associated small
letters, even if they are closely related.