From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 28 2007 - 13:47:16 CDT
Dmitry Turin wrote:
> Sent: Friday, September 28, 2007 11:52
> To: unicode@unicode.org
> Subject: Re[4]: marks
>
> Philippe,
>
> >> (2) My proposal not only saves code points in the encoding table
> >> (which is important in itself), but also simplifies the comparison
> >> of the various spelling variants (all letters lower-case,
> >> first letter upper-case, all letters upper-case),
> >> because every comparison is reduced to a comparison in a single
> >> spelling variant (all letters lower-case).
>
> PV> There's nothing wasted in the
> PV> Unicode standard due to the encoding of capitals.
> +
> PV> case-insensitive searches, the
> PV> algorithms are extremely simple and fast in their implementation
> These algorithms are unnecessary in general.
Unnecessary?!?!?
These algorithms are used and implemented everywhere (at least in their most
basic form, handling only the Basic Latin subset, but that is still an
implementation of the algorithm). They are widely understood, found in almost
all applications, libraries and OSes that handle text data, written decades
ago, and still running in every computer today!
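To make that concrete, here is a minimal sketch of that decades-old Basic
Latin folding (in Python, just for the sake of example; the function names
are mine, not anything from a standard):

    def ascii_casefold(s):
        # Fold A-Z onto a-z; every other character passes through unchanged.
        # This is the most basic form of the algorithm discussed above.
        return ''.join(chr(ord(c) + 32) if 'A' <= c <= 'Z' else c for c in s)

    def equals_ignore_case(a, b):
        return ascii_casefold(a) == ascii_casefold(b)

    print(equals_ignore_case("Unicode", "uNiCoDe"))  # True
    print(equals_ignore_case("marks", "Marks!"))     # False

One table lookup or one arithmetic operation per character: it does not get
much simpler or faster than that.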
Really, you may want a revolution, but then you need to consider the huge
cost of the conversion, and deal with users who have always distinguished
capitals from small letters, including linguists who need that distinction
in their standardized orthographies, where there are even strict rules about
usage (not in all languages; in some, usage is quite liberal).
Then handle the tricky things that will appear in technical notations making
STRICT distinctions between lowercase and uppercase letters: think about the
Base64 representation of binary data, and what such re-encoding would mean
for encapsulating those strings of data in other protocols such as email and
networking protocols. Think about numeric parsers that handle plain
hexadecimal data and would have to parse an additional, unneeded symbol.
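Just to illustrate how strict that distinction is (a quick sketch, not part
of the proposal under discussion): flipping the case of Base64 characters
silently changes the decoded payload, because capitals and small letters
denote different 6-bit values:

    import base64

    print(base64.b64decode("TWFu"))          # b'Man'
    print(base64.b64decode("TWFu".lower()))  # b'\xb7\x07\xee' -- garbage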
Think about phonetic transcriptions that would no longer be searchable: if
you remove the distinction between small letters and capitals, you need to
parse the text contextually, looking for some prior "symbol" or control
somewhere at an unknown distance.
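A sketch of that search problem (the "#" encoding here is the hypothetical
one under discussion, not anything real): once capitals are carried by a
separate control, a plain substring match can no longer tell a small letter
from a capital without scanning its context:

    # Hypothetical encoding of "paRi" under the '#' proposal:
    text = "pa#ri"

    # A naive search for a lowercase 'r' returns a false positive: the 'r'
    # at index 3 is really a capital, and only a backward scan for a
    # possible control reveals it.
    print(text.find("r"))  # 3 -- wrong hit
    print(text[2] == "#")  # True -- only the context says it is a capital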
Think about those algorithms that try to extract substrings, including text
parsers used for linguistic analysis: what is the rule for inserting your
proposed control? How many controls will you need?
Think about concatenation with your notation: what is the result of "#o"
plus "#nu": "#o#nu" or "#onu"? Your notation introduces new, unexpected
equivalents that applications would need to recognize, instead of just
handling the concatenation of "O" plus "NU" as "ONU", from which it is
simple to extract substrings... Now think about the effect of word breakers,
line breakers, and layout rendering: what is the scope of your "#" control?
If that scope is to be unambiguous, the only safe choice is to limit it to
the next character only, so that you must always write "#o#n#u" and never
"#onu".
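Here is a sketch of that ambiguity (the decoder is one hypothetical reading
of your proposal, with the control scoped to the rest of the word): two
distinct encodings denote the same text, so every comparison needs a prior
normalization step, whereas with real capitals plain concatenation already
yields the canonical form:

    def decode_word_scope(s):
        # Hypothetical reading: '#' capitalizes the rest of the current word.
        out, upper = [], False
        for c in s:
            if c == "#":
                upper = True
            elif c.isalpha():
                out.append(c.upper() if upper else c)
            else:
                out.append(c)
                upper = False  # a word boundary ends the control's scope
        return "".join(out)

    # Two different encodings of the same word:
    print(decode_word_scope("#o" + "#nu"))  # 'ONU'
    print(decode_word_scope("#onu"))        # 'ONU'
    print("#o" + "#nu" == "#onu")           # False -- yet the same text

    # With ordinary capitals, concatenation is already unambiguous:
    print("O" + "NU" == "ONU")              # True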
Your proposal is also inconsistent: you propose two distinct controls for
encoding all-caps (I'll note it "*") and leading-cap (I'll note it "#", as
you did). This means that "#o#n#u" and "*onu" now encode the same text, with
the capitals encoded differently. Now extract the initial letter of both
strings: is it "#o" or "*o"? There is no way to decide; in both cases it is
the initial capital letter of the same word "Organisation"... And it is
illogical to encode the same capital letter in two different ways.
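A sketch of that inconsistency, with both hypothetical controls in play:

    all_caps    = "*onu"    # hypothetical '*': next word entirely in capitals
    leading_cap = "#o#n#u"  # hypothetical '#': next letter only in capitals

    # Both encode the word 'ONU', but extracting the initial capital 'O'
    # yields two different, unequal sequences:
    print(all_caps[:2])                     # '*o'
    print(leading_cap[:2])                  # '#o'
    print(all_caps[:2] == leading_cap[:2])  # False -- same letter, two codes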
In conclusion, the "*" proposal (next word in all capitals) is superfluous
and just complicates things. So only your "#" proposal (next letter in
capital) remains, which means you have re-encoded every existing individual
capital from "A" to "#a" and... doubled the size of texts written entirely
in capitals. What is the benefit, given that Unicode will still maintain
the encoding of all existing capital letters?
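A sketch of that size cost (encode_hash is my own illustration, mirroring
your "#" proposal):

    def encode_hash(text):
        # Hypothetical re-encoding: every capital becomes '#' + small letter.
        return "".join("#" + c.lower() if c.isupper() else c for c in text)

    print(encode_hash("UNICODE"))                       # '#u#n#i#c#o#d#e'
    print(len("UNICODE"), len(encode_hash("UNICODE")))  # 7 14 -- doubled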
Now suppose that Unicode accepted only your "#" proposal (the only one
producing consistent results for text algorithms, and whose effect is to
modify only the next letter). It would have to become a format control (in
Unicode terminology), usable separately and ignorable under some conditions;
but then what would "#1" or "#!" mean? Not every subsequent character is a
letter of a bicameral script!
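The problem is easy to see (again a sketch): "1" and "!" have no uppercase
form, so a letter-scoped control in front of them is inert, and an
implementation would have to pick one of several incompatible behaviors
(signal an error, drop the control, or preserve it):

    for ch in "1!a":
        print(repr(ch), "->", repr(ch.upper()))
    # '1' -> '1'   the control before it would be silently lost
    # '!' -> '!'   same
    # 'a' -> 'A'   only here does the control actually do anything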
Conclusion: your proposal has not solved any problem; it has just introduced
more complexity, breaking too many text-handling algorithms used in every
computer and in almost all text-handling applications, and even in many
widely used, standardized networking protocols (you would break
interoperability everywhere; you could even say goodbye to the Internet,
with so many protocols to fix). Such a proposal is not worth it, given the
huge adaptation task it would impose on everyone else.
But then, if your encoding is just optional (meaning that a capital A could
continue to be encoded as "A", or optionally as "#a"), what is the interest
of such a change, except locally within your own applications? If you need
such a transform for your local search algorithm, then transform texts
locally in your system, re-encoding them with your own control (you can do
that using a PUA character), and look at the new caveats that such a
conversion implies: database size, data-field length constraints, and
interoperability with the rest of the world, because you will need constant
conversions between your local encoding scheme and everyone else's.
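A minimal sketch of such a local transform (the PUA code point and the
function names are my own illustration, not any standard):

    PUA_CAP = "\uE000"  # arbitrary private-use code point acting as the control

    def to_local(text):
        # Re-encode capitals as PUA control + small letter, for local search.
        return "".join(PUA_CAP + c.lower() if c.isupper() else c for c in text)

    def from_local(text):
        out, i = [], 0
        while i < len(text):
            if text[i] == PUA_CAP and i + 1 < len(text):
                out.append(text[i + 1].upper())
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    s = "Organisation"
    assert from_local(to_local(s)) == s
    # One of the caveats mentioned above: the local form is longer
    # (think of data-field length constraints):
    print(len(s), len(to_local(s)))  # 12 13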