From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 20 2011 - 17:00:09 CDT
Christoph Päper <christoph.paeper@crissov.de> wrote:
> Doug Ewell:
>>
>> Text editing and processing with combining marks is not "very difficult and erroneous."
>
> The biggest problem with precomposed versus combined characters in text editors and word processors is that they are in fact treated differently.
>
> Input:
>
> Some accented letters are found on keys of their own on relevant national keyboard variants.
> Others can easily be produced by a combination of base letter and dead-key diacritic mark, although they have to be pressed in a different order than they are coded.
> Finally, some accented letters need a special kind of assistive input system, often visual character maps (though these are often ordered in a not very helpful way, i.e. by Unicode position).
It does not matter how they are entered. The purpose of the input
method is to generate whatever correctly encoded sequence of
characters is appropriate for representing the selected character. It
does not matter whether the input method generates one character or
several.
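For illustration, a minimal Python sketch (using only the standard
unicodedata module) of why the one-or-several question does not
matter once the text is normalized; the literals are just an example
with "é":

    import unicodedata

    # An input method may emit the precomposed character, or the base
    # letter followed by a combining mark; both encode the same "é".
    precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"    # "e" + COMBINING ACUTE ACCENT

    assert precomposed != decomposed   # different code point sequences
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed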
But of course, a simple input method that just presents a character
map where some combinations of letters plus diacritics cannot be
found or even generated is a software deficiency, not a problem of
the encoding in the UCS itself. Such software can always be enhanced
to match what users want to input and see. Sometimes this involves
not only the keyboard driver or input method editor, but also the way
the software handles existing documents when editing or correcting
them, as you've noted below:
> It might be useful if computers offered their users a standard way to access and change diacritics on base letters, no matter how they were entered in the first place or how they are encoded. For instance, I could write “resume”, hit one special key, e.g. ‘^’, and get an inline drop-down list to change the ‘e’ to ‘é’ (because that is a variant of the word in this instance that was found in the dictionary) or ‘è’, ‘ê’ etc. (shown in a standard fixed order by frequency / probability).
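A rough Python sketch of how such a drop-down could gather its
candidates; the accent_variants helper is purely illustrative, and a
real implementation would also consult the dictionary mentioned
above:

    import sys
    import unicodedata

    def accent_variants(base):
        """List precomposed letters whose canonical decomposition is
        the given base letter followed only by combining marks (a
        rough candidate list for such a drop-down)."""
        variants = []
        for cp in range(sys.maxunicode + 1):
            if 0xD800 <= cp <= 0xDFFF:    # skip surrogate code points
                continue
            decomp = unicodedata.normalize("NFD", chr(cp))
            if (len(decomp) > 1 and decomp[0] == base
                    and all(unicodedata.combining(c) for c in decomp[1:])):
                variants.append(chr(cp))
        return variants

    print(accent_variants("e"))   # ['è', 'é', 'ê', 'ë', 'ē', 'ĕ', ...]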
>
> Delete:
>
> The backspace (leftwards delete key) and (rightwards) delete keys should always delete one visual entity perceived as a single character by users, i.e. a combination of base letter and accent(s).
I fully agree there. The normal behaviour in editors is to use the
simplest editing method, working on the default grapheme clusters
(though it should be noted that this level is too coarse for users
working in languages where diacritics are optional or added as
supplementary notations, such as Hebrew and Arabic, at least for a
large subset of the diacritics used in those scripts, or for users
working with Indic abugidas, because they still spell at least the
vowels distinctly).
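A minimal sketch of what a grapheme-cluster-aware Backspace could do,
assuming a simplified view where a cluster is a base character plus
its trailing combining marks (the real rules are in UAX #29):

    import unicodedata

    def backspace(text):
        """Delete one user-perceived character from the end of the
        text: the final base character plus the combining marks
        encoded after it (simplified; a real editor would follow the
        full grapheme cluster rules of UAX #29)."""
        if not text:
            return text
        i = len(text) - 1
        while i > 0 and unicodedata.combining(text[i]):
            i -= 1    # skip back over trailing combining marks
        return text[:i]

    print(backspace("re\u0301sume\u0301"))   # removes the whole final "é" -> "résum"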
> The software could offer a key combination to free selected or adjacent base letters of all their diacritics, though, e.g. [Ctrl+Shift+Del/BS].
As long as this remains an advanced editing feature, notably one not
needed for entering text correctly the first time and still allowing
normal corrections, this will be fine (it would typically be used
when handling files that were incorrectly encoded in the first place,
using unsuitable input editors, or incorrectly generated by poor
software). But for this mode, I would largely prefer another, more
technical graphical presentation, where you would in fact inspect
each character, and all grapheme clusters would be broken into their
individual parts, including visible controls. This type of rendering
would be mostly for debuggers or for data analysis and parsing, i.e.
mainly for software developers, not for the most frequent uses by
most people.
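Internally, the proposed strip-diacritics command could be as simple
as the following sketch (the helper name is illustrative, and it
leaves alone letters such as 'ø' that have no canonical
decomposition):

    import unicodedata

    def strip_diacritics(text):
        """Decompose to NFD, drop the combining marks, recompose to NFC."""
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(c for c in decomposed if not unicodedata.combining(c))
        return unicodedata.normalize("NFC", kept)

    print(strip_diacritics("résumé"))   # -> "resume"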
Such adaptation is in fact not a problem of encoding, translation or
internationalization, but part of the work that developers must do
for the localization of their software according to users'
expectations. There will never be a perfect encoding for every use.
> Storage:
>
> I believe it would help if input immediately was transformed to and text was saved in NFD, because this would make the need for uniform treatment more obvious.
>
> It would be cool if there was an ASCII-compatible encoding with variable length like UTF-8 that supported only NFD (or NFKD) and was optimized for a small storage footprint, e.g. from U+00C0–017F only a handful would have to be coded separately. Sadly, though, it is unrealistic to have a unique single byte code for each combining diacritic, because there are so many of them: even just ranges U+0300–036F and U+1DC0–1DFF are 176 positions together, although some are still unassigned; that is more than you can encode with 7 bits or less.
The most common software practice has long been to use the NFC form.
NFD is just for some internal technical uses, and in fact it is no
longer justified given the way most software now communicates across
heterogeneous systems.
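As a small check of what the difference amounts to in practice,
assuming UTF-8 storage (the word chosen is arbitrary):

    import unicodedata

    word = "déjà"
    nfc = unicodedata.normalize("NFC", word)
    nfd = unicodedata.normalize("NFD", word)

    print(len(nfc), len(nfc.encode("utf-8")))   # 4 code points, 6 UTF-8 bytes
    print(len(nfd), len(nfd.encode("utf-8")))   # 6 code points, 8 UTF-8 bytes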
Forget NFKD (and NFKC) completely. They are definitely not for text
input or editing (and probably not even for rendering), but are only
needed as a compatibility layer across interfaces with old software
modules (most of them not Unicode-aware), notably as a helper for
transcoding purposes, to find a few possible fallbacks.
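A quick illustration of why NFKD/NFKC belong only to that fallback
layer: the compatibility decompositions are lossy:

    import unicodedata

    # Compatibility decompositions are fallbacks, not faithful text:
    print(unicodedata.normalize("NFKD", "\uFB01"))    # "ﬁ" ligature -> "fi"
    print(unicodedata.normalize("NFKD", "x\u00B2"))   # superscript  -> "x2"
    print(unicodedata.normalize("NFKD", "\u2116"))    # "№"          -> "No"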
>> The one use case that Plamen mentioned (a user manually deleting a base letter) is easily trained.
>
> Changing people is harder than changing software, in general.
And I don't see why a single keystroke on the Backspace key should
not delete the same thing as a single keystroke on the Delete key,
especially if deleting only part of a cluster would cause two
separate grapheme clusters to suddenly be joined into a single one
under the normal text rendering, where grapheme clusters (and all
other joining types or ligatures) are rendered as a whole. For normal
use, if you delete any base letter, you also have to delete the
diacritics encoded after it. The same is true for mouse and keyboard
selections and for normal navigation in the text (using arrow keys,
possibly with modifier keys).
And the editor should work and behave equivalently whether the text
in the background working buffers is encoded in NFC form, NFD form,
or any other canonically equivalent, non-normalized form.
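For example, a search function in the editor could normalize both
sides before comparing, so the buffer form never matters (the helper
below is only a sketch; a real editor would map the match back to
buffer positions):

    import unicodedata

    def find_equivalent(haystack, needle):
        """Find `needle` in `haystack` regardless of which canonically
        equivalent form either string happens to use in the buffer."""
        return unicodedata.normalize("NFC", haystack).find(
            unicodedata.normalize("NFC", needle))

    # Buffer in NFD, search string typed with a precomposed character:
    print(find_equivalent("noe\u0308l arrive", "no\u00EBl"))   # -> 0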
But for lots of reasons, editors that save output for heterogeneous
environments should all offer an option to normalize the whole text
when saving (most probably NFC by default; NFD is, once again, for
some technical interfaces, but those same interfaces can implement
the conversion to NFD themselves if they really depend on it), simply
because it will work with much Unicode-unaware legacy software.
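A minimal sketch of such a normalize-on-save option (the function and
option names are only illustrative):

    import unicodedata

    def save_text(path, text, form="NFC"):
        """Normalize the whole text on save, NFC by default."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(unicodedata.normalize(form, text))

    save_text("out.txt", "re\u0301sume\u0301")   # written as precomposed "résumé"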
The size of the encoded data is no longer much of an issue. Storage
today is cheap, bandwidth keeps getting cheaper too, and
general-purpose compression schemes are now used so efficiently in so
many domains that compression often happens transparently, without
significant performance cost or additional security risks (when it
uses standard open algorithms that have long been applied to gigantic
amounts of data worldwide): it just works very well. That is why, for
example, UTF-8 was so widely adopted even though, on the surface, it
is a bit less efficient than many legacy encodings, which were hardly
interoperable or stable.
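A toy comparison, on an artificial repetitive sample chosen only to
make the point, of how a general-purpose compressor absorbs most of
the difference in raw UTF-8 size:

    import unicodedata
    import zlib

    sample = "déjà vu, résumé, naïveté, crème brûlée. " * 1000
    nfc = unicodedata.normalize("NFC", sample).encode("utf-8")
    nfd = unicodedata.normalize("NFD", sample).encode("utf-8")

    print(len(nfc), len(zlib.compress(nfc)))   # the raw sizes differ...
    print(len(nfd), len(zlib.compress(nfd)))   # ...the compressed sizes barely do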
-- Philippe.
This archive was generated by hypermail 2.1.5 : Fri May 20 2011 - 17:04:36 CDT