From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Apr 10 2006 - 11:22:59 CST
Hello,
Tay, William had asked:
> Can accented characters be decomposed in other encodings, e.g. ISO
> 8859-1, as well?
Among other codes, I had mentioned ISO 6937:
> ISO 6937 has been an approach to large character sets by heavy
> use of composition. Quote from ISO 6937/2-1983:
> > Each accented letter or umlaut is represented by a sequence
> > of bit combinations consisting of the coded representation
> > of the relevant non-spacing diacritical mark [...], followed
> > by the coded representation of the relevant basic Latin letter
> > [...]
More specifically, this was from section 4.4 "Coded representations",
subsection a "Accented letters and umlauts".
Now, Kent Karlsson has written:
> That text is at best misleading; I'd say it's completely wrong.
> In actual fact, ISO/IEC 6937 does not encode any combining
> characters, absolutely NONE whatsoever. Nor does it rely at all
> on any kind of composition.
I have quoted from the 1983 version of that standard. I have no
easy access to its 1994 and 2001 versions, so the parts that
I have quoted may or may not have been superseded. If Kent
Karlsson can quote the essential clauses from the current (2001)
version that invalidate my old version, I will be glad to learn
that the gist of that standard has been completely changed within
two revisions.
Definition from ISO 6937/1-1983:
> 3.19 composite graphic symbol: A graphic symbol consisting of a
> combination of two or more other graphic symbols in a single
> character position, such as a diacritical mark and a basic letter,
> for example ä.
So, that version clearly conveys the notion of combining diacritical
marks and base characters. This is exactly what William Tay had asked
about; so I think it was important to mention that standard. Kent,
thank you for reminding us of ISO 646 as well, which I had forgotten
to mention.
Kent Karlsson also has written:
> But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
> character.
I cannot understand the distinction Kent draws between a "non-spacing
diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
character". It is just a technical detail, whether the base character
is encoded first (as in Unicode), or last (as in ISO 6937).
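To make the ordering difference concrete, here is a minimal Python sketch.
The function names are made up for illustration, and the ISO 6937 lead-byte
values in the table are an assumption taken from commonly published code
charts, not from the 1983 text quoted above; please check them against the
standard before relying on them.

```python
import unicodedata

# ASSUMPTION: lead-byte values for a few ISO 6937 non-spacing diacritics,
# as given in commonly published code charts (not verified against the
# standard itself). Keys are the corresponding Unicode combining marks.
ISO6937_DIACRITIC = {
    "\u0300": 0xC1,  # grave accent
    "\u0301": 0xC2,  # acute accent
    "\u0302": 0xC3,  # circumflex accent
    "\u0308": 0xC8,  # diaeresis / umlaut
}


def unicode_order(ch):
    """Code points of the canonical decomposition: base letter, then mark."""
    return [ord(c) for c in unicodedata.normalize("NFD", ch)]


def iso6937_order(ch):
    """Two-byte sequence in ISO 6937 order: diacritic, then base letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) != 2:
        raise ValueError("expected exactly one base letter plus one mark")
    base, mark = decomposed
    if mark not in ISO6937_DIACRITIC or ord(base) > 0x7F:
        raise ValueError("combination not covered by this sketch")
    return bytes([ISO6937_DIACRITIC[mark], ord(base)])


if __name__ == "__main__":
    a_umlaut = "\u00e4"
    print([hex(cp) for cp in unicode_order(a_umlaut)])  # base first
    print(iso6937_order(a_umlaut).hex())                # diacritic first
```

Both functions carry the same two pieces of information (which mark, which
base letter); only the order of the two differs.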
> [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
> 8th bit set) sort of indicate the accent of the character (but that
> does not always hold true) and the trail byte (if a double-byte code)
> indicates the base character (except when the trail byte is the one
> for space).
The essential difference between ISO 6937 and Unicode is that
ISO 6937 defines a closed inventory of combined characters, while
Unicode allows arbitrary combinations. (This reflects the display
technology available at the respective times of origin.)
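The closed-versus-open distinction can be sketched in a few lines of
Python. The CLOSED_INVENTORY set below is a hypothetical, deliberately tiny
stand-in for a fixed repertoire; it is not the actual ISO 6937 list. The
Unicode side uses only the standard unicodedata module.

```python
import unicodedata

# HYPOTHETICAL stand-in for a closed repertoire of (base, mark) pairs.
# The real ISO 6937 inventory is much larger; these entries are only
# for illustration.
CLOSED_INVENTORY = {
    ("a", "\u0308"),  # a + diaeresis
    ("o", "\u0308"),  # o + diaeresis
    ("u", "\u0308"),  # u + diaeresis
    ("e", "\u0301"),  # e + acute
}


def allowed_in_closed_set(base, mark):
    """A closed inventory accepts only the pairs it lists explicitly."""
    return (base, mark) in CLOSED_INVENTORY


def unicode_compose(base, mark):
    """Unicode accepts any base + combining mark; NFC merges the pair into
    a precomposed character only where one exists."""
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    diaeresis = "\u0308"
    for base in ("a", "q"):
        composed = unicode_compose(base, diaeresis)
        print(base, allowed_in_closed_set(base, diaeresis), len(composed))
```

With a diaeresis, "a" is in the sample inventory and also has a precomposed
Unicode character, whereas "q" is rejected by the closed set yet remains a
perfectly valid two-code-point sequence in Unicode.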
Now it just so happens that all compositions in ISO 6937/2 comprise
only one diacritic (plus one base character, of course), which makes
ISO 6937/2 appear similar to a multibyte coded character set. However,
the intent apparently was a composition of one or several diacritics
with a base character (cf. definition 3.19, quoted supra); it is only
that the original plans to encode characters for more languages (which
may carry more than one diacritical mark) were never realized.
Best wishes,
Otto Stolz