From: ojarnef
To: Edwin Hart; Alan Griffee
Cc: iso10646; ojarnef
Subject: Comments on WD "An operational model for characters and glyphs"
Date: Sunday, December 01, 1996 01:02


The following are my personal comments on the ISO document

> ISO/IEC JTC 1/SC 2 N2746

> 18 August, 1996

> Title: August 1996 working draft of TR 15285, "An operational model for
> characters and glyphs"
> Source: Edwin Hart and Alan Griffee (acting editors)
> Status: Contribution by the acting editors
> Action: For review and comments by 30 November, 1996
> Distribution: ISO/IEC JTC 1/SC 2, SC 2 liaisons, and SC 18

Digression about Internet availability of the draft:

   That draft is available in MS Word format from
   < ftp://ftp.jhuapl.edu/pub/cgmodel/cgm9608.doc ;type=i>
   and in PostScript format
   < ftp://ftp.jhuapl.edu/pub/cgmodel/cgm9608.ps ;type=i>.
   (Other formats can also be found there.)
   
   I have prepared a plain text version -- lacking non-ASCII
   characters and diagrams -- from which I include quotes
   in this message. It can be accessed at
   < ftp://ftp.admin.kth.se/pub/misc/ucs/sc2-n2746.txt ;type=a>
   Metadata about this file are in
   < ftp://ftp.admin.kth.se/pub/misc/ucs/sc2-n2746.txt.M ;type=a>

End of digression.

Digression about formalities, skip this if you're only
interested in the technical comments:

   Unfortunately I haven't had time to let the Swedish
   standardization body SIS-ITS review my comments.
   Therefore I send them directly to the document editors
   by email. If it's deemed appropriate, I can convert the
   following comments to a paper document as an "expert
   contribution" to SC2.
   
   I also send this message to the <iso10646@listproc.hcf.jhu.edu>
   discussion list in the hope of generating some public
   discussion (the traffic to that list has recently
   mostly been about other standards than 10646). I don't
   crosspost to the <unicode@Unicode.ORG> list because I
   regard the character-glyph model as primarily a matter
   for open standardization work rather than for a
   semi-closed consortium. I don't crosspost to the
   <sc2@dkuug.dk> list because it seems to have a
   semi-official character and I'm not sure a discussion
   not confined to national body representatives to SC2
   is welcomed there.

End of digression.

In the following, lines starting with ">" are quoted from
the draft technical report. I have replaced non-ASCII
characters with the marker "!?".

Lines starting with "/" are my suggestions for new text to
replace text in the draft.

Lines starting with "+" are my suggestions for new text to
be added to the draft.

> Introduction
> ------------

> People recognize and process characters[1] by their shapes.
> Thus, people normally closely associate a character[2] and
> its shape.  Information technology, in contrast, makes
> distinctions between the concepts of a character's[3]
> meaning (the "character"[4]) and its shape (the "glyph").
> The close association people make between characters[5] and
> glyphs, and the distinction made by information
> technology have produced a conflict that has led to
> misunderstanding and confusion.

The third sentence talks about a character's "meaning".
But normally individual characters don't have a definite
meaning, not in the way individual words of a language
have meaning.

What _is_ common for all specimens of a certain character
then? Take the letter "d" as an example. What's common to
all individual d's is, in my view, that they (and they
only) can fulfil the same _function_ when writing words
which are spelt with this letter.

Individual characters are not "elements of meaning", I
would say, but elements of meaningful written linguistic
expressions, particularly words. The distinction between
sign and meaning is fundamental to semantics. To me
it's clear that letters belong to the sign side of this
dichotomy, not the meaning side. And the same is true for
digits. The digit "1" can't be identified with the
meaning "the least positive integer". Often it means the
number one, but in "123" it means the number 100, and in
other contexts it may have no relation at all to
numerical quantities. In my opinion these observations
generalize to almost all graphic characters of ISO 10646.

Digression about how to define the idealized concept of
(graphic) character:

   It's true, however, that meaning plays an important
   role in the demarcation of different characters, i.e.
   in any definition of the idealized character concept.
   The draft rightly emphasizes that the abstraction from
   concrete marks on e.g. a paper to abstract _glyphs_
   ideally should be based only on consideration of
   geometrical shapes. Shape is important also in the
   abstraction from concrete marks to _characters_,
   though only indirectly. Meaning is the important
   consideration, and I offer the following attempted
   definition to show how:

   A character is a mathematical set of physical marks
   such that any of them can be substituted for any other
   without changing the meaning of the text where it
   occurs.

   (Here I ignore complications not relevant to the scope
   of the technical report, such as the atomicity of
   characters and the dependence of some existing
   character distinctions upon the writing system used.
   Therefore the modest label "attempted definition".)

   Three things should be noted:

   1) Indirectly the _shape_ of concrete marks is
      important for which characters they are
      realizations of, because shape distinctions are
      essential for the ability of humans to discern
      contrasts of meaning between similar text pieces.

   2) This definition isn't my free invention. It's
      actually equivalent to standard definitions in
      linguistics of the concept of _grapheme_.

   3) It defines an idealized character concept. The
      _pragmatic_ character concept should be defined as
      "any entity coded in a coded character set
      standard" (who knows what kinds of things might
      have been included in some coded character set
      defined somewhere by some crazy engineers? or will
      be in the future). This definition needn't be
      circular. A "coded character set standard" can be
      defined as a standard that characterizes itself
      with the expression "coded character set standard"
      or an equivalent label.

End of digression.

To return to the text quoted above, it uses the word
"character" in two or possibly three different senses,
which unfortunately can add to the very confusion it
describes:

-- In the occurrences 1, 2, and 5 the word "character"
   means some abstraction of physical shapes used in
   writing.

-- In occurrence 3 it means a _combination_ of a meaning
   (whatever that may be) and a physical shape (probably
   not an individual physical shape on e.g. a certain
   piece of paper, but an "abstract" shape).

-- In occurrence 4 it seems to mean some kind of meaning,
   not a thing that _has_ a meaning.

This double/triple use of one word also accounts for the
paradoxical wording about the "character"[4] being an
aspect of the character[3].

A replacement for the quoted text could be something like
this:

/ In all reading and writing of text people recognize the
/ individual physical marks read or produced on the
/ writing surface as different realizations of abstract
/ letters, ideographs, digits, symbols, and other
/ characters. The digital representation of these
/ entities is the main task of SC2 standards for coded
/ character sets. Another kind of abstract entity
/ related to the physical marks of concrete text, the glyph,
/ is central to SC18 standardization of font technology.
/ The relation between the two concepts of character and
/ glyph, which are easy to confuse, is the subject of
/ this technical report.

The text in the draft continues:

> The successful
> promulgation and implementation of character coding, text
> editing, presentation and publication standards require
> an understanding of the appropriate use of character
> codes and glyph identifiers.

I don't think it's necessary to introduce the technical
notions of "character code" and "glyph identifier" at
this early point in the report. Furthermore, it should be
mentioned that in certain kinds of simple data
processing, the distinction between character and glyph
isn't needed. Proposed new text:

/ The successful promulgation and implementation of
/ character coding, text editing, presentation and
/ publication standards require an understanding of the
/ distinction between characters and glyphs, except for
/ those simple applications where it is acceptable that
/ the same glyph is always used for the same character.

> 4.  Character and glyph distinctions
> ====================================
> 
> The character and glyph definitions in clause 3, which
> were taken from ISO/IEC 10646 and ISO/IEC 9541, were
> developed independently and contain terminology that
> requires harmonization and explanation.

"Harmonization" of two terminologies, as distinct from
mere explanation, to me suggests that definitions of some
terms are changed or new terms are introduced with new
definitions. Is that part of the purpose of this
technical report?

> In information technology, characters are abstract
> information elements in the domain of coding for data
> interchange.

This is a statement about information technology in
general, not restricted to coded character set standards.
Therefore I believe it's more correct to write "... the
domain of coding for data representation, particularly
data interchange". Much text stored in a computer never
leaves the local system, it's never interchanged, and
still it is coded according to coded character set
standards.

> Coded character set standards assign
> numeric values, character names (descriptive text), and
> representative (sample) images to each character
> contained in a coded character set.

The significance of the parenthesis in "character names
(descriptive text)" is unclear. I think it would be
better to leave it out and instead add the sentence:

+ Typically a character is given a multi-word name which
+ also serves as an adequate description of the
+ character, making it clear how it differs from the
+ other characters of the coded character set.

> The precise
> semantics and appearance of the information elements in
> any given implementation are not defined by those coded
> character set standards.

As I explained above I don't think that characters have
any semantics (meanings). What I think is the important
thing to say here is that coded character set standards
don't include explicit criteria for drawing the line
between similar but distinct characters.

Possible new formulation:

/ Criteria for the demarcation between nearly related
/ characters, to aid decisions about which characters to
/ choose for representing a particular text, are not
/ included in those coded character set standards, other
/ than the guidance given by the character name and one
/ concrete example of the character.

> The ISO/IEC 10646
> standard recognizes the distinction between characters
> and their visual representation by defining the term
> "graphic symbol".  The "graphic symbols" of SC 2
> standards and the "glyphs" of SC 18 standards represent
> equivalent concepts.

Is this really true? As I read the SC2 definition of
"graphic symbol"

> 3.12  graphic symbol : The visual representation of a
> graphic character or of a composite sequence. (ISO/IEC
> 10646-1: 1993).  [See the definition of "glyph".]

it may very well be interpreted to refer to the
_concrete_ physical mark used to represent a character on
a particular paper or on a screen at a certain point of
time. The SC18 concept of "glyph" is an _abstract_ image,
it is abstracted in some other way than characters are,
and I have never seen any discussions about this
alternative abstraction process in SC2 contexts. I thus
believe that the "concrete" interpretation of the SC2
concept of "graphic symbol" that I have formulated here
is more plausible. In that case "graphic symbol" should
be equated with the SC18 concept of "glyph image", not
"glyph" (although the SC2 concept has wider applicability
than the SC18 concept, being relevant also for
hand-written text).

> The historical association of characters and glyphs has
> resulted in character sets maintaining distinctions that
> cannot be founded on distinctions in content, but only
> distinctions in form; similarly, the glyph registration
> authority and the SC 18 font resource model have made use
> of criteria based on content to abstract potential
> distinctions in form.

This is a very important point. It may be obscured for
many readers by the use of the notoriously ambiguous
words "form" and "content". I would prefer a wording such
as the following:

/ The historical association of characters and glyphs has
/ resulted in character sets maintaining distinctions
/ that cannot be motivated by the capacity of the
/ distinguished characters to cause a contrast in meaning
/ in a text. Exchanging them for each other will only
/ change the appearance of the text. Similarly, the glyph
/ registration authority and the SC 18 font resource
/ model have made use of criteria based on meaning, not
/ shape, to abstract distinctions between glyphs.

> For example, in ISO/IEC
> 10646-1, SC 2 coded the glyph FB03 LATIN SMALL LIGATURE
> FFI "!?" for round-trip integrity with other standards.
> (See B.4 The "round-trip rule" on page 13.).

I would prefer a simpler example than this, which
involves a compatibility character that is equivalent
to a _sequence_ of "genuine" characters, not a single
character.  The preceding text doesn't mention this
complication. Why not use FF21 FULLWIDTH LATIN CAPITAL
LETTER A and 0041 LATIN CAPITAL LETTER A as an example?
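
The difference between the two candidate examples can be sketched in
Python. This is only an illustration under an assumption: it reads the
compatibility mappings from the `unicodedata` module of the Python
standard library (data from the Unicode Character Database), not from
the draft or from ISO 10646 itself. The helper name `compat_mapping` is
hypothetical.

```python
import unicodedata

def compat_mapping(ch):
    """Return (tag, target string) for a compatibility mapping, else None."""
    field = unicodedata.decomposition(ch)
    if not field.startswith("<"):
        return None  # no mapping, or a canonical (untagged) one
    tag, *codes = field.split()
    return tag, "".join(chr(int(c, 16)) for c in codes)

fullwidth_a = compat_mapping("\uFF21")   # FULLWIDTH LATIN CAPITAL LETTER A
ligature_ffi = compat_mapping("\uFB03")  # LATIN SMALL LIGATURE FFI
print(fullwidth_a, ligature_ffi)
```

FF21 maps to the single character 0041, while FB03 maps to a sequence
of three characters, which is the complication the simpler example
avoids.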

> Also, the
> SC 18 Registration Authority (AFII) for ISO/IEC 10036
> could have registered the same glyph identifier for the
> "!?" glyph and used it for both the 212B ANGSTROM SIGN
> "!?" character and the 00C5 LATIN CAPITAL LETTER A WITH
> RING ABOVE "!?" character.  However, AFII instead
> registered two glyph identifiers.

This is a needlessly confusing example, since it involves
also a false distinction between _characters_ in UCS.
212B and 00C5 are different characters only because they
are included as such in some coded character set
standard, viz. ISO 10646. (00C5 is a genuine character,
212B is a compatibility character.)
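
A minimal sketch of how this pair is treated in later character
property data, assuming the `unicodedata` module of the Python standard
library (Unicode Character Database data, not part of the draft). The
variable names are only illustrative.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# An untagged decomposition field means the data record 212B as a
# plain duplicate of 00C5, not as a variant with a distinct form.
mapping = unicodedata.decomposition(angstrom)

# Normalizing to form NFC replaces 212B by 00C5.
same_after_nfc = unicodedata.normalize("NFC", angstrom) == a_ring
print(mapping, same_after_nfc)
```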

It's as absurd to regard these as different characters as
it is to say that in the sentence

   The speed of light in vacuum is exactly
   299792458 m/s.

the "metre symbol" in "m/s" is another character than the
ordinary letter "m" in "vacuum".

(Can anybody clarify if some earlier standard also made
the distinction between 212B and 00C5? That would at
least motivate their inclusion into UCS by the round-trip
rule.)

This particular false character distinction has to do
with treating the same character as different characters
depending on what _function_ it fulfils (letter in a
word, or symbol), not with confusing glyph distinctions
with character distinctions. It therefore falls outside
the scope of this technical report, and this example
should be removed, both here and in section E.1.

Furthermore, this example was supposed to show problems
with SC18 _glyph_ distinctions, not SC2 character
distinctions. A better example is needed. I don't have
access to the glyph registry, unfortunately, but I
suspect that different glyphs have been registered for
Latin capital A, Cyrillic capital A, and Greek capital
alpha. If that's the case, it would provide an excellent
example for this place in the technical report.

> Within the realm of information technology, an ideal
> characterization of characters and glyphs and their
> relationship may be stated as follows:
> ...
> -- One or more characters may be depicted by no, one, or
> multiple glyph representations (instances of an abstract
> glyph) in a way that may depend on the context.
> 
> The relationship between coded characters and glyph
> identifiers may be one-to-one, one-to-many, many-to-one,
> or many-to-many.  In its fully general form, it is a
> context-sensitive M-to-N mapping where M > 0, N >= 0.

Unfortunately, this is too simple a picture. We actually
have two different kinds of multiplicity:

1) Some characters can be realized by a combination of 2
   or more glyphs, such as the 0132 LATIN CAPITAL
   LIGATURE IJ.

2) Other characters can be realized by different single
   glyphs in the same font, depending on the context,
   such as the four different glyphs needed for each
   Arabic letter in row 06 of UCS, depending on the
   character's position at the beginning, in the middle,
   or at the end of a word, or in isolation.

In my opinion the text of the technical report should
mention this complication.
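
The two kinds of multiplicity can be sketched as a toy glyph selection
function in Python. Everything here is an assumption made for
illustration: the glyph names are invented (they are not taken from the
AFII registry), and a real composition and layout process would work
out the position itself rather than receive it as a parameter.

```python
# Kind 1: one character realized by a combination of glyphs.
COMBINATIONS = {"\u0132": ["I", "J"]}  # LATIN CAPITAL LIGATURE IJ

# Kind 2: one character realized by one of several single glyphs,
# chosen by context. Hypothetical glyph names for ARABIC LETTER BEH.
POSITIONAL = {
    "\u0628": {
        "isolated": "beh.isol",
        "initial": "beh.init",
        "medial": "beh.medi",
        "final": "beh.fina",
    }
}

def select_glyphs(char, position="isolated"):
    """Return the list of glyph names used to depict one character."""
    if char in COMBINATIONS:
        return list(COMBINATIONS[char])
    if char in POSITIONAL:
        return [POSITIONAL[char][position]]
    return [f"uni{ord(char):04X}"]  # simple one-to-one case

print(select_glyphs("\u0132"), select_glyphs("\u0628", "final"))
```

The first table changes the number of glyphs; the second changes which
single glyph is chosen. An "M-to-N mapping" describes only the first.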

> (For some characters in ISO/IEC 10646-1, no glyph can be
> defined, for example, the ZERO WIDTH NO-BREAK SPACE.)

I would say that ZERO WIDTH NO-BREAK SPACE and all the
characters in the ranges 200B - 200F, 2028 - 202E, and
206A - 206F are not _graphic_ characters but _control_
characters: They don't correspond to any glyph, not
even some amount of white space. On the other hand,
they have various other useful effects on the
organization or control of data. This is similar to
the roles played by the control functions of ISO 6429.
And SC2 doesn't recognize any third category of
characters, besides graphic characters and control
characters.
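
As a rough cross-check, the general categories of these code positions
can be listed in Python. This assumes the `unicodedata` module of the
Python standard library, whose categories come from the Unicode
Character Database; they are not SC2's classification, and the helper
name `general_categories` is hypothetical.

```python
import unicodedata

def general_categories(first, last):
    """Map each code position in first..last to its general category."""
    return {f"{cp:04X}": unicodedata.category(chr(cp))
            for cp in range(first, last + 1)}

cats = {}
for first, last in [(0x200B, 0x200F), (0x2028, 0x202E),
                    (0x206A, 0x206F), (0xFEFF, 0xFEFF)]:
    cats.update(general_categories(first, last))

# True if none is a letter (L), number (N), punctuation (P) or symbol (S).
no_visible_class = all(c[0] not in "LNPS" for c in cats.values())
print(cats, no_visible_class)
```

In that data none of these positions is classed as a letter, number,
punctuation mark, or symbol.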

This makes ZERO WIDTH NO-BREAK SPACE fall outside the
scope of the technical report, I think. It would be an
improvement to include a paragraph early in section 4
that explains the difference between graphic characters
and control characters and states that the report is
only concerned with graphic characters, using the word
"character" as an abbreviation of "graphic character".

> This is particularly true for ISO/IEC 10646
> implementation level 3, which uses combining characters.

This sentence is probably difficult to understand for
many readers. It could either be removed or expanded to
a full paragraph, describing the particular
complications with font support for combining
characters.
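
What the draft sentence presumably alludes to can be shown with a short
Python sketch: a composite sequence of two coded characters that a font
would normally depict with one glyph. It assumes the `unicodedata`
module of the Python standard library; the normalization step stands in
for the character-to-glyph mapping and is not itself a glyph operation.

```python
import unicodedata

# LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE
composite = "A\u030A"

# Composing the sequence yields the single character 00C5.
precomposed = unicodedata.normalize("NFC", composite)
print(len(composite), len(precomposed), f"{ord(precomposed):04X}")
```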

> 5.2.  Composition, layout, and presentation
> -------------------------------------------
>
> The composition and layout process spans both processing
> domains.  See Figure 2.

I suppose the concepts "composition" and "layout" have
well-defined SC18 meanings. Spontaneously I myself
think of composition as the process of creating
new text or data, normally performed by a human user
(entering data or editing text). But it's clear from
Figure 2 that composition as the word is used here is
something else, needed for the output of text, probably
a fully automatic process. Perhaps it would be possible
to include definitions of these terms in section 3, or
at least include a discussion in section 5.2 of their
meanings as used here?

> Glyph selection is the process of selecting (possibly
> through several iterations) the most appropriate glyph
> identifier or combination of glyph identifiers to render
> a coded character or composite sequence of coded
> characters. Coded characters and their associated
> implicit or explicit formatting information represent the
> primary inputs to composition and layout processing, and

The "associated formatting information" that exists
together with coded characters is a new component of
the picture, quite abruptly introduced here. I suppose
such things as HTML tags are referred to by this phrase.
I would like to see a short discussion about plain
text, rich text, and "formatting information" somewhere
before this point in the technical report.

> The degree of glyph
> selection intelligence and the positioning of that glyph
> selection intelligence varies widely among existing
> standards and implementations.

I don't understand how an intelligence can be
positioned. Has some piece of the original text
disappeared here?

> 6.  Glyph selection
> ===================

> -- When a 0022 QUOTATION MARK """ character is
> encountered, a composition and layout process may have to
> determine whether it begins or ends a quotation and then
> choose either an opening or closing quotation mark glyph
> as appropriate. Alternatively, the process
> may select glyphs depending on the language of the text
> being formatted (or the formatting style specifications
> that apply to the content being formatted).  For example,
> German text could substitute the "!?" and "!?" glyphs
> for quotation marks; and French text, the "!?" and
> "!?" glyphs.

I don't think this is a clear-cut example of the need
to use style information and context in the composition
and layout process (which I assume is automatic). I
doubt that any automatic processor, however
sophisticated, can choose the correct form of quotation
mark in all possible cases, if only the neutral 0022
mark is used in the character data. A better approach
in applications where a high typographical quality is
expected is to ban the indiscriminate use of 0022.
Instead I think it is better in many applications that
the word processor or other text input software
guesses the correct quotation mark (of 2018 - 201F)
based on all available information at input time, and
immediately displays this mark on the screen. If the
user isn't satisfied with what he/she sees, he/she can
directly choose a better quotation mark. When the text
is stored in a file, the 10646 quotation character
actually chosen is included in the file.
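
A minimal sketch of such input-time guessing, in Python. The function
name `curl_quotes` and its rule (a mark is opening if it follows white
space or an opening bracket) are assumptions for illustration; the rule
will guess wrong in some cases, which is why the user must be able to
override the result.

```python
def curl_quotes(text, opening="\u201C", closing="\u201D"):
    """Replace each 0022 by a guessed opening or closing mark."""
    out = []
    for ch in text:
        if ch == '"':
            prev = out[-1] if out else " "
            is_opening = prev.isspace() or prev in "([{"
            out.append(opening if is_opening else closing)
        else:
            out.append(ch)
    return "".join(out)

english = curl_quotes('He said "hi" (or "bye").')
# German convention: 201E to open, 201C to close.
german = curl_quotes('"Ja"', opening="\u201E", closing="\u201C")
print(english, german)
```

The stored text then contains the characters actually chosen, so no
guessing is left for the composition and layout process.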

> -- When a 002D HYPHEN-MINUS "-" character is encountered,
> a composition and layout process may have to determine if
> it is used in a math formula, as a separator between
> figures (digits), as a separator between words, or as a
> separator between syllables. Depending on which context
> applies, it will select a minus sign, a figure dash, a
> quotation dash, or a hyphen dash (or possibly a hyphen
> point) glyph to display the character.

This example faces the same kind of criticism. Don't
rely on automatic interpretation of your intentions,
instead include the correct character in the text when
writing it. Automatic choice of dash/minus/hyphen form
in the composition and layout process may be necessary
in some cases, but it should not be held up as the only
way of handling this problem in the report.
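
For reference, the explicit characters can be looked up by name in
Python. The pairing of each use mentioned in the draft with a code
position is an assumption, and the names come from the `unicodedata`
module of the Python standard library.

```python
import unicodedata

# Use named in the draft -> explicit character (assumed pairing).
EXPLICIT = {
    "minus sign": "\u2212",
    "figure dash": "\u2012",
    "quotation dash": "\u2015",
    "hyphen": "\u2010",
    "hyphen point": "\u2027",
}

names = {use: unicodedata.name(ch) for use, ch in EXPLICIT.items()}
for use, name in names.items():
    print(f"{ord(EXPLICIT[use]):04X}  {name}  ({use})")
```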

Better examples of the general thesis are perhaps:
-- the choice of final or non-final form of "s" in
   Fraktur text
-- the choice of relative size of capital letters to
   the small letters depending on whether the text is
   written in German or not
-- the distribution of white space between the words
   on a justified line of text (where SP characters
   are used in the character coded data).

> In addition, Arabic topography makes extensive use of
> ligatures.

This should be "Arabic typography" I suppose.

I will not comment on the content of the annexes this
time, other than observing that they are all labelled
"(Informative)". But isn't the technical report as a
whole informative? _Can_ it be normative?

/Olle

--
Olle Jarnefors, KTH, Stockholm, Sweden <ojarnef@admin.kth.se>