From: ojarnef
To: Edwin Hart; Alan Griffee
Cc: iso10646; ojarnef
Subject: Comments on WD "An operational model for characters and glyphs"
Date: Sunday, December 01, 1996 01:02

The following are my personal comments on the ISO document

> ISO/IEC JTC 1/SC 2 N2746
> 18 August, 1996
> Title: August 1996 working draft of TR 15285, "An operational model for
> characters and glyphs"
> Source: Edwin Hart and Alan Griffee (acting editors)
> Status: Contribution by the acting editors
> Action: For review and comments by 30 November, 1996
> Distribution: ISO/IEC JTC 1/SC 2, SC 2 liaisons, and SC 18

Digression about Internet availability of the draft:

That draft is available in MS Word format from
<ftp://ftp.jhuapl.edu/pub/cgmodel/cgm9608.doc;type=i>
and in PostScript format
<ftp://ftp.jhuapl.edu/pub/cgmodel/cgm9608.ps;type=i>.
(Other formats can also be found there.)

I have prepared a plain text version -- lacking non-ASCII characters
and diagrams -- from which I include quotes in this message. It can
be accessed at
<ftp://ftp.admin.kth.se/pub/misc/ucs/sc2-n2746.txt;type=a>
Metadata about this file are in
<ftp://ftp.admin.kth.se/pub/misc/ucs/sc2-n2746.txt.M;type=a>

End of digression.

Digression about formalities, skip this if you're only interested in
the technical comments:

Unfortunately I haven't had time to let the Swedish standardization
body SIS-ITS review my comments. Therefore I send them directly to
the document editors by email. If it's deemed appropriate, I can
convert the following comments to a paper document as an "expert
contribution" to SC2.

I also send this message to the discussion list in the hope of
generating some public discussion (the traffic to that list has
recently mostly been about other standards than 10646). I don't
crosspost to the list because I regard the character-glyph model as
primarily a matter for open standardization work rather than for a
semi-closed consortium.
I don't crosspost to the list because it seems to have a
semi-official character and I'm not sure a discussion not confined
to national body representatives to SC2 is welcomed there.

End of digression.

In the following, lines starting with ">" are quoted from the draft
technical report. I have replaced non-ASCII characters with the
marker "!?". Lines starting with "/" are my suggestions for new text
to replace text in the draft. Lines starting with "+" are my
suggestions for new text to be added to the draft.

> Introduction
> ------------
> People recognize and process characters[1] by their shapes.
> Thus, people normally closely associate a character[2] and
> its shape. Information technology, in contrast, makes
> distinctions between the concepts of a character's[3]
> meaning (the "character"[4]) and its shape (the "glyph").
> The close association people make between characters[5] and
> glyphs, and the distinction made by information
> technology have produced a conflict that has led to
> misunderstanding and confusion.

The third sentence talks about a character's "meaning". But normally
individual characters don't have a definite meaning, not in the way
individual words of a language have meaning. What _is_ common for
all specimens of a certain character then? Take the letter "d" as an
example. What's common to all individual d's is, in my view, that
they (and they only) can fulfil the same _function_ when writing
words which are spelt with this letter. Individual characters are
not "elements of meaning", I would say, but elements of meaningful
written linguistic expressions, particularly words.

The distinction between sign and meaning is fundamental to
semantics. To me it's clear that letters belong to the sign side of
this dichotomy, not the meaning side. And the same is true for
digits. The digit "1" can't be identified with the meaning "the
least positive integer".
Often it means the number one, but in "123" it means the number 100,
and in other contexts it may have no relation at all to numerical
quantities. In my opinion these observations generalize to almost
all graphic characters of ISO 10646.

Digression about how to define the idealized concept of (graphic)
character:

It's true, however, that meaning plays an important role in the
demarcation of different characters, i.e. in any definition of the
idealized character concept. The draft rightly emphasizes that the
abstraction from concrete marks on e.g. a paper to abstract _glyphs_
ideally should be based only on consideration of geometrical shapes.
Shape is important also in the abstraction from concrete marks to
_characters_, though only indirectly. Meaning is the important
consideration, and I offer the following attempted definition to
show how:

A character is a mathematical set of physical marks such that any of
them can be substituted for any other without changing the meaning
of the text where it occurs.

(Here I ignore complications not relevant to the scope of the
technical report, such as the atomicity of characters and the
dependence of some existing character distinctions upon the writing
system used. Therefore the modest label "attempted definition".)

Three things should be noted:

1) Indirectly the _shape_ of concrete marks is important for which
characters they are realizations of, because shape distinctions are
essential for the ability of humans to discern contrasts of meaning
between similar text pieces.

2) This definition isn't my free invention. It's actually equivalent
to standard definitions in linguistics of the concept of _grapheme_.

3) It defines an idealized character concept. The _pragmatic_
character concept should be defined as "any entity coded in a coded
character set standard" (who knows what kinds of things might have
been included in some coded character set defined somewhere by some
crazy engineers? or will be in the future).
This definition needn't be circular. A "coded character set
standard" can be defined as a standard that characterizes itself
with the expression "coded character set standard" or an equivalent
label.

End of digression.

To return to the text quoted above, it uses the word "character" in
two or possibly three different senses, which unfortunately can add
to the very confusion it describes:

-- In the occurrences 1, 2, and 5 the word "character" means some
abstraction of physical shapes used in writing.

-- In occurrence 3 it means a _combination_ of a meaning (whatever
that may be) and a physical shape (probably not an individual
physical shape on e.g. a certain piece of paper, but an "abstract"
shape).

-- In occurrence 4 it seems to mean some kind of meaning, not a
thing that _has_ a meaning.

This double/triple use of one word also accounts for the paradoxical
wording about the "character"[4] being an aspect of the
character[3].

A replacement for the quoted text could be something like this:

/ In all reading and writing of text people recognize the
/ individual physical marks read or produced on the
/ writing surface as different realizations of abstract
/ letters, ideographs, digits, symbols, and other
/ characters. The digital representation of these
/ entities is the main task of SC2 standards for coded
/ character sets. Another kind of abstract entity
/ related to the physical marks of concrete text, the
/ glyph, is central to SC18 standardization of font
/ technology. The relations between the two concepts of
/ character and glyph, which are easy to confuse, are
/ the subject of this technical report.

The text in the draft continues:

> The successful
> promulgation and implementation of character coding, text
> editing, presentation and publication standards require
> an understanding of the appropriate use of character
> codes and glyph identifiers.

I don't think it's necessary to introduce the technical notions of
"character code" and "glyph identifier" at this early point in the
report. Furthermore, it should be mentioned that in certain kinds of
simple data processing, the distinction between character and glyph
isn't needed. Proposed new text:

/ The successful promulgation and implementation of
/ character coding, text editing, presentation and
/ publication standards require an understanding of the
/ distinction between characters and glyphs, except for
/ those simple applications where it is acceptable that
/ the same glyph is always used for the same character.

> 4. Character and glyph distinctions
> ====================================
>
> The character and glyph definitions in clause 3, which
> were taken from ISO/IEC 10646 and ISO/IEC 9541, were
> developed independently and contain terminology that
> requires harmonization and explanation.

"Harmonization" of two terminologies, as distinct from mere
explanation, to me suggests that definitions of some terms are
changed or new terms are introduced with new definitions. Is that
part of the purpose of this technical report?

> In information technology, characters are abstract
> information elements in the domain of coding for data
> interchange.

This is a statement about information technology in general, not
restricted to coded character set standards. Therefore I believe
it's more correct to write "... the domain of coding for data
representation, particularly data interchange". Much text stored in
a computer never leaves the local system, it's never interchanged,
and still it is coded according to coded character set standards.

> Coded character set standards assign
> numeric values, character names (descriptive text), and
> representative (sample) images to each character
> contained in a coded character set.

The significance of the parenthesis in "character names (descriptive
text)" is unclear.
I think it would be better to leave it out and instead add the
sentence:

+ Typically a character is given a multi-word name which
+ also serves as an adequate description of the
+ character, making it clear how it differs from the
+ other characters of the coded character set.

> The precise
> semantics and appearance of the information elements in
> any given implementation are not defined by those coded
> character set standards.

As I explained above I don't think that characters have any
semantics (meanings). What I think is the important thing to say
here is that coded character set standards don't include explicit
criteria for drawing the line between similar but distinct
characters. Possible new formulation:

/ Criteria for the demarcation between closely related
/ characters, to aid decisions about which characters to
/ choose for representing a particular text, are not
/ included in those coded character set standards, other
/ than the guidance given by the character name and one
/ concrete example of the character.

> The ISO/IEC 10646
> standard recognizes the distinction between characters
> and their visual representation by defining the term
> "graphic symbol". The "graphic symbols" of SC 2
> standards and the "glyphs" of SC 18 standards represent
> equivalent concepts.

Is this really true? As I read the SC2 definition of "graphic
symbol"

> 3.12 graphic symbol : The visual representation of a
> graphic character or of a composite sequence. (ISO/IEC
> 10646-1: 1993). [See the definition of "glyph".]

it may very well be interpreted to refer to the _concrete_ physical
mark used to represent a character on a particular paper or on a
screen at a certain point of time. The SC18 concept of "glyph" is an
_abstract_ image, it is abstracted in some other way than characters
are, and I have never seen any discussions about this alternative
abstraction process in SC2 contexts.
I thus believe that the "concrete" interpretation of the SC2 concept
of "graphic symbol" that I have formulated here is more plausible.
In that case "graphic symbol" should be equated with the SC18
concept of "glyph image", not "glyph" (although the SC2 concept has
wider applicability than the SC18 concept, being relevant also for
hand-written text).

> The historical association of characters and glyphs has
> resulted in character sets maintaining distinctions that
> cannot be founded on distinctions in content, but only
> distinctions in form; similarly, the glyph registration
> authority and the SC 18 font resource model have made use
> of criteria based on content to abstract potential
> distinctions in form.

This is a very important point. It may be obscured for many readers
by the use of the notoriously ambiguous words "form" and "content".
I would prefer a wording such as the following:

/ The historical association of characters and glyphs has
/ resulted in character sets maintaining distinctions
/ that cannot be motivated by the capacity of the
/ distinguished characters to cause a contrast in meaning
/ in a text. Exchanging them for each other will only
/ change the appearance of the text. Similarly, the glyph
/ registration authority and the SC 18 font resource
/ model have made use of criteria based on meaning, not
/ shape, to abstract distinctions between glyphs.

> For example, in ISO/IEC
> 10646-1, SC 2 coded the glyph FB03 LATIN SMALL LIGATURE
> FFI "!?" for round-trip integrity with other standards.
> (See B.4 The "round-trip rule" on page 13.).

I would prefer a simpler example than this, which involves a
compatibility character that is equivalent to a _sequence_ of
"genuine" characters, not a single character. The preceding text
doesn't mention this complication. Why not use FF21 FULLWIDTH LATIN
CAPITAL LETTER A and 0041 LATIN CAPITAL LETTER A as an example?

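The difference between the two candidate examples can be made
concrete with a minimal sketch in Python. It assumes the character
database that ships with Python's standard `unicodedata` module,
which is much later than the draft under review and is not part of
ISO/IEC 10646 itself; the helper name `compat_mapping` is made up
for this illustration.

```python
import unicodedata

def compat_mapping(ch):
    """Return the decomposition mapping of ch as a list of hex code points.

    unicodedata.decomposition() returns a string such as
    "<compat> 0066 0066 0069"; the tag in angle brackets is dropped here.
    """
    decomp = unicodedata.decomposition(ch)
    return [part for part in decomp.split() if not part.startswith("<")]

# FB03 LATIN SMALL LIGATURE FFI maps to a sequence of three characters.
ligature = compat_mapping("\ufb03")

# FF21 FULLWIDTH LATIN CAPITAL LETTER A maps to one single character.
fullwidth = compat_mapping("\uff21")

print(ligature)   # three code points: f, f, i
print(fullwidth)  # one code point: A
```

Under these assumptions the ligature maps to a sequence while the
fullwidth letter maps to exactly one character, which is why the
latter would make the simpler example.
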
> Also, the
> SC 18 Registration Authority (AFII) for ISO/IEC 10036
> could have registered the same glyph identifier for the
> "!?" glyph and used it for both the 212B ANGSTROM SIGN
> "!?" character and the 00C5 LATIN CAPITAL LETTER A WITH
> RING ABOVE "!?" character. However, AFII instead
> registered two glyph identifiers.

This is a needlessly confusing example, since it involves also a
false distinction between _characters_ in UCS. 212B and 00C5 are
different characters only because they are included as such in some
coded character set standard, viz. ISO 10646. (00C5 is a genuine
character, 212B is a compatibility character.) It's as absurd to
regard these as different characters as it is to say that in the
sentence

   The speed of light in vacuum is exactly 299792458 m/s.

the "metre symbol" in "m/s" is another character than the ordinary
letter "m" in "vacuum". (Can anybody clarify if some earlier
standard also made the distinction between 212B and 00C5? That would
at least motivate their inclusion into UCS by the round-trip rule.)

This particular false character distinction has to do with treating
the same character as different characters depending on what
_function_ it fulfils (letter in a word, or symbol), not with
confusing glyph distinctions with character distinctions. It
therefore falls outside the scope of this technical report, and this
example should be removed, both here and in section E.1.

Furthermore, this example was supposed to show problems with SC18
_glyph_ distinctions, not SC2 character distinctions. A better
example is needed. I don't have access to the glyph registry,
unfortunately, but I suspect that different glyphs have been
registered for Latin capital A, Cyrillic capital A, and Greek
capital alpha. If that's the case, it would provide an excellent
example for this place in the technical report.

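The character side of these two cases can be checked with a small
Python sketch. It says nothing about the AFII glyph registry, which
it cannot inspect. It assumes the normalization data in Python's
standard `unicodedata` module, which is later than the draft and
not part of ISO/IEC 10646-1:1993; the variable names are chosen
only for this illustration.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"  # 212B ANGSTROM SIGN
A_WITH_RING = "\u00c5"    # 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# Canonical normalization folds 212B into 00C5, i.e. the two are
# treated as the same character for comparison purposes.
folded = unicodedata.normalize("NFC", ANGSTROM_SIGN)
same_after_nfc = (folded == A_WITH_RING)

# Latin A, Cyrillic A and Greek Alpha look alike in many fonts but
# remain three distinct characters even after the strongest
# normalization form.
lookalikes = ["\u0041", "\u0410", "\u0391"]
names = [unicodedata.name(c) for c in lookalikes]
distinct_after_nfkc = len({unicodedata.normalize("NFKC", c) for c in lookalikes})

print(same_after_nfc)
print(names)
print(distinct_after_nfkc)
```

So 212B/00C5 is a distinction that later practice removes at the
character level, whereas the three look-alike capitals are genuine
character distinctions that a glyph registry could choose to
collapse.
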
> Within the realm of information technology, an ideal
> characterization of characters and glyphs and their
> relationship may be stated as follows:
> ...
> -- One or more characters may be depicted by no, one, or
> multiple glyph representations (instances of an abstract
> glyph) in a way that may depend on the context.
>
> The relationship between coded characters and glyph
> identifiers may be one-to-one, one-to-many, many-to-one,
> or many-to-many. In its fully general form, it is a
> context-sensitive M-to-N mapping where M > 0, N >= 0.

Unfortunately, this is too simple a picture. We actually have two
different kinds of multiplicity:

1) Some characters can be realized by a combination of 2 or more
glyphs, such as the 0132 LATIN CAPITAL LIGATURE IJ.

2) Other characters can be realized by different single glyphs in
the same font, depending on the context, such as the four different
glyphs needed for each Arabic letter in row 06 of UCS, depending on
the character's position in the beginning, middle, or end of a word,
or in isolation.

In my opinion the text of the technical report should mention this
complication.

> (For some characters in ISO/IEC 10646-1, no glyph can be
> defined, for example, the ZERO WIDTH NO-BREAK SPACE.)

I would say that ZERO WIDTH NO-BREAK SPACE and all the characters in
the range 200B - 200F, 2028 - 202E, 206A - 206F are not _graphic_
characters but _control_ characters: They don't correspond to any
glyph, not even some amount of white space. On the other hand, they
have various other useful effects on the organization or control of
data. This is similar to the roles played by the control functions
of ISO 6429. And SC2 doesn't recognize any third category of
characters, besides graphic characters and control characters. This
makes ZERO WIDTH NO-BREAK SPACE fall outside the scope of the
technical report, I think.
It would be an improvement to include a paragraph early in section 4
that explains the difference between graphic characters and control
characters and states that the report is only concerned with graphic
characters, using the word "character" as an abbreviation of
"graphic character".

> This is particularly true for ISO/IEC 10646
> implementation level 3, which uses combining characters.

This sentence is probably difficult to understand for many readers.
It could either be removed or expanded to a full paragraph,
describing the particular complications with font support for
combining characters.

> 5.2. Composition, layout, and presentation
> -------------------------------------------
>
> The composition and layout process spans both processing
> domains. See Figure 2.

I suppose the concepts "composition" and "layout" have well-defined
SC18 meanings. Spontaneously I myself think of composition as the
process of creating new text or data, normally performed by a human
user (entering data or editing text). But it's clear from Figure 2
that composition as the word is used here is something else, needed
for the output of text, probably a fully automatic process. Perhaps
it would be possible to include definitions of these terms in
section 3, or at least include a discussion in section 5.2 of their
meanings as used here?

> Glyph selection is the process of selecting (possibly
> through several iterations) the most appropriate glyph
> identifier or combination of glyph identifiers to render
> a coded character or composite sequence of coded
> characters. Coded characters and their associated
> implicit or explicit formatting information represent the
> primary inputs to composition and layout processing, and

The "associated formatting information" that exists together with
coded characters is a new component of the picture, quite abruptly
introduced here. I suppose such things as HTML tags are referred to
by this phrase.
I would like to see a short discussion about plain text, rich text,
and "formatting information" somewhere before this point in the
technical report.

> The degree of glyph
> selection intelligence and the positioning of that glyph
> selection intelligence varies widely among existing
> standards and implementations.

I don't understand how an intelligence can be positioned. Has some
piece of the original text disappeared here?

> 6. Glyph selection
> ===================
> -- When a 0022 QUOTATION MARK """ character is
> encountered, a composition and layout process may have to
> determine whether it begins or ends a quotation and then
> choose either an opening or closing quotation mark glyph
> as appropriate. Alternatively, the process
> may select glyphs depending on the language of the text
> being formatted (or the formatting style specifications
> that apply to the content being formatted). For example,
> German text could substitute the "!?" and "!?" glyphs
> for quotation marks; and French text, the "!?" and
> "!?" glyphs.

I don't think this is a clear-cut example of the need to use style
information and context in the composition and layout process (which
I assume is automatic). I doubt that any automatic processor,
however sophisticated, can choose the correct form of quotation mark
in all possible cases, if only the neutral 0022 mark is used in the
character data. A better approach in applications where a high
typographical quality is expected is to ban the indiscriminate use
of 0022. Instead I think it is better in many applications that the
word processor or other text input software guesses the correct
quotation mark (of 2018 - 201F) based on all available information
at input time, and immediately displays this mark on the screen. If
the user isn't satisfied with what he/she sees, he/she can directly
choose a better quotation mark. When the text is stored in a file,
the 10646 quotation character actually chosen is included in the
file.

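The input-time approach can be sketched roughly as follows, in
Python. This is only an illustration under simplifying assumptions:
the function names `guess_quote` and `type_text` are hypothetical,
only the English-style pair 201C/201D is considered, the heuristic
looks at nothing but the preceding character, and the step where the
user overrides a wrong guess is not modelled.

```python
OPEN_DOUBLE = "\u201c"   # 201C LEFT DOUBLE QUOTATION MARK
CLOSE_DOUBLE = "\u201d"  # 201D RIGHT DOUBLE QUOTATION MARK

def guess_quote(text_before):
    """Guess which quotation character a typed 0022 should become.

    Assumed heuristic: at the start of the text, after white space, or
    after an opening bracket the mark opens a quotation; otherwise it
    closes one.
    """
    if not text_before or text_before[-1].isspace() or text_before[-1] in "([{":
        return OPEN_DOUBLE
    return CLOSE_DOUBLE

def type_text(keystrokes):
    """Simulate text entry, replacing each typed 0022 at input time."""
    out = []
    for ch in keystrokes:
        if ch == '"':
            out.append(guess_quote("".join(out)))
        else:
            out.append(ch)
    # The stored text contains the chosen characters, not 0022.
    return "".join(out)

stored = type_text('He said "hi" twice')
print(stored)
```

The point of the design is that the guess is made once, where the
user can see and correct it, and the stored character data then
carries the decision, so no later composition process has to
reconstruct it.
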
> -- When a 002D HYPHEN-MINUS "-" character is encountered,
> a composition and layout process may have to determine if
> it is used in a math formula, as a separator between
> figures (digits), as a separator between words, or as a
> separator between syllables. Depending on which context
> applies, it will select a minus sign, a figure dash, a
> quotation dash, or a hyphen dash (or possibly a hyphen
> point) glyph to display the character.

This example faces the same kind of criticism. Don't rely on
automatic interpretation of your intentions, instead include the
correct character in the text when writing it. Automatic choice of
dash/minus/hyphen form in the composition and layout process may be
necessary in some cases, but it should not be held up as the only
way of handling this problem in the report.

Better examples of the general thesis are perhaps:

-- the choice of final or non-final form of "s" in Fraktur text

-- the choice of relative size of capital letters to the small
letters depending on whether the text is written in German or not

-- the distribution of white space between the words on a justified
line of text (where SP characters are used in the character coded
data).

> In addition, Arabic topography makes extensive use of
> ligatures.

This should be "Arabic typography" I suppose.

I will not comment on the content of the annexes this time, other
than observing that they are all labelled "(Informative)". But isn't
the technical report as a whole informative? _Can_ it be normative?

/Olle

--
Olle Jarnefors, KTH, Stockholm, Sweden