From: Kenneth Whistler [SMTP:kenw@sybase.com]
To: Edwin.Hart@jhuapl.edu
Cc: kenw@sybase.com
Subject: Re: Guidelines for deciding what to code
Sent: 10/9/00 3:34 PM
Importance: Normal

Ed,

> I had lunch with Sato-San and we discussed his concerns with the
> character-glyph model. I'd like to run some thoughts by you.
>
> His main concern appeared to be to have a document that he could hand to
> linguists to educate them and to help guide them in selecting what to
> encode for minority South Asian and Southeast Asian scripts that have not
> yet been computerized. I'm unsure if he wanted this for guidance or for
> clout.

A combination of both, I suspect. I spent some time talking to him in
Beijing about this same thing. And part of what he claims he is trying to
do is head off the development of 8-bit national standards in Asia that use
incorrect principles that won't mesh well with Unicode implementations.

> He also appears to need to deliver such a document as one of his tasks for
> his job early next year. I am unaware of such a document. Since you are a
> linguist and somewhat familiar with the coding issues, I thought that you
> might be able to help clarify my understanding of some of the concerns.
> Here are some of my notes from the conversation.
>
> The character-glyph model describes two separate domains, a character
> domain and a glyph domain, and the process to render characters into
> glyphs for presentation. The Technical Report uses the following diagram
> to describe the model:
>
> Character domain -> Glyph Selection/Rendering Process -> Glyph domain
>
> Sato-San wants to augment the concepts in the character-glyph model to (1)
> include input methods (processes for converting keystrokes into a stream
> of character codes) and (2) guidelines for coding the writing system
> elements.
> While he did not necessarily want to revise the Technical Report to
> include this material, he really wanted an authoritative reference
> document with coding guidelines that he could use in his efforts with
> language experts who had no knowledge of computers and coding.

This really isn't that complicated. Sato-san may need to make it more
complicated, in order to justify his consulting.

> He thought that input was a separate process and that deciding what should
> be coded should not depend on the input process. He wanted to define a
> complete set of functions that a generalized input method would need to
> handle all writing systems.

This is, of course, *way way* beyond the task as first described.

The essential thing that needs doing is to make it clear to people working
on minority scripts that computer input is an abstraction mediated by
complex software called a "keyboard driver" (for relatively simple cases)
or an "input method editor" (for more complicated ones). In other words:

   key(s) pressed != keyboard code returned to system

   keyboard code returned to system != encoded character stored in text

This is because a keyboard driver or input method editor interprets
combinations of keys (alt-, shift-, control-, etc.) and/or sequences of
keypresses and changes them into keyboard return codes. And not all
keyboard return codes are themselves characters -- many are interpreted as
control codes or various functions, and get filtered by many further layers
of software, some of which may themselves introduce macro elaborations,
before *some* of them get delivered to some process that interprets *some*
of the keyboard return codes as characters in a particular character
encoding.

And for that matter:

   character glyph printed on keycap != encoded character stored in text

This is because virtual keyboard handling may be completely unrelated to
the particular hardware associated with a keyboard, as well as all the
other abstractions and layers pointed out above.
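To make that layering concrete, here is a minimal sketch in Python of the
three stages described above: scan codes, driver-level modifier handling,
and dead-key composition in an input method editor. Every table, name, and
mapping here is invented purely for illustration and corresponds to no
real driver or IME API.

```python
# Layer 1: a hardware scan code is not a character.
# (Invented values; real scan codes vary by keyboard and platform.)
SCAN_TO_KEYCODE = {
    0x1E: "KEY_A",        # physical key labeled "A"
    0x2A: "KEY_SHIFT",    # modifier key: never a character by itself
}

# Layer 2: a keyboard driver interprets modifier state, so the same
# physical key can yield different keyboard return codes/characters.
def keycodes_to_chars(keycodes):
    chars = []
    shift = False
    for k in keycodes:
        if k == "KEY_SHIFT":
            shift = True          # modifier: produces no character
            continue
        if k == "KEY_A":
            chars.append("A" if shift else "a")
            shift = False
    return chars

# Layer 3: an input method editor may combine *sequences* of returned
# codes into a single encoded character (dead-key style composition).
DEAD_KEY_COMPOSE = {("\u00B4", "a"): "\u00E1"}  # acute accent + a -> á

def compose(chars):
    out, pending = [], None
    for c in chars:
        if pending is not None:
            out.append(DEAD_KEY_COMPOSE.get((pending, c), pending + c))
            pending = None
        elif c == "\u00B4":
            pending = c           # hold the dead key; emit nothing yet
        else:
            out.append(c)
    if pending is not None:
        out.append(pending)
    return "".join(out)
```

The point of the sketch is only that each stage is a genuine translation:
two keypresses (SHIFT + A) become one character, and two characters
(dead key + base letter) become one encoded character.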
Those of us who work in the industry get used to this situation, or may
even have experience programming parts of it for some system or another.
It is easy to lose track of the fact that this is all a black-box mystery
to most people who use computers. The introduction of GUIs, with more
layers of abstraction, many of which are designed to give people the
*illusion* that a keyboard or mouse action is directly wired to what is
happening on the screen, just makes it that much more difficult to break
through the illusion and describe what is actually taking place.

This stuff is not really rocket science, however. If Sato wants a
definitive source about input methods, there are any number of documents
that have been around in the industry for a long time -- all of this
predates Unicode, by the way. See Kano's Developing International Software
for Windows 95 and Windows NT, p. 202 ff., for a long discussion of East
Asian input methods in Windows. Or Sandra O'Donnell's Programming for the
World, 1994, p. 184 ff., "Displaying and Editing International Text", for a
discussion of display issues related to input method editors. I'm sure
SHARE must have a bunch of IBM NLS documentation lying around about input
methods for East Asia, too.

> His concerns were:
>
> 1. In some languages, the display order of characters and the phonetic
> order of characters are different. How should the characters be ordered in
> character strings, display order or phonetic order? I do not recall this
> question being raised before.

Of course it has been. This is an ancient issue for Middle Eastern
implementations of IBM software, for example. Cf. p. 190 from the
O'Donnell book I cited above:

"There are two fairly common ways to store mixed-direction text: in
keyboard or display order. With keyboard order (also known as logical
order), characters are stored internally in the order in which they were
entered from the keyboard.
With display order (sometimes called presentation order), characters are
stored in discrete units (usually one-line chunks) in the left-to-right
order in which they appear when rendered on the screen or on paper. ..."

The issue of the Thai ordering of vowels (typewriter order, i.e. visual
order, rather than logical order) is also a longstanding, known problem of
Thai implementations, one which exists even where right-to-left scripts are
not involved.

> Also, how should they be entered? He answered his own question. The input
> method needs to be able to handle character entry by both display order
> and phonetic order for the same language because people use both methods.

Correct. Input methods need to support whatever people customarily want to
do when entering data. That means they must often support practices that
first develop in office environments using typewriter technology, where
typing skills then have to be transferred to automated computerized
systems.

> 2. Some languages have writing elements where one of them is a doubling
> of another element. (In the Latin script, you can think of a "w" as a pair
> of "v" letters or an "m" as a pair of "n" letters. In some writing
> systems, a person normally enters the equivalent of a "w" as a pair of "v"
> elements.) Should the "w" be coded as a pair of "v" elements or a separate
> element? What happens if the person enters "vvv" in the middle of a word?
> How does software decide which 2 of the 3 should be paired (assuming "vvv"
> does not occur)? Should it be "wv" or "vw" when either may be valid?
> Sato-San gave an example in Hangul syllables, but there the consonants
> with the "double" glyph have a separate code from the ones with a single
> glyph, so this example may provide one answer to the ambiguity of "vvv" or
> "nnn". I'm just not sure if we should generalize this into a principle.

I think this is a complete non-issue. It certainly has nothing to do with
encoding per se.
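As an aside on the Thai point above, the visual-to-logical rearrangement
can be sketched very briefly. The five left-side vowels (U+0E40..U+0E44)
are stored *before* the consonant they logically follow, so a process such
as collation must swap each one with the following consonant to recover
logical order. This is a deliberately simplified sketch; a real Thai
collator does considerably more than this.

```python
# The five Thai left-side ("leading") vowels, stored in visual order
# before their consonant: SARA E, SARA AE, SARA O, SARA AI MAIMUAN,
# SARA AI MAIMALAI.
LEADING_VOWELS = {"\u0E40", "\u0E41", "\u0E42", "\u0E43", "\u0E44"}

def to_logical_order(text):
    """Swap each leading vowel with the consonant that follows it,
    turning stored (visual/typewriter) order into logical order."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2        # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

# Example: the word stored as SARA E + KO KAI + MO MA (visual order)
# is treated for sorting as KO KAI + SARA E + MO MA (logical order).
```

This is exactly the kind of processing burden the display-order model
trades for compatibility with typewriter-era input habits.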
If Sato is concerned about this for Hangul, then the issue was all worked
out years ago. See the discussion of Korean input methods in Kano's book,
p. 207 ff.

> As a first thought, the following diagram may form the basis for
> understanding the additions he is requesting. He appears to be asking to
> expand the model on the input (left) side (a) to decide what to code and
> how to code it, and (b) to decide the general processes that would be
> needed in a generalized input method.
>
> Coding Guidelines -> Character Code

This is a completely different issue. And I disagree with the way Sato
seems to be approaching it.

For the purposes of coding guidelines now for minority scripts in
Southeast Asia -- which is the main problem area -- what should be done is
to write up a succinct summary of the 3 basic encoding models available in
10646 for Brahmi-based scripts:

1. The ISCII/Devanagari model

This uses a virama to encode consonant conjuncts. It uses logical order
for all characters, and encodes no duplicated characters for "half"
character forms, conjunct parts, or special forms of RA, WA, YA, LA, HA,
etc. It encodes a separate series of independent vowel letters and a
separate series of dependent ("matra") vowels.

2. The Tibetan model

This does *not* use a virama. It uses logical order for all characters,
but encodes a separate series of "subjoined" consonants to deal with
consonant combinations. It has only a single series of vowels, which are
all dependent.

3. The Thai model

This uses display order, left-to-right, rather than logical order, since
it was developed on the basis of typewriter technology. In practice, this
means that there are a small number of "left-side" vowels that must be
rearranged by processes such as collation, in order to get correct results
based on the logical order of syllable sequences.

All three models make extensive use of combining marks. The Thai model is
used to encode Thai and Lao. The Tibetan model is used to encode Tibetan.
The ISCII/Devanagari model is used to encode all other Indic scripts, as
well as Sinhala, Khmer, and Myanmar. (And it is the preferred model for
newly encoded Brahmi-derived scripts, unless there is a compelling reason
to do otherwise.)

Anyone proposing to encode another Brahmi-derived script (and almost every
one of relevance to Sato's concern is Brahmi-based) should *first* study
these three models, and then, on a principled basis, choose one of the
three as the basis for encoding that script. That is effectively the exact
advice that Michael Everson, Rick McGowan, and I give to each proponent of
another script encoding. The most recent example is our work with the
Chinese committee on getting 2 different Dai scripts encoded in 10646.

What people should *not* do is suggest encoding a repertoire of *glyphs*
without understanding the relationship between characters and rendered
glyphs.

>          |
>          v
> Input Process -> Character domain -> Glyph Selection/Rendering Process ->
> Glyph domain

For this, I understand that Sato-san needs to get a definitive document
together that he can hand to people to digest. However, he ought to be
able to do this from available documentation. There is tons of this stuff
in the accumulated proceedings from the Unicode Conferences, for example.
You could send Sato a copy of the tutorials from the 1999 and/or 2000
conference, for example. Your own tutorial from the 1999 conference has
some of these answers. And Sato could do worse than to digest and reconvey
some of the information that Richard Ishida regularly imparts in his
tutorial on Non-Latin Writing Systems.

--Ken

> Thanks for your thoughts,
> Ed