From: Kenneth Whistler [SMTP:kenw@sybase.com]
To: Edwin.Hart@jhuapl.edu
Cc: kenw@sybase.com
Subject: Re: Guidelines for deciding what to code
Sent: 10/9/00 3:34 PM
Importance: Normal

Ed,

> I had lunch with Sato-San and we discussed his concerns with the
> character-glyph model. I'd like to run some thoughts by you.
>
> His main concern appeared to be to have a document that he could hand to
> linguists to educate them and to help guide them in selecting what to
> encode for minority South Asian and Southeast Asian scripts that have not
> yet been computerized. I'm unsure if he wanted this for guidance or for
> clout.

A combination of both, I suspect. I spent some time talking to him in
Beijing about this same thing. And part of what he claims he is trying to
do is head off the development of 8-bit national standards in Asia that use
incorrect principles that won't mesh well with Unicode implementations.

> He also appears to need to deliver such a document as one of his tasks for
> his job early next year. I am unaware of such a document. Since you are a
> linguist and somewhat familiar with the coding issues, I thought that you
> might be able to help clarify my understanding of some of the concerns.
> Here are some of my notes from the conversation.
>
> The character-glyph model describes two separate domains, a character
> domain and a glyph domain, and the process to render characters into
> glyphs for presentation. The Technical Report uses the following diagram
> to describe the model:
>
> Character domain -> Glyph Selection/Rendering Process -> Glyph domain
>
> Sato-San wants to augment the concepts in the character-glyph model to (1)
> include input methods (processes for converting keystrokes into a stream
> of character codes) and (2) guidelines for coding the writing system
> elements.
> While he did not necessarily want to revise the Technical Report to
> include this material, he really wanted an authoritative reference
> document with coding guidelines that he could use in his efforts with
> language experts who had no knowledge of computers and coding.

This really isn't that complicated. Sato-san may need to make it more
complicated, in order to justify his consulting.

> He thought that input was a separate process and that deciding what should
> be coded should not depend on the input process. He wanted to define a
> complete set of functions that a generalized input method would need to
> handle all writing systems.

This is, of course, *way way* beyond the task as first described.

The essential thing that needs doing is to make it clear to people working
on minority scripts that computer input is an abstraction mediated by
complex software called a "keyboard driver" (for relatively simple cases)
or an "input method editor" (for more complicated ones). In other words:

   key(s) pressed != keyboard code returned to system

   keyboard code returned to system != encoded character stored in text

This is because a keyboard driver or input method editor interprets
combinations of keys (alt-, shift-, control-, etc.) and/or sequences of
keypresses and changes them into keyboard return codes. And not all
keyboard return codes are themselves characters -- many are interpreted as
control codes or various functions, and get filtered by many further layers
of software, some of which may themselves introduce macro elaborations,
before *some* of them get delivered to some process that interprets *some*
of the keyboard return codes as characters in a particular character
encoding.

And for that matter:

   character glyph printed on keycap != encoded character stored in text

This is because virtual keyboard handling may be completely unrelated to
the particular hardware associated with a keyboard, as well as all the
other abstractions and layers pointed out above.
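To make that layering concrete, here is a minimal sketch in Python of the
three stages described above: scan codes, driver-level modifier handling,
and dead-key composition in an input method editor. Every table, name, and
mapping here is invented purely for illustration and corresponds to no
real driver or IME API.

```python
# Layer 1: a hardware scan code is not a character.
# (Invented values; real scan codes vary by keyboard and platform.)
SCAN_TO_KEYCODE = {
    0x1E: "KEY_A",        # physical key labeled "A"
    0x2A: "KEY_SHIFT",    # modifier key: never a character by itself
}

# Layer 2: a keyboard driver interprets modifier state, so the same
# physical key can yield different keyboard return codes/characters.
def keycodes_to_chars(keycodes):
    chars = []
    shift = False
    for k in keycodes:
        if k == "KEY_SHIFT":
            shift = True          # modifier: produces no character
            continue
        if k == "KEY_A":
            chars.append("A" if shift else "a")
            shift = False
    return chars

# Layer 3: an input method editor may combine *sequences* of returned
# codes into a single encoded character (dead-key style composition).
DEAD_KEY_COMPOSE = {("\u00B4", "a"): "\u00E1"}  # acute accent + a -> á

def compose(chars):
    out, pending = [], None
    for c in chars:
        if pending is not None:
            out.append(DEAD_KEY_COMPOSE.get((pending, c), pending + c))
            pending = None
        elif c == "\u00B4":
            pending = c           # hold the dead key; emit nothing yet
        else:
            out.append(c)
    if pending is not None:
        out.append(pending)
    return "".join(out)
```

The point of the sketch is only that each stage is a genuine translation:
two keypresses (SHIFT + A) become one character, and two characters
(dead key + base letter) become one encoded character.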
Those of us who work in the industry get used to this situation, or may
even have experience programming parts of it for some system or another.
It is easy to lose track of the fact that this is all a black-box mystery
to most people who use computers. The introduction of GUIs, with more
layers of abstraction, many of which are designed to give people the
*illusion* that a keyboard or mouse action is directly wired to what is
happening on the screen, just makes it that much more difficult to break
through the illusion and describe what is actually taking place.

This stuff is not really rocket science, however. If Sato wants a
definitive source about input methods, there are any number of documents
that have been around in the industry for a long time -- all of this
predates Unicode, by the way. See Kano's Developing International Software
for Windows 95 and Windows NT, p. 202 ff., for a long discussion of East
Asian input methods in Windows. Or Sandra O'Donnell's Programming for the
World, 1994, p. 184 ff., "Displaying and Editing International Text", for a
discussion of display issues related to input method editors. I'm sure
SHARE must have a bunch of IBM NLS documentation lying around about input
methods for East Asia, too.

> His concerns were:
>
> 1. In some languages, the display order of characters and the phonetic
> order of characters are different. How should the characters be ordered in
> character strings, display order or phonetic order? I do not recall this
> question being raised before.

Of course it has been. This is an ancient issue for Middle Eastern
implementations of IBM software, for example. Cf. p. 190 from the
O'Donnell book I cited above:

"There are two fairly common ways to store mixed-direction text: in
keyboard or display order. With keyboard order (also known as logical
order), characters are stored internally in the order in which they were
entered from the keyboard.
With display order (sometimes called presentation order), characters are
stored in discrete units (usually one-line chunks) in the left-to-right
order in which they appear when rendered on the screen or on paper. ..."

The issue of the Thai ordering of vowels (typewriter order, i.e. visual
order, rather than logical order) is also a longstanding, known problem of
Thai implementations, one which exists even where right-to-left scripts are
not involved.

> Also, how should they be entered? He answered his own question. The input
> method needs to be able to handle character entry by both display order
> and phonetic order for the same language because people use both methods.

Correct. Input methods need to support whatever people customarily want to
do when entering data. That means they must often support practices that
first develop in office environments using typewriter technology, where
typing skills then have to be transferred to automated computerized
systems.

> 2. Some languages have writing elements where one of them is a doubling
> of another element. (In the Latin script, you can think of a "w" as a pair
> of "v" letters or an "m" as a pair of "n" letters. In some writing
> systems, a person normally enters the equivalent of a "w" as a pair of "v"
> elements.) Should the "w" be coded as a pair of "v" elements or a separate
> element? What happens if the person enters "vvv" in the middle of a word?
> How does software decide which 2 of the 3 should be paired (assuming "vvv"
> does not occur)? Should it be "wv" or "vw" when either may be valid?
> Sato-San gave an example in Hangul syllables, but there the consonants
> with the "double" glyph have a separate code from the ones with a single
> glyph, so this example may provide one answer to the ambiguity of "vvv" or
> "nnn". I'm just not sure if we should generalize this into a principle.

I think this is a complete non-issue. It certainly has nothing to do with
encoding per se.
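As an aside on the Thai point above, the visual-to-logical rearrangement
can be sketched very briefly. The five left-side vowels (U+0E40..U+0E44)
are stored *before* the consonant they logically follow, so a process such
as collation must swap each one with the following consonant to recover
logical order. This is a deliberately simplified sketch; a real Thai
collator does considerably more than this.

```python
# The five Thai left-side ("leading") vowels, stored in visual order
# before their consonant: SARA E, SARA AE, SARA O, SARA AI MAIMUAN,
# SARA AI MAIMALAI.
LEADING_VOWELS = {"\u0E40", "\u0E41", "\u0E42", "\u0E43", "\u0E44"}

def to_logical_order(text):
    """Swap each leading vowel with the consonant that follows it,
    turning stored (visual/typewriter) order into logical order."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2        # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

# Example: the word stored as SARA E + KO KAI + MO MA (visual order)
# is treated for sorting as KO KAI + SARA E + MO MA (logical order).
```

This is exactly the kind of processing burden the display-order model
trades for compatibility with typewriter-era input habits.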
If Sato is concerned about this for Hangul, then the issue was all worked
out years ago. See the discussion of Korean input methods in Kano's book,
p. 207 ff.

> As a first thought, the following diagram may form the basis for
> understanding the additions he is requesting. He appears to be asking to
> expand the model on the input (left) side (a) to decide what to code and
> how to code it, and (b) to decide the general processes that would be
> needed in a generalized input method.
>
> Coding Guidelines -> Character Code

This is a completely different issue. And I disagree with the way Sato
seems to be approaching it.

For the purposes of coding guidelines now for minority scripts in
Southeast Asia -- which is the main problem area -- what should be done is
to write up a succinct summary of the 3 basic encoding models available in
10646 for Brahmi-based scripts:

1. The ISCII/Devanagari model

This uses a virama to encode consonant conjuncts. It uses logical order
for all characters, and encodes no duplicated characters for "half"
character forms, conjunct parts, or special forms of RA, WA, YA, LA, HA,
etc. It encodes a separate series of independent vowel letters and a
separate series of dependent ("matra") vowels.

2. The Tibetan model

This does *not* use a virama. It uses logical order for all characters,
but encodes a separate series of "subjoined" consonants to deal with
consonant combinations. It has only a single series of vowels, which are
all dependent.

3. The Thai model

This uses display order, left-to-right, rather than logical order, since
it was developed on the basis of typewriter technology. In practice, this
means that there are a small number of "left-side" vowels that must be
rearranged by processes such as collation, in order to get correct results
based on the logical order of syllable sequences.

All three models make extensive use of combining marks. The Thai model is
used to encode Thai and Lao. The Tibetan model is used to encode Tibetan.
The ISCII/Devanagari model is used to encode all other Indic scripts, as
well as Sinhala, Khmer, and Myanmar. (And it is the preferred model for
newly encoded Brahmi-derived scripts, unless there is a compelling reason
to do otherwise.)

Anyone proposing to encode another Brahmi-derived script (and almost every
one of relevance to Sato's concern is Brahmi-based) should *first* study
these three models, and then, on a principled basis, choose one of the
three as the basis for encoding that script. That is effectively the exact
advice that Michael Everson, Rick McGowan, and I give to each proponent of
another script encoding. The most recent example is our work with the
Chinese committee on getting 2 different Dai scripts encoded in 10646.

What people should *not* do is suggest encoding a repertoire of *glyphs*
without understanding the relationship between characters and rendered
glyphs.

>          |
>          v
> Input Process -> Character domain -> Glyph Selection/Rendering Process ->
> Glyph domain

For this, I understand that Sato-san needs to get a definitive document
together that he can hand to people to digest. However, he ought to be
able to do this from available documentation. There is tons of this stuff
in the accumulated proceedings from the Unicode Conferences, for example.
You could send Sato a copy of the tutorials from the 1999 and/or 2000
conference, for example. Your own tutorial from the 1999 conference has
some of these answers. And Sato could do worse than to digest and reconvey
some of the information that Richard Ishida regularly imparts in his
tutorial on Non-Latin Writing Systems.

--Ken

> Thanks for your thoughts,
> Ed