Re: Indic editing (was: RE: The real solution)

From: James E. Agenbroad (jage@loc.gov)
Date: Mon Nov 26 2001 - 11:53:33 EST

Previous message: Kenneth Whistler: "Re: Indic editing (was: RE: The real solution)"
In reply to: Marco Cimarosti: "Indic editing (was: RE: The real solution)"
Next in thread: Marco Cimarosti: "RE: Indic editing (was: RE: The real solution)"

                                              Monday, November 26, 2001
It seems to me that we have three separate domains to deal with:
     1. What should be keyed as input of Indic scripts, mainly Devanagari?
     2. How shall Indic scripts data be stored and exchanged?
     3. How should Indic scripts be displayed on screens and in print?

ISCII and Unicode are not concerned with the first. They are very
concerned with the second. There may be general agreement on the third,
but a variety of output devices are involved. Unless ISCII changed from a
phonetic-based approach to a graphic-based one, I doubt that Unicode and ISO
10646 would even consider doing so. Having attended a meeting in 1982 of
those who drafted ISCII I doubt that this will happen. Might it be
possible to key data in user oriented glyph/graphic fashion and then
convert it to a phonetic encoding for storage, processing and sharing? And
then, for rendering, convert it from phonetic encoding to whatever the
local display needed (OS, fonts, etc.) for human consumption? I do not
know if agreement could be achieved on keyboard layouts for Indian
scripts; though desirable to facilitate mobility of those with keying
skills, such standardization may not be necessary--both qwerty and Dvorak
keyboards can result in ASCII data.
Regards,
Jim Agenbroad (disclaimer and addresses at bottom)
On Mon, 26 Nov 2001, Marco Cimarosti wrote:

> As we all know, Unicode is a "logical" encoding, in the sense that it
> assigns codes to "abstract characters", rather than to the actual signs
> ("glyphs") which are visible on a printed page. This design principle has
> been chosen because it makes all non-visual text processing much easier.
>
> Recently, this principle has been criticized by Arjun Aggarwal for Devanagari,
> on the ground that the elements of a Unicode Devanagari string do not
> correspond to the graphic elements of Devanagari text.
>
> Several people have explained in detail how this is not an acceptable
> criticism, because Unicode code points are NOT supposed to be displayed with
> a direct one-to-one mapping to glyphs.
>
> I think that this criticism was addressed adequately, for what concerns the
> ENCODING part, and that it is now Mr. Aggarwal's turn to make an effort to
> understand better what he is criticizing.
>
>
> However, I think that only considering the encoding point of view does not
> catch the real reasons behind the discontent that is periodically expressed
> by Indian users and engineers.
>
> It has always been my impression that, for a native user of Indic scripts,
> it is much more natural to work with visual glyphs.
>
> Why shouldn't it be so? When you write "Arjun" with a pencil, you trace:
> <a>, <j->, <danda>, <-u>, <repha>, <n->, <danda>, exactly in this order.
>
> Who cares if, from the lexicographic point of view, <j-> plus <danda>
> constitutes a unit? Who cares if, from the phonological point of view, <repha>
> is pronounced before <j->? Who cares if, from a logical point of view, <repha>
> is a <ra> plus a virtual "virama"?
>
> Yet, from the graphical point of view, that name is spelled using that
> sequence of *glyphs*.
>
> Similarly, what the users see on a computer screen are *glyphs*, not
> abstract characters. Consequently, they should be enabled to interact with
> (enter, modify, delete) the *glyphs*.
>
> How can users be asked to enter, modify or delete objects (such as "virama",
> "ZW(N)J") which are not visible and tangible on the screen? Or how can they
> be asked to interact with an entity which is in a certain position,
> pretending that it was somewhere else (repha, short i matra)? And why
> should it be forbidden to edit visible and tangible objects (such as the
> "danda" at the right side of many letters) on the basis that "logically they
> do not exist"?
>
>
> See the difference between the name "Arjun" as coded (A) in terms of Unicode
> characters, and as rendered (B) in terms of glyphs (for a visual
> representation of this example, see the attached file ARJUN.GIF):
>
> (A) a ra virama ja -u na
> (B) a j- danda -u repha n- danda
>
> Unicode requires that form (A) is converted to form (B) before being displayed.
> This process is called "rendering" and, for Devanagari, it could be
> summarized in four logical steps:
>
> 1: Convert character codes into "glyph codes";
> 2: Join some glyphs (e.g.: turn ra + virama into repha);
> 3: Reorder some glyphs (e.g.: move repha to its visual position);
> 4: Split some glyphs (e.g.: turn full C's into half C's + danda)
>
> (Notice that this is a very schematic algorithm, and that actual
> implementations can vary considerably; especially points 1 and 4 may be
> dropped.)
>
> In the case of "Arjun", the four steps perform the following changes (see
> again ARJUN.GIF):
>
> 1: a ra virama ja -u na
> 2: a repha ja -u na
> 3: a ja -u repha na
> 4: a j- danda -u repha n- danda
>
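
[The four rendering steps quoted above can be sketched in Python. This is only an illustration under stated assumptions, not an actual rendering engine: glyphs are plain strings using the names from the message, the consonant, matra and half-form tables are hypothetical and cover just this example, and step 1 is an identity mapping.]

```python
# Toy tables (assumptions for this example only, not a real font encoding).
CONSONANTS = {"ra", "ja", "na", "la"}
MATRAS = {"-u"}
DANDA_SPLIT = {"ja": "j-", "na": "n-", "la": "l-"}  # full C -> half C (+ danda)


def to_glyph_codes(chars):
    """Step 1: convert character codes into glyph codes (identity here)."""
    return list(chars)


def join_repha(glyphs):
    """Step 2: turn ra + virama into repha when a consonant follows."""
    out, i = [], 0
    while i < len(glyphs):
        if (glyphs[i] == "ra" and i + 2 < len(glyphs)
                and glyphs[i + 1] == "virama"
                and glyphs[i + 2] in CONSONANTS):
            out.append("repha")
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def reorder_repha(glyphs):
    """Step 3: move repha after the following consonant and its matras."""
    out, i = [], 0
    while i < len(glyphs):
        if glyphs[i] == "repha":
            j = i + 1
            if j < len(glyphs) and glyphs[j] in CONSONANTS:
                j += 1
            while j < len(glyphs) and glyphs[j] in MATRAS:
                j += 1
            out.extend(glyphs[i + 1:j])
            out.append("repha")
            i = j
        else:
            out.append(glyphs[i])
            i += 1
    return out


def split_danda(glyphs):
    """Step 4: turn full consonants into half consonant + danda."""
    out = []
    for g in glyphs:
        if g in DANDA_SPLIT:
            out.extend([DANDA_SPLIT[g], "danda"])
        else:
            out.append(g)
    return out


def render_steps(chars):
    """Return the glyph list after each of the four steps."""
    results, glyphs = [], chars
    for step in (to_glyph_codes, join_repha, reorder_repha, split_danda):
        glyphs = step(glyphs)
        results.append(glyphs)
    return results


def render(chars):
    """Return only the final visual glyph sequence."""
    return render_steps(chars)[-1]


for number, glyphs in enumerate(render_steps(["a", "ra", "virama", "ja", "-u", "na"]), 1):
    print(f"{number}: {' '.join(glyphs)}")
# 1: a ra virama ja -u na
# 2: a repha ja -u na
# 3: a ja -u repha na
# 4: a j- danda -u repha n- danda
```

[Each step is kept as a separate function so that the trace matches the numbered lines in the quoted example; a real implementation would work on font glyph IDs and handle many more cases.]
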
>
> So far so good: I see "Arjun" on the screen.
>
> But what if now I want to change "Arjun" into, say, "Aljun"? By the
> "logical" point of view, I should simply delete the ra and enter a la in the
> same position.
>
> But, on my screen, there is no ra at all! Moreover, there is no consonant at
> all before the ja, because the group ra+virama is displayed as a combining
> repha AFTER the j+danda+u group.
>
> Looking at the screen, the natural thing to do is to move to the repha and
> delete it, then move between the a and the ja and insert a half la.
>
> In order to accomplish a WYSIWYG editing of this kind, Unicode text should
> first be converted to a TEMPORARY INTERMEDIATE FORM, less "logical" and
> more "visual".
>
> In the case of Devanagari, a glyphic representation quite similar to the old
> "font encodings" should be used. With such an intermediate code, the user
> should be enabled to select and delete the danda of a letter to form a half
> letter, to enter or delete a matra i or a repha by placing the cursor in
> their visual position, and so on.
>
> The algorithm to convert Unicode to this intermediate glyphic representation
> already exists, and it is the four steps that I described above, which are
> now part of rendering engines and smart fonts.
>
> The difference is that this algorithm should be run BEFORE going into the
> visualization phase.
>
> The big difference is that editing actions should be executed on this
> intermediate code and, therefore, there is a need for a "DErendering"
> algorithm, which converts a portion of visual text back to real Unicode.
>
>
> A very similar thing was discussed months ago on this list about
> bidirectional editing. I find that the process of reversing Indic rendering
> is even easier than a "reverse bidi" algorithm.
>
> It is possible to DErender Devanagari text by running the same rendering
> algorithm listed before backwards and with reversed meanings:
>
> 4: Join some glyphs (e.g.: turn half C's + danda into full C's);
> 3: Reorder some glyphs (e.g.: move repha to its logical position);
> 2: Split some glyphs (e.g.: turn repha into ra + virama);
> 1: Convert glyph codes into character codes.
>
> In the case of "Arjun", the four steps perform the following changes (see
> again ARJUN.GIF, reading the four points from bottom to top):
>
> 4: a j- danda -u repha n- danda
> 3: a ja -u repha na
> 2: a repha ja -u na
> 1: a ra virama ja -u na
>
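
[The "DErendering" direction, together with the "Arjun" to "Aljun" edit discussed earlier in the quoted message, can be sketched the same way. Again this is only a toy illustration: glyphs are plain strings, the tables are hypothetical, and step 1 is an identity mapping. One rule is an assumption added here and not taken from the quoted step list: a half consonant left without its danda is written back as consonant + virama.]

```python
# Toy tables (assumptions for this example only, not a real font encoding).
CONSONANTS = {"ra", "ja", "na", "la"}
MATRAS = {"-u"}
FULL_FORMS = {"j-": "ja", "n-": "na", "l-": "la"}  # half C -> full C


def join_danda(glyphs):
    """Step 4 reversed: turn half consonant + danda into a full consonant."""
    out, i = [], 0
    while i < len(glyphs):
        if (glyphs[i] in FULL_FORMS and i + 1 < len(glyphs)
                and glyphs[i + 1] == "danda"):
            out.append(FULL_FORMS[glyphs[i]])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def restore_repha(glyphs):
    """Step 3 reversed: move repha before the consonant + matras it follows."""
    out = []
    for g in glyphs:
        if g == "repha":
            j = len(out)
            while j > 0 and out[j - 1] in MATRAS:
                j -= 1
            if j > 0 and out[j - 1] in CONSONANTS:
                j -= 1
            out.insert(j, "repha")
        else:
            out.append(g)
    return out


def split_repha(glyphs):
    """Step 2 reversed: turn repha into ra + virama."""
    out = []
    for g in glyphs:
        if g == "repha":
            out.extend(["ra", "virama"])
        elif g in FULL_FORMS:
            # Assumption: a remaining half form stands for consonant + virama.
            out.extend([FULL_FORMS[g], "virama"])
        else:
            out.append(g)
    return out


def to_char_codes(glyphs):
    """Step 1 reversed: convert glyph codes into character codes (identity)."""
    return list(glyphs)


def derender(glyphs):
    """Run the four reversed steps, from visual glyphs back to characters."""
    for step in (join_danda, restore_repha, split_repha, to_char_codes):
        glyphs = step(glyphs)
    return glyphs


visual = ["a", "j-", "danda", "-u", "repha", "n-", "danda"]
print(" ".join(derender(visual)))
# a ra virama ja -u na

# The visual edit described in the message: delete the repha,
# then insert a half la between the a and the ja.
edited = [g for g in visual if g != "repha"]
edited.insert(1, "l-")
print(" ".join(derender(edited)))
# a la virama ja -u na
```

[The edit is performed purely on the visual sequence; only the derendering pass needs to know how the result maps back to logical order.]
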
> Notice that such an intermediate code can however be slightly MORE abstract
> than a mere list of glyph variants: tiny and insignificant variations (such
> as the different heights or sizes of combining glyphs, or the choice of
> ligatures that are not strictly mandatory) may still be left to smart fonts
> to handle.
>
>
> Just my 0.02 euros.
>
> _ Marco
>
>

     Regards,
          Jim Agenbroad ( jage@LOC.gov )
     The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.


This archive was generated by hypermail 2.1.2 : Mon Nov 26 2001 - 13:05:18 EST