Re: Normalization Form KC for Linux

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Aug 27 1999 - 20:08:45 EDT


Kenneth Whistler wrote on 1999-08-27 22:33 UTC:
> But in the Unicode world this does not work. You have to architect
> the layers:
>
> Layer 1: Map the plain text characters into a rendering space (implies
> smarts about scripts, a non one-to-one character to glyph
> mapping, information about the font metrics, and bidi layout).
>
> Layer 2: Embed the glyph vectors into the control code framework for
> terminal control and cursor positioning.
>
> Layer 1 is host business entirely. Only there do you have access to
> the plain text store and a sufficient model of the text to do the
> right thing.
>
> Layer 2 can be modeled on the current terminal control protocols. You
> just need to be aware of the fact that you are dealing with glyph
> codes that map into the terminal display fonts -- *not* with characters.

We certainly agree that Unicode requires a (for some scripts) somewhat
non-trivial processing step between the memory representation and the
glyph sequence that finally shows up on the screen. We will probably
end up with more GUI-like libraries (comparable to, say, the [n]curses
library) that sit between the host application and the terminal
emulator and keep track of things like the cursor split necessary for
bidi rendering, etc. I increasingly get the feeling that handling
Hebrew, Arabic, and the various Indic scripts will not be feasible by
extending the terminal semantics alone in a way that still allows us to
simply dump the memory representation to the terminal with printf() and
have the terminal magically sort everything out in real time. I can
well imagine that this works for combining characters (which have
fairly simple semantics: all the state required to interpret a
combining character is, after all, just the cell coordinates of the
last printed character, which can easily be saved), but I already have
doubts for Arabic and most certainly for the Indic scripts.
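
To illustrate the combining-character case, here is a minimal sketch
(not taken from xterm or any real emulator) of the state described
above: the terminal only needs to remember the cell of the last
printed base character. The cell structure, the 80-column limit, and
the crude is_combining() test covering only U+0300-U+036F are
illustrative assumptions.

#include <wchar.h>

#define COLS 80

struct cell {
    wchar_t base;        /* base character shown in this cell          */
    wchar_t combining;   /* at most one combining mark, for simplicity */
};

/* Crude test: only the combining diacritical marks block, U+0300-U+036F. */
static int is_combining(wchar_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Write character c into the row, advancing *col only for base characters;
 * a combining character is attached to the previously printed cell. */
void emit(struct cell row[COLS], int *col, wchar_t c)
{
    if (is_combining(c)) {
        if (*col > 0)
            row[*col - 1].combining = c;
    } else if (*col < COLS) {
        row[*col].base = c;
        row[*col].combining = 0;
        (*col)++;
    }
}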

The answer will simply be that the traditional dump-memory-to-terminal
printf() applications (cat, echo, etc. are typical trivial
representatives of this class) will not work with such scripts.

However, we can make a large number of scripts accessible under Unix
in the old non-layered model rather easily, and there is no reason not
to do so. It would in my opinion be a fatal mistake to stay with ISO
8859 instead of UTF-8 just because we currently shy away from writing
a full Devanagari renderer for xterm.

My reasons for staying with precomposed characters in the Unix non-GUI
environment for quite some time are:

  - The current font infrastructure does not provide the glyph
    annotations necessary for automatically placing combining
    characters well. We therefore have to work with precomposed glyphs
    and would have to perform a Normalization Form C step in display
    routines anyway (see the first sketch after this list).

  - Many applications have to count characters in strings. This is
    trivial with both ISO 8859-1 (count bytes) and UTF-8 (count bytes,
    except those in the range 0x80-0xBF), but it becomes more
    complicated and requires table lookups once combining characters
    are introduced (see the second sketch after this list). We cannot
    expect all applications to change overnight to more sophisticated
    UI access techniques, and there will be heavy resistance if we
    take away beloved simple output methods such as printf().

  - There is no immediate advantage to using combining characters.
    They require more storage, have (at the moment) to be recomposed
    before display anyway, and arguably save only a few CPU cycles in
    algorithms such as collation.
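
To make the first point concrete, here is a minimal sketch (not any
existing library's API) of the Normalization Form C step a display
routine would need: map a base character plus a combining mark back to
the precomposed character that the font actually contains. The
two-entry table and the compose() name are purely illustrative; a real
implementation would use the full Unicode composition data.

#include <stddef.h>
#include <wchar.h>

struct composition {
    wchar_t base, mark, precomposed;
};

/* Illustrative excerpt only; the real table comes from UnicodeData.txt. */
static const struct composition table[] = {
    { L'a', 0x0308, 0x00E4 },   /* a + combining diaeresis -> ä */
    { L'o', 0x0308, 0x00F6 },   /* o + combining diaeresis -> ö */
};

/* Return the precomposed character, or 0 if the pair has no composition. */
wchar_t compose(wchar_t base, wchar_t mark)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].base == base && table[i].mark == mark)
            return table[i].precomposed;
    return 0;
}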
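
And for the second point, a minimal sketch of the UTF-8 counting rule
mentioned above: every byte in the range 0x80-0xBF is a continuation
byte, so counting characters means counting all other bytes. The
function name utf8_strlen is made up for this example.

#include <stddef.h>

/* Count UTF-8 characters by skipping continuation bytes (0x80-0xBF). */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char) *s;
        if (c < 0x80 || c > 0xBF)
            count++;
    }
    return count;
}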

More philosophical (and therefore more fun to discuss :):

  - I also fail to see why a decomposed form should be in any way more
    natural. I see the decomposed form more as a technically necessary
    brief intermediate step for rendering fonts that achieve
    compression by storing commonly occurring glyph fragments (e.g.,
    the base glyphs and accents, hooks, descenders, etc.) separately
    and combining them only on demand at rendering time. The choices
    about which glyph components (and yes, we are talking about glyphs
    and not characters here) deserve to become Unicode characters in
    their own right do not appear very systematic to me and seem more
    influenced by historic perception than by a clean technical
    analysis. I have to agree with the argument that there is no
    reason why "ä" can be decomposed into a + ¨, but "i", "j", ";",
    and ":" cannot be decomposed into a sequence with a dot-above
    combining character. After all, all of them also exist without the
    dot above, and many also with other things above (ì, í, î, ï). Why
    isn't Q represented as an O with a lower-left stroke? Because all
    these precomposed characters have simply stopped being perceived
    as composed by those who designed Unicode and its predecessors
    (ASCII, Baudot, Morse, etc.). Nevertheless, G is historically a C
    combined with a hook, W is two Vs (or Us) with a negative space in
    between, + is just a "not -" and therefore crossed out, $ = S + |,
    and @ is just an "a" in a circle. It would only be fair to
    decompose ASCII before you start treating the ä as a second-class
    citizen. :)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


