RE: Suggestions in Unicode Indic FAQ

From: Kent Karlsson (kentk@md.chalmers.se)
Date: Mon Feb 03 2003 - 12:29:47 EST

    > > No, with proper reordering (and "normal" display mode), the e-matra at
    > > the beginning of the second word would appear to be last glyph of the
    > > first "word". Similarly, for the second case, the e-matra glyph would
    > > have come to the left of the pa. The fluent reader (ok, not me...)
    > > would then see those errors anyway, just like I can find spelling
    > > errors in Swedish, most often without any kind of special marking. (I'm
    > > assuming throughout that reordrant combining characters are reordered.)
    >
    > Illegal sequences

    There are no illegal sequences.

    > are not reordered as you indicated.

    Then that is a problem with the display software you are using.

    > Also, as far as I
    > know there is no mention of reordering of illegal input sequence (or
    > invalid combining mark) in Unicode standard.

    Again, there are no "illegal input sequences".

    > Consider the last set of glyphs (left-to-right, top-to-bottom) in the
    > attached image. It is the rendering effect of illegal input sequence

    See above.

    > "Devanagari Vowel Sign I" [U+093F] + "Devanagari Letter Ka"
    > [U+0915] and without any dotted circle.

    Let's see if I understand you. <093F, 0915> is the input. Since
    093F is a combining character, one should (not must, but should)
    treat this *as if* the input were <0020, 093F, 0915>. Since 093F
    is also reordrant, one must reorder it to before the preceding base
    character (at least; further back for consonant clusters), so the
    output glyphs would be <<glyph for 093F, space, glyph for 0915>>.
    (But your image does not show that.)
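
    The treatment described above can be sketched in a few lines of
    Python. This is only an illustration under stated assumptions: the
    names PRE_BASE_MATRAS and visual_order are hypothetical, only
    U+093F is handled, and consonant clusters are ignored, so it is
    not how any real rendering engine is implemented.

```python
import unicodedata

# Assumption: only this one reordrant vowel sign is handled here.
PRE_BASE_MATRAS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I
SPACE = "\u0020"


def is_combining(ch):
    """True for nonspacing, spacing, and enclosing combining marks."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def visual_order(text):
    """Return the characters of text in simplified display order."""
    chars = list(text)
    # A combining character with no base before it is treated
    # as if it were preceded by SPACE.
    if chars and is_combining(chars[0]):
        chars.insert(0, SPACE)
    out = []
    for ch in chars:
        if ch in PRE_BASE_MATRAS and out:
            # Move the matra before the immediately preceding character
            # (simplification: a real engine would move it before the
            # whole consonant cluster).
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return out


leading_matra = visual_order("\u093F\u0915")  # matra typed first
normal_order = visual_order("\u0915\u093F")   # consonant typed first
```

    Under these assumptions, leading_matra contains a space between
    the two glyphs and normal_order does not, which is the visible
    difference discussed in this message.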

    > As you may know, the correct input
    > sequence should be U+0915 followed by U+093F.

    That would be a different input (whether that is correct or
    not depends on the author's intent).

    > In that case the result would
    > have been similar to what appears right now.

    Similar ONLY if you disregard the space "glyph" that should
    have been there.

    > (Though some more
    > sophisticated font/application may want to replace the glyph
    > appearing for U+093F with some other glyph with a proper
    > attachment point).

    That may be.

    > Now there is no way that user can identify this illegal input sequence
    > without dotted circle.

    Yes, there is. Don't disregard the space "glyph".

    > In the worst case even this rendered glyph is
    > attached to the character from a class (for example, consonant
    > cluster of "Ka" "Virama" "Ma") for which the glyph has been
    > designed to render with.
    > In such case even a fluent reader can not identify the error.
    >
    > >
    > > There are spelling errors, yes. But there are other ways of indicating
    > > spelling errors, that are (by now) fairly conventional for any language
    > > (as long as there is an appropriate dictionary installed), and that also
    > > are more general (in catching more spelling errors) and less obtrusive
    > > (the author really wants to write it that way, for some reason).
    > >
    > > > Apparently, Michka used a non-OpenType Bengali Unicode font when
    > > > he embedded the fonts into the page. As long as you are looking
    > > > at the page on-line, with the embedded fonts, these errors are
    > > > invisible.
    > > >
    > > > It may be typographically horrible. It *should* be typographically
    > > > horrible in order to illustrate bad sequences clearly.
    > >
    > > I'd prefer little red wiggly lines under the word, or yellow background
    > > or some such (just for screen display, not for printing; screen grabs
    > > not counted). And that for any spelling "error".
    >
    > Spelling mistakes can be categorized into two different classes.

    ???

    > One
    > arising from illegal input sequence (e.g., Vowel Sign E as the first
    > character in a word)

    There are no illegal input sequences.

    > and the other one is legal input sequence with no
    > contextual meaning in the dictionary.

    A simple spell checker just checks if the word is in the
    dictionary or not (without worrying about the context).
    That would catch what you call "illegal input sequences" too.
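
    A minimal sketch of such a context-free check, in Python. The
    function name misspelled and the tiny word list are hypothetical
    and chosen only for illustration; a real spell checker would also
    handle punctuation, inflection, and so on.

```python
def misspelled(text, dictionary):
    """Return the whitespace-separated words of text not in dictionary."""
    return [word for word in text.split() if word not in dictionary]


# Hypothetical one-word dictionary: <0915, 093F>.
words = {"\u0915\u093F"}

# The first word starts with a vowel sign, the second is in the dictionary.
flagged = misspelled("\u093F\u0915 \u0915\u093F", words)
```

    Here flagged holds only the word that starts with the vowel sign.
    It is reported because it is absent from the dictionary, without
    any separate notion of an "illegal" sequence.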

    > While indication of the second type
    > of mistake is generally used only in sophisticated
    > applications like word processor,

    Why? There is nothing in principle preventing a spell checker
    from being used in a "plain text" editor.

    > everyone wants to know the first kind of mistake.

    Without a spell checker, but with proper rendering, spelling
    errors can be detected by a fluent reader, since they look
    different even without any dotted circles. For some ambiguous
    Indic cases, like a prefix matra, consonant, postfix matra, all
    possible character sequences for them are misspellings (as far
    as I know).

    > With your
    > explanation it seems that even plain text editor is not
    > useful at all to identify such common typing mistakes!

    Consider English. If I write "nnnn", that may well be a spelling
    error. Do I deserve to have the rendering of that string littered
    with dotted circles just because a sequence of four n's "has to"
    be a spelling error?

                    /Kent K

    > - Keyur


