(Fwd) Comments on BiDi in HTML/i18n draft

From: Robert N. Goldrich (bobg@accentsoft.com)
Date: Mon May 20 1996 - 10:08:25 EDT


[Please note that this message was sent to the html-wg but
that my CC field ran out of room!]

I have been following the progress of the HTML working group for
some time now, especially with regard to the issues concerning
the i18n draft by F. Yergeau, G. Nicol, G. Adams, and M. Duerst.
I have also had the opportunity to chat with both Glenn (Adams)
and Martin (Duerst) in person regarding various aspects of the
draft.

At this point, as someone involved in the implementation of a
Unicode Web Browser and multilingual HTML authoring tool which
supports BiDi (Hebrew and Arabic) and many other scripts
(www.accentsoft.com), I would like to raise some points
regarding the i18n draft. My comments are broken up by subject.

I. Language Marking
-------------------
The i18n draft recommends that language be marked:
    1. at the document level, using information from the HTTP
        "Content-Language" field, or, presumably, its <META>
        equivalent inside the document.

    2. at the block level via the use of the LANG attribute
        in block HTML tags, such as <P>, <H1>, etc.

    3. at the character level using the newly proposed <SPAN>
        tag.

Comments:
    1. Document level marking is fine. This would be the
        default for the entire page.

    2. We see no need for, or benefit from, language markings
        specified at the block level. Even the HTML 3 draft
        includes the LANG attribute on far too many tags.
        Language is a word-level -- or even character-level --
        marking. While it might be useful to mark a block of
        text as a single unit, the same can be accomplished by
        enclosing as much text as you want in a character-level
        tag, such as <SPAN>.

        Do we really need three "granularities" of language
        marking, given that most other HTML markups can be
        expressed in only one way?

    3. The HTML 3.0 spec calls for a <LANG> tag. Ostensibly,
        the syntax for using this would be:
            <LANG LANG="en">
        What is the disposition of this tag? Though somewhat
        awkward in appearance, we implemented it in our browser
        since it was part of the HTML 3 draft. This is
        something that could be replaced easily with the
        proposed <SPAN> tag, but unless it is taken out of HTML
        3, the use of <SPAN> for language marking is not
        necessary.

    4. The separator between the language and "ethnologue" is a
        period in the HTML 3 draft, while it is a "dash" in the
        i18n draft. Which one is it? Both?
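
A minimal sketch of how a reader of LANG values might cope with
the separator question in item 4 until it is settled. It assumes
a tolerant parser that accepts either the period or the dash; the
function name split_lang is hypothetical and is not taken from
either draft.

```python
import re
from typing import Optional, Tuple


def split_lang(value: str) -> Tuple[str, Optional[str]]:
    """Split a LANG value into (language, qualifier).

    Assumption: both "." (HTML 3 draft) and "-" (i18n draft) are
    accepted as the separator, and case is not significant.
    """
    parts = re.split(r"[.-]", value.strip(), maxsplit=1)
    language = parts[0].lower()
    qualifier = parts[1].lower() if len(parts) > 1 and parts[1] else None
    return language, qualifier


# Both spellings yield the same pair under this assumption.
period_form = split_lang("en.uk")
dash_form = split_lang("en-uk")
```

Accepting both forms on input costs little; which form an
authoring tool should write out remains the open question.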

II. Direction of Text
---------------------
Here we are in basic agreement with the i18n draft regarding
the use of the DIR attribute. We understand its use as follows:
    1. When used in the <HTML> tag, the value specifies the
        default direction (also called reading order) of the
        document.

    2. The direction of a block of text can be specified
        explicitly by using DIR as an attribute of a block
        container tag such as <P>, etc.

    3. The direction of individual characters can be set by
        using DIR inside the proposed <SPAN> tag.

All of this is good; however, we have the following items to
add:

    1. The <TABLE> tag should also accept DIR. The first cell of
        a right-to-left table (used in Hebrew, Arabic) would be
        in its upper right-hand corner. The use of DIR here is
        a must.

    2. The <UL> and <OL> tags need DIR to specify on which
        side the bullets or numbering is to appear. This is not
        the same as the alignment of the list. Again, the use
        of DIR in list tags is required.

    3. By default, DIR="rtl" text blocks should be aligned
        right.
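
The rules in this section can be sketched roughly as follows.
This is an illustration only: the Element class and the function
names are hypothetical, and it assumes that DIR is inherited from
the nearest enclosing tag that sets it and that an explicit ALIGN
overrides the default alignment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical, simplified element node (not a real parser API)."""
    tag: str
    dir: Optional[str] = None     # value of the DIR attribute, if present
    align: Optional[str] = None   # value of the ALIGN attribute, if present
    parent: Optional["Element"] = None


def effective_dir(el: Element, default: str = "ltr") -> str:
    """Nearest DIR on the element or an ancestor (assumed inheritance)."""
    node: Optional[Element] = el
    while node is not None:
        if node.dir:
            return node.dir.lower()
        node = node.parent
    return default


def effective_align(el: Element) -> str:
    """Explicit ALIGN wins; otherwise rtl blocks default to the right."""
    if el.align:
        return el.align.lower()
    return "right" if effective_dir(el) == "rtl" else "left"


def marker_side(list_el: Element) -> str:
    """Side for bullets or numbers; follows DIR, not ALIGN."""
    return "right" if effective_dir(list_el) == "rtl" else "left"


def first_cell_corner(table_el: Element) -> str:
    """Corner in which the first cell of a table is placed."""
    return "upper right" if effective_dir(table_el) == "rtl" else "upper left"


# A Hebrew document: <HTML DIR="rtl"> containing a <P> and a <UL ALIGN="left">.
html = Element("HTML", dir="rtl")
para = Element("P", parent=html)
ulist = Element("UL", align="left", parent=html)
```

Here the paragraph is aligned right by default, while the list
keeps its explicit left alignment but still places its bullets on
the right, which is the distinction drawn in item 2.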

III. BiDi Issues
----------------
    1. No BiDi layout should be performed on text marked with
        the <PRE> tag.

IV. Character Set Identification
--------------------------------
Here we agree on all the methods for identifying the character
set of an HTML document; however, we feel the order of
preference for obtaining this information should be:
    1. From a <META> tag embedded in the document itself.
    2. From the HTTP header.
    3. From link semantics (though since links, URLs, etc.
        change so often we feel that this is of limited use).
    4. From the byte order mark in UCS-2 encoded files.
    5. Any other heuristic for identifying character set.
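
A rough sketch of this order of preference, for illustration. All
function names are hypothetical. The <META> scan assumes an
ASCII-compatible encoding and looks only at the first 1024 bytes
(an arbitrary limit), and the label "ISO-10646-UCS-2" returned
for a byte order mark is likewise an assumption.

```python
import re
from typing import Callable, Optional

_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_HEADER_RE = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)


def charset_from_meta(body: bytes) -> Optional[str]:
    """1. A charset named in a <META> tag near the start of the document."""
    match = _META_RE.search(body[:1024])
    return match.group(1).decode("ascii") if match else None


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """2. The charset parameter of the HTTP Content-Type header."""
    if not content_type:
        return None
    match = _HEADER_RE.search(content_type)
    return match.group(1) if match else None


def charset_from_bom(body: bytes) -> Optional[str]:
    """4. A UCS-2 byte order mark in either byte order."""
    if body.startswith(b"\xfe\xff") or body.startswith(b"\xff\xfe"):
        return "ISO-10646-UCS-2"
    return None


def detect_charset(
    body: bytes,
    content_type: Optional[str] = None,
    link_charset: Optional[str] = None,   # 3. supplied by the referring link
    heuristic: Optional[Callable[[bytes], Optional[str]]] = None,  # 5.
) -> Optional[str]:
    """Return the first charset found, in the order proposed above."""
    candidates = (
        charset_from_meta(body),
        charset_from_header(content_type),
        link_charset,
        charset_from_bom(body),
        heuristic(body) if heuristic else None,
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

With this ordering, a document whose <META> tag names ISO-8859-8
is treated as ISO-8859-8 even if the server's header says
ISO-8859-1.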

V. Conclusion
-------------
While the i18n draft is basically a sound document, the issues
raised above deserve consideration for inclusion in the i18n
section of the HTML standard. Your constructive feedback will
be most appreciated.

Bob Goldrich
Accent Software Intl.
28 Pierre Koenig St.
Jerusalem, Israel 91530
bobg@accentsoft.com
Tel: +972-2-793-723
Fax: +972-2-793-731
----------------------
Robert N. Goldrich
bobg@accent.co.il


