Re: Bidi in HTML

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Mon May 13 1996 - 08:20:48 EDT


Jonathan Rosenne wrote:

>First, I would like to apologize for joining the discussion so late.
>Most of the standards people in Israel were not aware of the draft
>until very recently, and it was a crash project to get the comments
>out.

The draft is the work of the ietf working group on html. Most standards
(in the ISO sense) people are not very aware of ietf work anyway :-(.
The ietf working groups, in contrast to other institutions, are
really pure volunteer groups, and so they frequently have no time
to advertise their work.
However, both Glenn Adams and Francois Yergeau have made presentations
about HTML and i18n on the web last September at the Unicode conference
in San Jose, and for anybody really interested in Unicode not aware of that
work after that conference, there is not much of an excuse.

Anyway, the ietf also has the principle that any actual or perceived issues
should be resolved by finding good technical solutions acceptable to
a wide range of users, and welcomes any kind of technical comments
at any time.

>>When designing BIDI for HTML, we made every attempt to not deviate
>>from the Unicode standard, both in its wording and in its essence.
>>And as far as I can see, we indeed did not deviate from it.
>>And we definitely agree that we don't need different standards.
>
>I believe you did deviate, and will try to show where.

We did not. I explain why below :-).

>Main items:
>
>>- Adding formatting characters so that the finally rendered text
>> looks as desired can lead to very strange display of raw
>> HTML in a bidi-aware editor!
>
>There are simple solutions to this.

You make reference to several things below that could be
interpreted as solutions, but none of them is satisfactory.
Can you be more specific?

>>- Formatting character pairs such as RLO-PDF LRE-PDF and so on
>> could interlace with the markup structure, which is not
>> desirable.
>
>There is no interaction between the text and the markup. The bidi
>algorithm applies only to the text.

Does this mean that I should be allowed to write (in logical order, with
RTL block directionality)
HEBREW1 <q>english1 &LRE; english2</q> HEBREW2&PDF; HEBREW3

even in a strictly conforming document? (That <B><I></B></I> is not
conforming but still accepted by many browsers (with different results)
is a different topic, which can definitely not be resolved in our draft.)

The above will look like (uniform stretches not inverted):
HEBREW1 <q>english1 &LRE; english2</q> HEBREW2&PDF; HEBREW3

HEBREW3 english2" HEBREW2 english1" HEBREW1
if bidirectional embedding is considered only on the text level, with
no interaction with markup. But definitely, in the case of the <Q> markup,
this interacts with the texts (by adding language-appropriate quotation
marks).
It may be argued that the above example is close to nonsense,
but it may very well appear, and because there are means to eliminate
such pathological cases (e.g. with SGML parsing), we should provide
the possibility to actually do so.

>>- The most straightforward implementation is to add the necessary
>> formatting characters to the text and use a Unicode-compliant
>> text "object". Absolutely no implementation change is necessary
>> at the rendering/display level.
>
>The most straightforward implementation is to add nothing to a
>Unicode text and use a Unicode compliant browser.

You don't have to add any processing to a raw Unicode text to
view it in a Unicode text editor. But for an HTML document,
you have to parse it anyway. And there is no such thing as
an "Unicode compliant browser" for the moment, as browsers
include much more than plain text rendering, and Unicode,
for good reasons, does not specify anything in this area.

>>- Formally speaking, Unicode allows supplementation or overriding
>> of some directional characters by higher-level protocols.
>
>This is a misunderstanding of the Unicode specification.

I don't see any kind of misunderstanding in our interpretation,
but quite some wishful thinking and implicit restrictions in
your interpretation. For a discussion, please see below.

>>For HTML (not necessarily for SGML), we have to assume that
>>quite some part of the production is written directly with
>>a raw text editor, for which we will assume standard BIDI
>>support, but not necessarily any knowledge about HTML.
>
>1. Using a standard bidi text editor works fine - I tried it with
>Hebrew MS Word, Dagesh (The Hebrew version of Accent) and MS Notepad.
>(They are not Unicode compliant, but similar enough in this respect).
>The only problem is the English markup looks a bit strange. Of course
>one has to set the basic direction to right to left.

"A bit strange" may be what you call it. Other users may well
call it unacceptable. Such things have to be tried out on users
who don't understand the issues involved. Programmers and
standard designers usually are too ready to accept inadequacies
if they understand why they are produced.

>2. A better alternative is to use Hebrew markup, then run it through
>a post-processor or macro that converts the Hebrew tags to English.
>This way it looks right when editing the document, and it's a lot
>easier to type - one doesn't need to switch languages all the time.
>This would be straight one-to-one replacement of English tags with
>Hebrew translations.

This may help, but some problems remain:
1) Because many attributes, in particular URLs, may still be plain ASCII,
        you may still easily meet cases where <A HREF="x.html"> is
        "converted" to <"A HREF="x.html>.
2) As you need a post-processor anyway, it can as well change
        in-line formatting characters to appropriate markup if the
        former is what you want to work with while editing (or it can
        eliminate some formatting characters, such as a LRM between
        the closing '"' and '<', which was used to get the '"' to the correct
        place during the editing stage).

>3. It is safe to assume that sooner or later we will have native bidi
>HTML editors.

Definitely. But presently, we don't, and not everybody will always
have one. Also, there are many different models for HTML editors,
from "WYSIWYG" editors that try to issue HTML to very structured
SGML-based editors. What may look obvious for some application
domain or editor implementation turns out to be completely
exotic for another model.

>4. In any case, I don't think it's a good thing to base the whole
>standard for years to come on the basis of helping users use
>inappropriate tools, especially as there are immediate practical
>solutions (see items 1 and 2 above).

The immediate solutions are not really satisfying (see above).
The standard is indeed designed to help users with not much
tool support (I think this is generally something ietf groups
care about), but it is also designed in a nicely structured way
that should have no problem standing the test of time.

Embedding directionality has to be attacked and solved
on the character level for raw text, and to assure that
rendering engines support it. But if you define something
as a short quote (<Q>) in a text structure, I guess you would
rather prefer to tell your tool that it is the quote that is
directionality-embedded, than to specify embedding for
the characters in the quote separately.
Although this might not be the only way of doing things,
we should at least not prohibit it!

>>This is not a deficiency of the Unicode BIDI algorithm
>>as such; it is due to the fact that we are working on a
>>META-level (i.e. describing text with text) instead of
>>working on the plain text level.
>
>The meta-level is not text and does not participate in the bidi
>rendering process.

There are two things:
1) The authoring/editing and the general aspect of HTML as
        a text (as which it appears in most books about HTML,...).
2) The browsing/rendering.

What you say above and below applies to 2), where the markup
is parsed and not shown to the user (apart from "view source").
But it does not apply to 1).

>The correct process is in principle as follows: The meta level
>process analyses the text and extracts a plain character string for
>the element (paragraph or block, not line). For HTML it means
>ignoring line ends and whitespace etc. and removing the tags. In
>parallel, the process remembers the properties of each character,
>such as font, size and color or to what tag they belong. The bidi
>algorithm produces a physical rendering, assigning each character its
>place in the re-ordered text. Each character keeps its attributes
>through the re-ordering process.
>
>Most bidi word processors work that way.
>
>So you can see that markup does not interfere with the bidi
>algorithm.

There is a difference (small or large depending on your implementation)
between a word processor and an HTML browser. But apart from
that caveat, mostly applying to the fact that HTML is somewhat more
structured than much other stuff, the description above applies
very much also to what is done in the case of our standard:

The HTML parser (meta-level process) analyses the markup and
extracts a plain character string for the block-type element.
As a small and easy-to-implement addition to the parser,
DIR attributes are converted to appropriate directionality
formatting characters. This is part of the process of
"extracting a plain character string".
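
The parser step described above can be sketched roughly as follows. This is
only a minimal illustration in Python, not the draft's normative procedure:
the class name DirToFormatting, the function extract_plain and the small
INLINE set are hypothetical, block-level handling is left out, and the
quotation marks that <Q> would add at rendering time are ignored.

```python
from html.parser import HTMLParser

# Unicode directional formatting characters.
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"

# Hypothetical, incomplete set of in-line elements for this sketch.
INLINE = {"q", "em", "span", "b", "i"}


class DirToFormatting(HTMLParser):
    """Collect text, turning DIR on in-line elements into LRE/RLE ... PDF."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []  # one (tag, had_dir) entry per open in-line element

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE:
            return
        # HTMLParser lowercases tag and attribute names, not values.
        direction = (dict(attrs).get("dir") or "").lower()
        if direction == "rtl":
            self.out.append(RLE)
        elif direction == "ltr":
            self.out.append(LRE)
        self.stack.append((tag, direction in ("rtl", "ltr")))

    def handle_endtag(self, tag):
        # Only properly nested end tags are handled in this sketch.
        if self.stack and self.stack[-1][0] == tag:
            _, had_dir = self.stack.pop()
            if had_dir:
                self.out.append(PDF)

    def handle_data(self, data):
        self.out.append(data)


def extract_plain(html):
    """Return the plain character string with DIR turned into characters."""
    parser = DirToFormatting()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because each PDF is emitted only when the element that produced the matching
RLE or LRE is closed, the formatting characters generated this way cannot
interlace with the markup structure.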

>> So if you have
>>a two-line embedded text delimited with RLE-PDF
>
>In addition to my previous comment, this is a misunderstanding of the
>meaning of the formatting codes. There are other examples in the
>subject message. While it is "legal" Unicode to place a formatting
>code in front of a large section of "visual" bidi text this was not
>the intention and is not proper use of these codes.

There must be some misunderstanding here. The code I mention
is RLE, which is used for embedding. "Visual" bidi would be done
by overriding, for which RLO has to be used.
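
For reference, the distinction between the embedding and overriding codes can
be checked against the Unicode character database. The sketch below uses
Python's standard unicodedata module; the FORMATTING table and the describe
helper are names made up for this illustration.

```python
import unicodedata

# The five explicit directional formatting characters discussed here.
FORMATTING = {
    "LRE": "\u202a",
    "RLE": "\u202b",
    "PDF": "\u202c",
    "LRO": "\u202d",
    "RLO": "\u202e",
}


def describe(ch):
    """Return (character name, bidirectional class) for one character."""
    return unicodedata.name(ch), unicodedata.bidirectional(ch)


for abbreviation, ch in FORMATTING.items():
    name, bidi_class = describe(ch)
    print(f"{abbreviation}: U+{ord(ch):04X} {name} (class {bidi_class})")
```

RLE is reported as an embedding and RLO as an override, which is the
distinction made above.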

>The Unicode bidi
>algorithm is based on the text being logically ordered and on
>implicit directionality. The formatting codes are intended to be used
>in exceptional cases and with a very local scope.

I agree. Embedding has to be used in the case where e.g. an
RTL quote is contained in a LTR quote which is itself contained
in an RTL paragraph. An author definitely has some interest to
keep these quotes short, and the level of embedding low, to
help her readers.
Overriding is necessary unfortunately in the case of part numbers
and the like.
In our draft, we give the necessary explanations (although we try
to be as short as possible, because the draft cannot be an introductory
text on bidi or any other i18n issue). If you want to suggest that
we add a notice to the effect that embedding should be used with
care, and overriding should not be misused for visual presentation,
then adding such a notice would probably not be a problem.

>Any other use, especially as a practical solution for the quick
>conversion of non-Unicode texts, should not be given a major weight
>in designing standards. Anyhow, there are programs that do a
>reasonable job of converting "visual" bidi to Unicode.

Backwards compatibility is quite often an issue. And it
is important to help people convert to Unicode to spread Unicode.
But this should definitely not be done by adding LRO-PDF to visual
text.

>>2. Formatting codes
>>
>>The proposed attribute is not ambiguous! Its semantics are
>>defined clearly for all elements.
>
>>A human author, when preparing some HTML text, might think
>>on a <BODY> "This is Right-to-left, so let's add a DIR",
>>she might similarly think on a paragraph "This is Left-to-right,
>>so let's add a DIR", and then again on a <EM> or <SPAN>
>>"This is Right-to-left, so let's add a DIR".
>
>On a block level element, this is OK. She is providing the basic
>direction of the block. In most cases it is superfluous, because the
>majority of block items have the same direction as the document. But
>on an "in-line" element she is doing something completely different:
>she is saying "although I am typing English letters, I want them
>displayed from right to left". I don't think you meant that.

We definitely don't mean that, and we clearly say that the DIR attribute
on in-line elements means embedding, not overriding.

Assume you have the following nesting of quotes:

SOME HEBREW <Q>some english <Q>SOME HEBREW INSIDE ENGLISH</Q>
more english</Q> MORE HEBREW.

Then you should mark this up as follows:

SOME HEBREW <Q>some english <Q DIR=RTL>SOME HEBREW INSIDE ENGLISH</Q>
more english</Q> MORE HEBREW.

Alternatively, you can also add a directionality attribute to the first quote, or to
both, and in the interest of clarity, adding it to both would be best. This
would also allow you to add arbitrary foreign words (e.g. English to Hebrew
and vice-versa) in any level of text, without the need to mark them up,
because the implicit algorithm always takes care of one level of "embedding".

[The only exception for DIR is the <BDO> element, specifically designed
for the few cases where overriding is indeed necessary.]

>>So we decided to use the solution that is easy to understand
>>and very natural to human authors (you just indicate directionality,
>>which will mean whatever appropriate on any given element) and
>
>But the idea with Unicode is that you do not need to indicate
>directionality, except for the global directionality of the document
>and of any exceptional block elements.

The idea with Unicode is that you don't need to indicate directionality
as long as the implicit algorithm works. The implicit algorithm
works until you get neutral delimiters such as '?', in which case
you have to insert an RLM or LRM. We agree that this is a minor
issue.
The next place where the implicit algorithm does not work is
embedding. Unicode provides means to express embedding
in raw text, and it allows (as cited in an earlier mail) to
supplement or override this mechanism in higher-level
protocols.
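
To see why a neutral delimiter needs help from an RLM or LRM, one can look at
the bidirectional classes the implicit algorithm works from. The following is
a small Python sketch using the standard unicodedata module; the helper name
bidi_classes is made up for this illustration, and the sketch only lists
classes, it does not run the bidi algorithm itself.

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
ALEF = "\u05d0"                # HEBREW LETTER ALEF


def bidi_classes(text):
    """Return the bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# A '?' between a Latin and a Hebrew letter is neutral (class ON),
# so its direction has to be resolved from its surroundings.
print(bidi_classes("a?" + ALEF))

# An invisible RLM after the '?' gives it a strong right-to-left neighbour.
print(bidi_classes("a?" + RLM))
```

The marks are strong characters (LRM has class L, RLM has class R), which is
what lets them settle the direction of an adjacent neutral.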

>> Plain ArabiC<EM>&zwj;emphasized Arabic</EM> plain Arabic.
>
>What about
> Plain French boi<EM>^te</EM> plain French.
>
>These sick examples don't mean anything. The browser would do
>whatever it happens to do, and let's hope it doesn't crash.

So you agree that we should say that the behaviour in these
cases is undefined? (In Arabic text, there might well be
a need to emphasize part of a word.)

>>The second class of formatting characters is those with long-
>>ranging influence, including RLE, LRE, RLO, LRO, and PDF.
>
>These characters should not be used that way, and there is no need to
>give such use special consideration.

By "long-ranging" I meant "more than the single character on its
right and left". I did not want to imply that this would be, in
the general case, longer than a line or so.

Anyway, if your opinion is that these should not be used,
by which you probably mean that their functionality should
not be used, then we might just change the standard to
eliminate that stuff and prohibit the use of these characters.
But then you would have problems expressing some
nested quotes in logical order.

>>(Much of this, as well as most of the arguments in the
>>rest of this mail, has been discussed in html-wg
>>extensively, and I would suggest that the relevant
>>parts of the archive are scanned by interested parties.)
>
>I had tried to access them, without success.

The URL is http://www.acl.lanl.gov/HTML_WG/archives.html.
I never had any problems accessing these archives. Please
try again.

>> whether any and all markup opens a new embedding level,
>>or whether there is only a new embedding level when direction
>>is explicitly specified (either by DIR or maybe indirectly
>>by LANG).
>
>A block level markup resets the embedding level according to its base
>direction (explicit, inherited or implied). Other markup has no
>effect on the embedding level.

There is no problem with block level markup, other than that we
currently don't have implied/implicit as an attribute value.
We might add such a value as a result of this discussion, but
we need some more evidence.
Explicit is done with a DIR attribute, and inherited is done by the absence
of a DIR attribute.

As long as we have directionality attributes on in-line elements
(and your arguments up to now have not convinced me that
we should not have this), we have to know exactly what this
means, and in this case my above question is indeed relevant.

>>There is no other implicit algorithm.
>
>If you don't want a new algorithm don't specify bidi behavior on the
>character level. Just say that the text is rendered according to the
>Unicode specification. The moment one adds things, explains them or
>replaces them with "equivalent" or "identical" features one creates a
>new, different, standard.

HTML is of course a different standard from Unicode. We cannot
avoid having different standards for different things.
But even having different standards, they can (and in our case do!)
use the same algorithm.

>>This is the most reasonable way to guarantee that meta-level
>>and base text level can be kept reasonably separated.
>
>They are kept separated by the means described above. Only the base
>level text is subject to the bidi algorithm.

On the browser side, yes. But the problem is on the author/editor
side.

>>Also, I want to mention here that actually our proposal very
>>clearly adheres to the Unicode standard. Just for the record:
>>
>>Unicode 2.0, in Chapter 3.11, under the title
>>"Higher-Level Protocols", says: "The following are concrete
>>examples of how systems may apply higher-level protocols to the
>>ordering of bidirectional text." and as one of these examples
>>gives:
>>"* Supplement or override the directional overrides or embedding
>>codes by providing information via stylesheets about the embedding
>>level or character direction."
>>At the end of the "higher-level protocol" subsection, the text
>>also says "When text using a higher-level protocol is to be
>>converted to Unicode plain text, formatting codes can be inserted
>>to ensure that the order matches that of the higher-level
>>protocol,...."
>
>This is a misunderstanding. This item should be read in context. HTML
>does not include stylesheets. This is not an option to ignore bidi
>formatting codes or replace them by approximately similar things.

HTML may very soon very explicitly include stylesheets. Please have a look
at the CSS proposal from W3C.

But this is not very relevant. What is relevant is that the introductory
part of the section "higher-level protocols" says that the following
are *concrete examples*. Using some other form of markup instead
of a "stylesheet" in the narrow sense would be another example
that would very well fit in the list.
If this were not what was intended, then the text would have
been written differently!

>In this paragraph, the item which relates most strongly to HTML is
>the first: "A higher-level protocol may provide for overriding the
>basic level embedding, such as on a field, paragraph, document or
>system level.". Unicode says "may", but in the context of HTML it
>should be taken as a must. The draft does this, but needs cleaning
>up.

I agree. But the first item is not the only item that applies to HTML.
As you seem to agree, HTML IS a higher-level protocol, and so all
items (potentially) apply. And as the items are only a list of examples,
it has also to be checked whether there are other, similar things
that might potentially apply, and that might need special treatment
due to the nature of the higher-level protocol and its structure
and uses.

>>3. Language attribute
>
>>Why should it interact on the global level, but not elsewhere?
>
>It is needed on the document level, we have to know whether it is a
>left to right document (maybe with some right to left in it) or a
>right to left document. On block level elements LANG could be
>considered to imply DIR.

It is important to know what the directionality of the
overall document is. But this can be done directly with a DIR
attribute; it is not necessary to derive it from the LANG attribute.
Your argument for implication of DIR from LANG is not
compelling, and you did not address my concern that the
number of RTL languages is not easy to enumerate.

>On in-line elements, which do not have a
>base direction, it cannot provide directional information - if one
>uses Arabic or Hebrew characters, they are right to left. If an
>override is needed, there are appropriate characters in Unicode, this
>is not an attribute of the element.

In-line elements do not have a base direction, but they may
have an embedding direction. Embedding is not overriding.

>>5. Justification - the ALIGN attribute
>
>>True, but this is the responsibility of the browser implementor
>>to do it that way (or fail on the market). There is no need to
>>specify that in the standard.
>
>The purpose of the HTML specification is to make sure that a document
>will appear equivalently or similarly in any conforming browser. If
>two browsers disagree on the interpretation of ALIGN and the other
>features for which the same rationale is proposed, and if we were to
>accept that this is not part of the spec, they would both be
>conforming and the result would be terrible.

I agree that if ALIGN is specified, it should be clearly defined. In the
absence of any additional information, I think we can assume that
right is right and left is left anywhere in the world. The ALIGN
attribute is applied to individual blocks anyway, so that things
should be clear enough.

What I meant by my above statement "no need to specify" is
the case where no ALIGN attribute is present. Some browsers
might then justify the text (most browsers currently don't do this,
unfortunately). Others might still have it "left", and these of course
should change "left" to right in case block directionality is RTL.
I consider this i18n UI common-sense knowledge, and so probably
do others in this field.
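
The rule described above is small enough to sketch in a few lines of Python.
The function name default_align and its parameters are hypothetical and only
illustrate the reasoning: an explicit ALIGN value is taken literally, and
only the default follows the block directionality.

```python
def default_align(block_dir, align=None):
    """Pick the alignment of a block.

    block_dir: "ltr" or "rtl", the block directionality.
    align:     explicit ALIGN value, or None if the attribute is absent.
    """
    if align is not None:
        # "Right is right and left is left anywhere in the world."
        return align.lower()
    # No ALIGN attribute: the default side follows the directionality.
    return "right" if block_dir.lower() == "rtl" else "left"


print(default_align("rtl"))          # default for an RTL block
print(default_align("rtl", "LEFT"))  # explicit ALIGN is kept as written
```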

>>>With the data of each field, the form should return the direction
>>>attribute actually selected by the user in addition to the
>>>character set.
>>
>>This idea is new. Is there any specific application where this
>>would be needed, or a wide general requirement? In those probably
>>rather few cases where it is really needed, why not just add
>>a button where the user can set this information?
>
>Proper interpretation of bidi text requires a base direction. If the
>user is allowed to change the DIR attribute of the field, the server
>needs to know.

My counterposition would be that proper interpretation does not
need a base direction. Text is stored logically, and this logical
sequence should be clear enough.

If the user is allowed to select the block direction(s) of some
text entered, then this is probably more of a presentation
issue, to which different users may react differently, and
which might also depend very much on context (imagine
a single item entered by a user from a Hebrew page to
a database, and then displayed in a long list on an
English page).
Anyway, inserting a RLM or LRM provides a means to
achieve what you intend in cases where this is really
necessary (I would really like to see actual examples
of this; the examples I tend to find (see above) usually
suggest that block directionality in forms is much more
a local issue.).

>>>As proposed, the form should be able to restrict the user's input
>>>to a specific character set, according to the requirements of the
>>>server.
>>
>>Can you make a proposal of how this could be done? Should some
>>relation between the "charset" parameter of the document and
>>of the field be allowed? What if I obtain a document via a
>>conversion proxy in Unicode?
>
>Maybe we should add a "regular expression" attribute to the form
>field. This could be useful for other purposes too. It would be
>independent of the coding scheme.

Definitely an interesting idea, but very far-reaching.
If ever, it may be much better to present this as a separate item.

>>4. SGML
>
>>Does "Hebrew tag" mean that the tags are actually written
>>with Hebrew letters, or just that they are additional tags
>>written with Latin letters, such as <ph> below?
>
>In Hebrew, using Hebrew letters. For example, the Hebrew tag for ph
>is the letter Pe, and for p the letters Pe Alef.
>
>>Please note that we have considered allowing (almost) any
>>character from ISO-10646 in tags, but have not done it because
>>we have met too much resistance, and because it would have
>>been too much work for not enough benefit.
>
>I have not suggested it. It has no added value. The author can
>use his own language for tags and then post process to English.

Nice to see an issue we agree on. Hope these get more numerous
as we continue our discussion :-).

>>This is neither very general (the Arabs will prefer <pa>) and
>>nor very structured (mixing attributes with entities).
>
>(I assume they would use <pa> for an Arabic paragraph). It has the
>advantage of being short and convenient. <p> is the most common tag.

>>In this description, I miss RLE and LRE. If they are indeed
>>supported, it would be nice if you tell us how this is done.
>>If they are not supported, I would be interested to know why.
>>Also, I don't understand with what IMD is parallel in Unicode.
>
>RLE and LRE are meaningful only with Unicode, where they are provided
>as characters. RTL, LTR and IMD are intended for non-Unicode
>character codes. IMD specifies Unicode-like implicit directionality.

As far as I understand, IMD just says: Take the Unicode default
solution for deciding on the base directionality of a block.
Can you confirm this? What is the base directionality if
there is no DIR attribute present? Or what should it be?
LTR or IMD?

>>Also, I would like to know what SI 1680 prescribes if the
>>formatting characters are indeed available. For example,
>>is it allowed to start an overriding level with an <RLO> tag
>>and end it with a PDF character, or start with an LRO character
>>and end with </LTR>?
>
>There is no <RLO> tag, only an <RTL> tag. RTL provides the basic
>direction, RLO overrides the character direction.

Suppose there were inline directionality markup, mostly in the
form of the DIR attribute for embedding, as in our draft.
What would be the preferred way of mixing this with in-line
formatting characters?

Regards, Martin.


