Re: I18N of HTML - Hebrew

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Tue May 07 1996 - 13:16:00 EDT

In this mail, I am mainly responding to some comments from
Jonathan Rosenne, but it should also contain an answer to the
comments of Ken Whistler, and explain in more detail what
Gavin Nicol has said.

>This article is a comment to "Internationalization of Hypertext
>Markup Language", <draft-ietf-html-i18n-03.txt> dated 13 February
>1996, with respect to the requirements and implementation of
>bidirectionality (bidi, in short).

Your comments are very welcome, as are any other comments.
They have indeed helped me detect some inadequacies in the
current draft, although these are not exactly where you see them.
The i18n draft is (hopefully) very close to finalization, and
I would urge anybody to send any comments they may have *NOW*,
because we don't want to delay the finalization.
Comments don't need to be as detailed as the ones from
Jonathan, but a clear statement of the perceived problem
is extremely helpful, as is a clear proposal for
the changes necessary.

>My comments address bidi from the point of view of Hebrew. I
>expect my Arab colleagues to address their point of view.

>I am posting this to the html-wg list, and comments, praise and
>criticism may be posted there or sent to me.
>
>I am sending this also to other people, but please respond
>through the above mentioned channels.

I am cross-posting to the Unicode list. Please forward this
mail to the other people on your list.

>1. Summary
>
>The implementation of bidi should conform to the Unicode
>specification. As proposed, there are some deviations.
>
>Markup should provide higher level parameters rather than replace
>low level functions.

When designing BIDI for HTML, we made every attempt to not deviate
from the Unicode standard, both in its wording and in its essence.
And as far as I can see, we indeed did not deviate from it.
And we definitely agree that we don't need different standards.

But the actual circumstances of how HTML is used led to our
specific use of the Unicode BIDI specification for HTML,
and I still think that it is very well justified. The main
arguments for this (more details below) are:

- Adding formatting characters so that the finally rendered text
looks as desired can lead to very strange display of raw
HTML in a bidi-aware editor!

- Formatting character pairs such as RLO-PDF, LRE-PDF and so on
could interlace with the markup structure, which is not desirable.

- The most straightforward implementation is to add the necessary
formatting characters to the text and use a Unicode-compliant
text "object". Absolutely no implementation change is necessary
at the rendering/display level.

- Formally speaking, Unicode allows supplementation or overriding
of some directional characters by higher-level protocols.

I will add most of my comments to specific items written by
Jonathan, but while writing, I found that I first have
to explain the first point above in more detail.

For HTML (not necessarily for SGML), we have to assume that
a considerable part of the production is written directly in
a raw text editor, for which we will assume standard BIDI
support, but not necessarily any knowledge of HTML.

Assume now that we insert all the necessary formatting characters
into the text stream so that after removing the markup and
doing the other SGML processing (whitespace,...), we get a
text that in a standard Unicode BIDI renderer displays the
way we want it. The problem is that the characters
used for the markup (e.g. all these "<" and "H" and
"1" and ">" and so on) all have their own directionality
(mostly LTR and neutral), and thus our HTML text in our
editor can look quite garbled and unintelligible. The problem
also exists the other way round: if an author edits a text
until it looks appropriately organized with respect to BIDI
in his editor, there is no guarantee that once the whitespace
is processed and the markup is removed, the BIDI ordering is
as desired.

This is not a deficiency of the Unicode BIDI algorithm
as such; it is due to the fact that we are working on a
META-level (i.e. describing text with text) instead of
working on the plain text level. This kind of meta-level
was never considered for Unicode BIDI, just as other
kinds of metaness (e.g. a book about character encoding
or typography) were not considered when deciding on the
set of characters in Unicode.
Also, it is not only HTML (or SGML) that is affected;
any kind of meta-level, e.g. a C program containing many
string constants in Hebrew, or many comments in Arabic,
may also be affected, although in these cases the problems
generally won't appear that often.

That the problem outlined above indeed appears should
be quite obvious, for example, from the fact that
HTML source usually has a CR and/or LF after each
line, which makes each line an independent block,
whereas in the text ready for rendering, all these
lines become part of one and the same block. So if you have
a two-line embedded text delimited with RLE-PDF and
the two lines are treated separately, the embedding will stop at the
line break, and the PDF will be spurious (I do not currently
know whether this is illegal, will be ignored, or
has any special effect).
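
The line-break effect can be illustrated with a minimal Python sketch. It only counts opening and closing formatting characters per block and does not run the bidi algorithm; the helper name `unbalanced` and the sample text are hypothetical, chosen for illustration.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def unbalanced(block):
    """Return (unclosed embeddings, spurious PDFs) for one block of text."""
    depth = spurious = 0
    for ch in block:
        if ch == RLE:
            depth += 1
        elif ch == PDF:
            if depth:
                depth -= 1
            else:
                spurious += 1
    return depth, spurious


# Two source lines, with one embedding that spans the line break.
source = f"{RLE}abc def\nghi jkl{PDF}"

# After parsing, both lines end up in the same block: the pair is balanced.
print(unbalanced(source.replace("\n", " ")))

# A raw text editor treats each line as an independent block:
# the first line has an unclosed RLE, the second a spurious PDF.
print([unbalanced(line) for line in source.split("\n")])
```

The same characters are therefore well-formed for the renderer and ill-formed for the editor, which is the mismatch described above.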

Another example is as follows (with lower case for
RTL (e.g. Hebrew) and upper case for LTR, and with &rle; and
&pdf;, which do not exist as entities in the draft, standing for
the actual inline characters). Assume
the following backing store of the HTML editor:

&rle;abc <A HREF="X.HTML">def</A> ghi&pdf;

The idea of this is clear: the author wants the display to be

ihg fed cba

with "fed" as an anchor. In a raw text editor
with Unicode bidi support, however, this looks as follows:

ihg <A/>fed<"A HREF="X.HTML> cba

[Here are some of the steps necessary to derive this:
 abc <A HREF="X.HTML">def</A> ghi
 RRRNNLNLLLLNNLNLLLLNNRRRNNLNNRRR
 RRREELLLLLLLLLLLLLLEERRREELEERRR
 RRRRRLLLLLLLLLLLLLLRRRRRRRLRRRRR
 11111222222222222221111111211111
 abc <LMTH.X"=FERH A">def</A> ghi
]

As this is read from right to left, the fact that </A>
has changed into <A/> may be a minor nuisance or even
desired, but the move of the closing '"' to the beginning
of the <A> tag is definitely too bad. If we insert an LRM
after the '"', we can solve this case (given that we have
changed the SGML character set declarations appropriately,
which might have other undesired effects), but there are
many other cases, and we cannot ensure that what we see
on screen in the editor corresponds to what will be
rendered after parsing.
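
The bracketed derivation can be reproduced with a small Python sketch. It is a toy model under stated assumptions and not the full Unicode bidi algorithm: lower case stands for RTL and upper case for LTR as in the example, every other character is treated as neutral, the whole line sits inside a right-to-left embedding, and only "<" and ">" are mirrored. The function names are hypothetical.

```python
MIRROR = {"<": ">", ">": "<"}


def classify(ch):
    """Toy classes: lower case = R (stands for Hebrew), upper case = L, rest neutral."""
    if ch.islower():
        return "R"
    if ch.isupper():
        return "L"
    return "N"


def resolve_neutrals(types, base):
    """Neutrals between two equal strong types take that type, otherwise the base."""
    out = list(types)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        before = out[i - 1] if i > 0 else base
        after = out[j] if j < len(out) else base
        out[i:j] = [before if before == after else base] * (j - i)
        i = j
    return out


def display_order(text):
    """Visual order of `text` inside a right-to-left embedding (level 1)."""
    types = resolve_neutrals([classify(c) for c in text], "R")
    # R text stays at level 1, L text is pushed to level 2.
    cells = [(c, 1 if t == "R" else 2) for c, t in zip(text, types)]
    for level in (2, 1):
        i = 0
        while i < len(cells):
            if cells[i][1] < level:
                i += 1
                continue
            j = i
            while j < len(cells) and cells[j][1] >= level:
                j += 1
            cells[i:j] = cells[i:j][::-1]  # reverse each run at this level or higher
            i = j
    # Characters at odd (right-to-left) levels are shown with mirrored glyphs.
    return "".join(MIRROR.get(c, c) if lvl % 2 else c for c, lvl in cells)


print(display_order('abc <A HREF="X.HTML">def</A> ghi'))
```

Under these assumptions the sketch yields the same garbled tag as shown above, with the closing '"' moved to the front of the <A> tag.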

>2. Formatting codes
>
>The subject document, in clause 4.2. Markup for language-
>dependent presentation, proposes removing several Unicode
>formatting codes and replacing them with markup.
>
>"On block-type elements, the DIR attribute indicates the base
>directionality of the text in the block; if omitted it is
>inherited from the parent element. On inline elements, it makes
>the element start a new embedding level (to be explained below);
>if omitted the inline element does not start a new embedding
>level."
>
>The proposed attribute is ambiguous, it has two meanings
>depending on whether the element it is attached to is considered
>"inline" or "block-type".

The proposed attribute is not ambiguous! Its semantics are
defined clearly for all elements. We have considered using
different attribute names on inline and block-type elements,
but have deliberately not done so, to make things easier for
authors.

A human author, when preparing some HTML text, might think
about a <BODY> "This is Right-to-left, so let's add a DIR";
she might similarly think about a paragraph "This is Left-to-right,
so let's add a DIR", and then again about an inline element
"This is Right-to-left, so let's add a DIR". If we use
different attribute names, we will very soon see authors use
both of them on both types of elements, and browsers accepting
both on both types of elements. This is the perhaps unsatisfactory
reality of practical HTML!

So we decided to use the solution that is easy to understand
and very natural to human authors (you just indicate directionality,
which will mean whatever is appropriate on any given element), and
we are confident that implementors, who are much smaller in number,
more familiar with the exact definitions and workings of the
algorithm, and more strictly logical in their thinking,
will do the right thing as we have specified it in the draft.
There is not much else you could do to reasonably implement
DIR on these elements anyway.

>"A set of named character entities is added that allows partial
>support of the Unicode bidirectional algorithm [UNICODE], plus
>some help with languages requiring contextual analysis for
>rendering:"
>
>The Unicode characters which are more or less equivalent to the
>"inline element attribute" are apparently not allowed.
>The full set of Unicode bidi formatting codes is required, in the
>form of named character entities. These are characters in the
>Unicode (UCS-2) character set, just as any other character, and
>there is no way to exclude them and still claim compatibility
>with those standards. Neither is it justified: they were included
>in the standards because they were considered necessary.
>
>Nor is the proposed change useful: since equivalent function is
>provided, it is not a simplification. It is just different. And
>it raises the question: What do we do with text that does include
>these formatting codes?

There is no explicit statement that some formatting characters
are not allowed. It might be necessary to add such a statement,
if indeed desired, or to add an explicit statement to the contrary.
The character set declarations in the SGML declaration definitely
don't exclude any character values, so the conclusion
that some characters are not allowed may be premature.
This is a place where the draft should be improved.

In general, there are two classes of formatting characters,
those with primarily local influence and those with more
wide-reaching influence.

For the first class, which includes RLM, LRM, ZWNJ, and ZWJ,
there is definitely neither a need nor an intention to exclude them
as characters. Indeed, in SGML, it would be impossible to
define a CHARSET without them, and then define the corresponding
entities based on numeric character references, as done in the
draft. The reason why these entities are provided is that the
characters are not contained in some MIME "charset"s even where
they might be desirable.

There is only one point that I feel is missing here, namely a
specification of what exactly has to happen if one of
these codes appears at the beginning or the end of content, i.e.
immediately before or after markup. This is absolutely independent
of whether these characters appear directly in the text or whether
they appear as named entities. For example, in the following
case (bidi ignored):
Plain ArabiC<EM>&zwj;emphasized Arabic</EM> plain Arabic.
what glyph shape is the explicitly capitalized "C" supposed to
take? Is &zwj; supposed to affect it, or not? In view of
consistency between raw HTML (with markup) and the final rendering,
and considering structural independence, it would be desirable if
the C did not ligate to the following text inside the emphasized element.
This or any other desired behaviour should be stated explicitly.
If consistency between raw HTML and the final rendering is desired,
some special action (insertion of &zwnj;) is necessary in some cases,
especially if there is a &zwj; or nothing before or after a tag that is
immediately preceded or followed by a character that might ligate.
No special action is necessary in case of a &zwnj; because this
is the default anyway. For the case of &rlm; and &lrm;, see further
below. Other definitions are possible, but in my eyes would not
make sense. Some other characters, such as NBSP, might also
need some attention.

The second class of formatting characters is those with long-
ranging influence, including RLE, LRE, RLO, LRO, and PDF.
These are the characters that make reasonable editing in
a Unicode bidi raw text editor a headache, as explained
above. One might come up with various solutions for this.

One solution is to say that you need very special tool support
for editing Bidi HTML, which would deprive Hebrew and Arabic
of one of the important and easy ways of creating HTML
documents. For some cases, it might be okay to override the
directional property of the '"' (clearly against the Unicode
standard, where these properties are normative), but for more
complicated cases, this may not be enough.

Another solution, the one we have tried to implement, is to
get rid of the confusion by avoiding these formatting characters.
The idea here is indeed that an HTML document should not contain them,
and we probably have to say so clearly in the text and also
exclude them in the CHARSET declaration.

This does not make the markup behave differently from the characters,
and is no "arbitrary" picking-out and prohibition of characters.
The "long-range" bidi characters are just those that cause big
headaches on the meta-level, while for anything else, such as NO-BREAK
SPACE, the situation is much simpler, if not trivial.

The model is then that an author can edit her texts in
a raw Unicode text editor (or any other bidi editor, for
that matter). Ideally, she sets the base embedding level for
all lines to LTR, which will ensure that at least the full tags
are rendered properly, and puts embedded or overridden text
on separate lines to help understand the rest.

Formatting characters should be used not at all or only in
some very restricted circumstances. Such a limitation might
be supported by the editor (which would not be against the
Unicode bidi standard), or the formatting characters with
long-ranging influence might be eliminated after editing
by a special tool (and then the text would be reviewed by
the author to check whether it still conformed to her
intentions).

In view of the existing text-only bidi documents, one
might argue that it would be easier to allow formatting
characters such as RLE or PDF (at least as long
as they don't interlace with other markup), but the fact
that a converter replacing these characters by appropriate
markup is easy to construct led to the conclusion that
this was not necessary.
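
Such a converter might look like the following Python sketch. The target markup is an assumption made for illustration: embeddings become a SPAN element with a DIR attribute, and overrides become a hypothetical element called BDO here. The sketch only handles text without other markup, so the interlacing problem mentioned above is not addressed.

```python
# Unicode formatting characters with long-ranging influence.
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

# Assumed target markup: (element name, DIR value) per opening character.
OPENERS = {
    LRE: ("SPAN", "LTR"),
    RLE: ("SPAN", "RTL"),
    LRO: ("BDO", "LTR"),
    RLO: ("BDO", "RTL"),
}


def codes_to_markup(text):
    """Replace embedding/override characters by start and end tags."""
    out, open_elements = [], []
    for ch in text:
        if ch in OPENERS:
            element, direction = OPENERS[ch]
            out.append(f"<{element} DIR={direction}>")
            open_elements.append(element)
        elif ch == PDF:
            if open_elements:  # a PDF without a matching opener is dropped
                out.append(f"</{open_elements.pop()}>")
        else:
            out.append(ch)
    while open_elements:  # close whatever was left open at the end
        out.append(f"</{open_elements.pop()}>")
    return "".join(out)


print(codes_to_markup(f"ABC {RLE}def {LRO}GHI{PDF} jkl{PDF} MNO"))
```

As stated above, the author would still have to review the converted text to check that it conforms to her intentions.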

(Much of this, as well as most of the arguments in the
rest of this mail, has been discussed extensively in html-wg,
and I would suggest that interested parties scan the relevant
parts of the archive.)

There is one more detail that I think should be checked
and then specified exactly in the draft. It is the question
of whether any and all markup opens a new embedding level,
or whether there is only a new embedding level when direction
is explicitly specified (either by DIR or maybe indirectly
by LANG). In my opinion, only the first alternative makes
sense, especially because otherwise, there might be a danger
of running out of stack positions (a total of 16 in the
bidi algorithm). In fact, especially in the case of
LANG, but maybe even in the case of DIR and thus in general,
it could be specified that an embedding markup only has effect
(in the sense of finally increasing the embedding level) if it
is different from the embedding/overriding directionality
of the surrounding text.
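
The rule suggested in the last sentence can be sketched in Python as follows. The limit of 16 stack positions (levels 0 to 15) is taken from the text above; the function name `enter`, the `skip_redundant` switch, and the choice to stay at the current level when the stack is exhausted are assumptions of this sketch, not something the draft specifies.

```python
MAX_LEVEL = 15  # levels 0..15, i.e. the 16 stack positions mentioned above


def enter(level, direction, skip_redundant=True):
    """Embedding level inside markup with the given direction ("LTR" or "RTL").

    `level` is the level of the surrounding text: even means left-to-right,
    odd means right-to-left.
    """
    wants_odd = direction == "RTL"
    if skip_redundant and (level % 2 == 1) == wants_odd:
        return level  # same direction as the surrounding text: no new level
    new = level + 1
    if (new % 2 == 1) != wants_odd:
        new += 1  # RTL goes to the next odd level, LTR to the next even level
    # Assumption: if the stack is exhausted, simply stay at the current level.
    return new if new <= MAX_LEVEL else level


# Ten nested right-to-left elements inside right-to-left text (level 1).
always, only_if_different = 1, 1
for _ in range(10):
    always = enter(always, "RTL", skip_redundant=False)
    only_if_different = enter(only_if_different, "RTL")

print(always, only_if_different)
```

With a new level for every element, the nesting hits the limit after seven elements, while the "only if different" rule never leaves level 1.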

Depending on the answer to the above questions, the question
of the exact behaviour of &rlm; and &lrm; in the vicinity of markup
can be discussed. If every textual markup should force a new
level, this question is automatically solved. Otherwise, it has
to be checked carefully. One solution is to replace markup
with a single &lrm; character, but other solutions
might well be better.

>The SII TC 1109 wish to state clearly: "We will not support
>another Hebrew implicit algorithm".

There is no other implicit algorithm. The idea of the whole
thing is to have a clearly defined markup so that HTML files
can be edited reasonably as raw text, but that this markup is
easy to map to a basic text stream that can then be
rendered with any Unicode-compliant text "object" or renderer.

This is the most reasonable way to guarantee that meta-level
and base text level can be kept reasonably separated.

>The specification should be modified as follows:
>
>The complete Unicode/UCS-2 character set is supported.
>
>Formatting code characters may be represented by named character
>entities.
>
>Bidi "inline element attributes" should be deleted. Bidi
>attributes apply only to block-level elements, and provide base
>directionality, as proposed in the subject document.
>
>HTML tags provide the global direction for the document and the
>base direction of each element.
>
>In my opinion, these suggestions not only bring the proposal for
>HTML I18N in line with Unicode, but they also simplify it
>considerably.

They would simplify the draft, but would leave the problem
of meta-level editing wide open.
Also, I want to mention here that actually our proposal very
clearly adheres to the Unicode standard. Just for the record:

Unicode 2.0, in Chapter 3.11, under the title
"Higher-Level Protocols", says: "The following are concrete
examples of how systems may apply higher-level protocols to the
ordering of bidirectional text." and as one of these examples
gives:
"* Supplement or override the directional overrides or embedding
codes by providing information via stylesheets about the embedding
level or character direction."
At the end of the "higher-level protocol" subsection, the text
also says "When text using a higher-level protocol is to be
converted to Unicode plain text, formatting codes can be inserted
to ensure that the order matches that of the higher-level
protocol,...."

It is our very firm understanding that HTML is indeed a higher-
level protocol (and one of the most widely used, at that), and
that the above sentences can be mapped more or less exactly to
what we are proposing in the draft.
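
The conversion from markup to plain text with formatting codes, as described in the second quotation, could be sketched in Python like this. The sketch rests on several assumptions: every element is treated as inline, tags are properly nested, there are no empty elements, and an inline element with DIR maps to RLE or LRE with a PDF at its end tag. The regular expressions and the function name are illustrative only and are no substitute for a real SGML parser.

```python
import re

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
DIR = re.compile(r'\bDIR\s*=\s*"?(LTR|RTL)"?', re.IGNORECASE)


def markup_to_codes(html):
    """Strip tags; turn inline DIR markup into embedding characters."""
    out, opened, pos = [], [], 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])
        pos = m.end()
        if m.group(1):  # end tag: emit PDF only if the start tag had a DIR
            if opened and opened.pop():
                out.append(PDF)
        else:  # start tag
            direction = DIR.search(m.group(3))
            opened.append(direction is not None)
            if direction:
                out.append(RLE if direction.group(1).upper() == "RTL" else LRE)
    out.append(html[pos:])
    return "".join(out)


plain = markup_to_codes('ABC <SPAN DIR=RTL>def <A HREF="X.HTML">ghi</A></SPAN> JKL')
print(plain.encode("unicode_escape").decode("ascii"))
```

The resulting string can then be handed to any Unicode-compliant text "object" without changes at the rendering level.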

As this might be one of the first occasions where higher-level
protocols are discussed widely, we also have a very strong
interest to find a good solution for a wide range of usage
models, and maybe even a solution that can serve as a model
for other protocols.

>3. Language attribute
>
>There is no one-to-one correspondence between the language and
>the formatting codes.

The draft does not say there is a one-to-one correspondence.
The language explaining ZWJ and such mainly explains that
these codes are *needed* for some languages. This is
important to make the majority of readers, who are not familiar with
such things, understand that this is not just a nice-to-have
(as such it would have been turned down long ago).

>ZWJ and ZWNJ could have use also in English - for example, in
>"proper" English print adjacent f and l are joined to form an fl
>ligature, and ZWNJ could be used to override this if separate
>letters were desired in a system that implements typographical
>printing.

According to Unicode 2.0, this is not a desired or intended
behaviour of ZWNJ; it is just what might happen on some systems.
Unicode 2.0 says that there are no characters controlling ligatures
(except where explicitly indicated, e.g. in the case of Indic
scripts) and that ZWJ/ZWNJ control cursive writing,
not ligating.

>In my experience with multi-lingual writing, the user does not
>bother tagging short phrases embedded within another language.
>
>The language attribute should not interact with the bidi
>attributes, with one exception: the document level language
>attribute (which the subject document suggests will be specified
>in the HTTP Content-Language header) provides the global
>direction. If the language is a right to left language, based on
>the primary tag (such as iw or ar) the global direction is right
>to left, otherwise it is left to right.

Why should it interact on the global level, but not elsewhere?
Independently of whether users mark up short phrases with the
respective language, in case they do, it makes sense to have
this markup affect directionality consistently for all HTML
elements. (The current draft says nothing about such an
influence on any level, which I assume to mean that there
is no such influence. The main problem in defining such
an influence is that the number of languages that are by
default written RTL is not really that small, and that there
are languages, such as Turkish, that can be written with
different scripts and directions).

>4. The DIR attribute
>
>The proposed DIR attribute confuses the "block" level direction
>(the base direction of the piece of text) with the character
>level direction.

Regarding "confusion" or "ambiguity", see my comments above.

>The DIR attribute should be defined for "block-type elements" as
>proposed, with the values LTR and RTL. Whether it is best to
>attach it to the SPAN element or to each element of the document
>I don't know. The DIR attribute would specify the base
>directionality of an element of text.

SPAN is not a block-type element. If nothing else, the
DTD should make this very clear.

>For the lower level elements, the Unicode formatting codes as
>defined in the Unicode standard provide a standard, extensively
>discussed and well understood solution. Their implementation as
>named entities makes them available with any character coding
>scheme.

>Note: The DTD includes another DIR and I suggest it would be
>better to choose another name for the directionality DIR.

There is indeed an element "DIR". Elements and attributes
in SGML are not so easily confused, but it might be better
to separate them. What is your suggestion? On the other hand,
the DIR element is not very widely used.

>5. Justification - the ALIGN attribute
>
>When a paragraph has right to left directionality, its
>justification is normally on the right. This applies to all lines
>of a right justified paragraph, and to the last line of a fully
>justified paragraph.
>
>Thus in a right to left element the roles of the right and left
>margins are interchanged, unless specifically overridden. If a
>right to left document that is right justified includes a left to
>right paragraph without explicit alignment, the expected
>interpretation would be to left justify this paragraph.

True, but it is the responsibility of the browser implementor
to do it that way (or fail on the market). There is no need to
specify that in the standard.

Maybe Francois Yergeau can expand more on the ALIGN attribute.

>6. Form fields
>
>When Hebrew or Arabic text input is expected it should be right
>justified within the window. The same applies when the field was
>left to right but the user elected to type right to left text.
>
>This means that fields should have a direction attribute, but if
>the user chooses another language than expected the display
>should follow the user's input. Of course, only if the other
>language is allowed.
>
>The same applies, of course, to default data that is displayed in
>a field from the VALUE attribute.

From the DTD, it should be obvious that FORMs and form fields
actually have a DIR attribute. If used together with a Unicode
text component, this will very easily give the behaviour you
describe. But there is no need to prescribe any specific
user interface behaviour in the standard.

>I would also expect the browser to set the keyboard language to
>match the base direction attribute of the field when the cursor
>moves into the field. In a Hebrew system, which is normally
>bilingual, Hebrew and English, when the field has right to left
>direction the keyboard language should be set to Hebrew, and when
>the direction is left to right the keyboard language should be
>set to English.

Nice, but this is up to the browser and does not belong in
the standard. Otherwise, our draft becomes a manual of i18n
user interface guidelines.

>With the data of each field, the form should return the direction
>attribute actually selected by the user in addition to the
>character set.

This idea is new. Is there any specific application where this
would be needed, or a wide general requirement? In those probably
rather few cases where it is really needed, why not just add
a button where the user can set this information?

>As proposed, the form should be able to restrict the user's input
>to a specific character set, according to the requirements of the
>server.

Can you make a proposal of how this could be done? Should some
relation between the "charset" parameter of the document and
of the field be allowed? What if I obtain a document via a conversion
proxy in Unicode?

>The caveat in paragraph 5.2 should be expanded: In the case of
>certain characters, their representation may be changed. For
>example, there are two valid codings in UCS-2 for "e grave": a
>composed character, and a base character followed by a diacritic.
>Some people think the canonical form should be the composed form,
>others prefer the decomposed form, but in any case it is possible
>that the user agent will convert from one form to the other.

Can be done.
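
The two codings of "e grave" can be shown with Python's standard `unicodedata` module. The normalization forms NFC and NFD used here come from later versions of the Unicode standard and are not part of the draft; they merely stand in for the conversion a user agent might perform.

```python
import unicodedata

composed = "\u00e8"     # LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"  # "e" followed by COMBINING GRAVE ACCENT

# The two codings render alike but are different code point sequences.
print(composed == decomposed)

# Converting between the forms makes them comparable again.
to_composed = unicodedata.normalize("NFC", decomposed)
to_decomposed = unicodedata.normalize("NFD", composed)
print(to_composed == composed, to_decomposed == decomposed)
```

A server that compares submitted form data byte by byte would therefore need to bring both sides to the same form first.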

>In a right to left form with selections, the check boxes and
>radio buttons are on the right and the VALUE on the left, right
>justified.

Again a UI issue that does not belong in the standard.

>7. Tables
>
>Tables with the right to left attribute should be arranged from
>right to left, i.e. the first column is the rightmost column. In
>the absence of specific attributes, each cell should by default
>be justified according to its base direction.

UI issue, same as above.

>8. MIME
>
>Currently, the Hebrew standards for MIME are defined in:
>
>RFC 1555 Hebrew Character Encoding for Internet Messages
>
>RFC 1556 Handling of Bi-directional Texts in MIME
>
>
>The following specification is compatible with HTML I18N:
>
> Content-type: text/plain; charset=ISO-8859-8-i
>
>
>
>The following specifications are not compatible and should not be
>used:
>
> Content-type: text/plain; charset=ISO-8859-8-e
>
> Content-type: text/plain; charset=ISO-8859-8

Interesting and helpful. I think there should be a comment
saying that some "charset" parameters imply a specific
directionality behaviour, and therefore should not be used.
The above cases can serve as examples.

>9. Geometry
>
>The display geometry in HTML assumes left to right direction.
>
>In a right to left direction there are two possibilities:
>
>- the geometry of the screen is not changed. It is up to the
>author to lay out his text, images, forms and frames based on the
>existing mechanisms to obtain the desired layout.
>
>- the geometry is "mirrored". An application (for example, a CGI
>application) designed for left to right will thus function
>correctly in a right to left environment without requiring
>modification.
>
>An attribute is required on the document level to specify the
>desired behavior.

As far as HTML is concerned, text and tables should take care of
this, on the browser side. Things such as Java are beyond this
standard. CGI scripts are completely on the server side;
directionality behaviour depends on what they serve, not on
the script itself.

>10. ALT text
>
>Since Unicode will be the basic character set underlying HTML,
>there is no reason to restrict the character set of the ALT text.
>Of course, it is up to the author to design his page so that he
>does not send the user text he cannot see. But if the text is in
>Hebrew, there is no reason not to allow ALT text in Hebrew.

Good point. But maybe there was some specific opposition to
this in an earlier discussion that I don't remember.

>11. URL's
>
>Although not really belonging to this document, I would like to
>mention the need for UCS-2 coded URL's, at least the part after
>the path. Since it is really of no interest to the intermediary
>nodes, the only requirement should be that it be understood by
>the server. If the server supports file names in the local
>language, be it French, Hebrew or Japanese, why not?

I came back just last night from WWW 5 in Paris, where we
discussed this longstanding topic at the I18N workshop.
The need is there, but the obstacles and problems are also huge.
For an overview of some arguments, see the position papers of
Francois Yergeau and myself for that workshop:
http://www.w3.org/pub/WWW/International/workshop9605.html

If you have any ideas of how the problems can be solved, or any
new arguments, we would appreciate it.

>12. Preformatted text
>
>Text under the influence of a <PRE> tag and other tags indicating
>preformatting should be processed on a line by line basis. It
>should be considered preformatted only as far as HTML is
>concerned, not on the character level.

This would mean that bidi (maybe restricted to two levels because
neither markup nor RLE/LRE can be used) would apply, as well as
cursive formatting and so on. Is this sufficient for all cases?

>Appendix - Background information

>1. Who am I?
>
>I am an independent consultant, working in Israel and involved
>with national and international Hebrew standards for many years.
>I was a member of the ECMA and ISO working groups that worked on
>ECMA TR/53 and ISO/IEC 6429, and contributed to the formation of
>the bidi and Hebrew elements in Unicode and ISO 10646. Currently
>I am a member of the Standards Institution of Israel (SII)
>technical committee 1109, which is responsible for basic
>standards regarding Hebrew in computing.
>
>SII TC 1109 has discussed the subject, and agrees in principle
>with these comments. Due to the short time we had not reviewed
>them.

>There is one point, however, that we did formally agree on: the
>committee will not agree to yet another Hebrew character level
>standard, which means that the implementation of Unicode in HTML
>should be 100% conformant to the Unicode specification.

The current i18n draft is supposed to become an internet standard,
and internet standards are discussed and agreed on in other ways
than ISO or national standards. Any formal pressure by other
groups is in general not very well received.

Apart from that, I agree with you that conformance is important.
Also, we welcome any comments as long as they are based on
technical arguments, and we definitely don't want a standard
that disfavours some groups or gives them implementation
headaches.

But our draft is not "yet another Hebrew character
level standard", it is a higher-level protocol and as such
very well conforms to the Unicode standard. We also tried to
solve, as best as possible, the problems with the meta-ness
of markup and the interference with the bidi algorithm.

From your explanations, I was not able to find out how you
have addressed or plan to address this problem, or whether
this has ever been considered.

>2. Bidi Concepts
>3. Unicode and ISO 10646 (UCS-2)

Thanks for the explanations. As far as the authors of the draft
are concerned, you can rest assured that although we may not
use bidi text in our everyday life, we are sufficiently familiar
with the issues and the specifications.

>4. SGML
>
>The SII has recently approved Israeli Standard SI 1680, "Hebrew
>Implementation in SGML". The SII is preparing an English summary
>and a request to include our extensions in the relevant ISO
>standards.
>
>SI 1680 includes the following elements:
>
>- additional tags to support bidi
>
>- the specification of the global direction
>
>- entity names for the Hebrew characters
>
>- Hebrew tags for Hebrew general documents
>
>The global direction of the document is derived from language
>specification in the DTD (Hebrew being "iw").
>
>SI 1680 proposes Hebrew tags, corresponding to the English tags
>for general documents in ISO/IEC/TR 9573-1988. The use of a
>Hebrew tag implies right to left base directionality for the
>relevant document element.

Does "Hebrew tag" mean that the tags are actually written
with Hebrew letters, or just that they are additional tags
written with Latin letters, such as <ph> below?
Please note that we have considered allowing (almost) any
character from ISO-10646 in tags, but have not done it because
we have met too much resistance, and because it would have
been too much work for not enough benefit.

>A new tag, <ph>, indicates a Hebrew (right to left) paragraph,
>whereas <p> indicates a left to right paragraph. The document
>elements in the DTD include <ph> or <p> as appropriate to
>indicate their base direction.

This is neither very general (the Arabs will prefer <pa>)
nor very structured (mixing attributes with entities).

>Since SGML is not restricted to Unicode, and other character
>codes, such as ISO 8859-8, do not require implicit directionality
>and do not include formatting codes, additional tags were
>introduced to specify character level directionality, basically
>in parallel with the Unicode formatting codes: RTL, LTR and IMD.
>The default is, as in Unicode, implicit directionality. RTL and
>LTR specify right to left and left to right directionality, the
>same as Unicode RLO and LRO, </RTL> and </LTR> are equivalent to
>PDF, and IMD (Implicit Directionality) specifies implicit
>directionality. These tags may be nested.

In this description, I miss RLE and LRE. If they are indeed
supported, it would be nice if you tell us how this is done.
If they are not supported, I would be interested to know why.
Also, I don't understand what IMD is parallel to in Unicode.
Also, I would like to know what SI 1680 prescribes if the
formatting characters are indeed available. For example,
is it allowed to start an overriding level with an <RTL> tag
and end it with a PDF character, or start with an LRO character
and end with </LTR>? We have considered this, as Gavin
mentioned, using shortrefs, but have rejected it for
various reasons, such as:

- Shortrefs are not supported in most current HTML browsers
(this is a place where there are big differences between
HTML and general SGML).

- Combinations as given above can be very misleading to
a document author.

- There is of course the problem of editing the marked-up
document (meta-level).
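
To make the question about mixed tags and characters concrete,
here is a minimal Python sketch. It is only an illustration
under stated assumptions, not anything prescribed by SI 1680 or
by the i18n draft: the mapping of <RTL> to RLO, <LTR> to LRO and
the end tags to PDF follows the description quoted above, IMD is
left out because its Unicode counterpart is unclear, and the
function names are hypothetical.

```python
import re

# Unicode bidi formatting characters (code points from the Unicode standard).
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"

# Assumed mapping, taken from the quoted description of SI 1680:
# <RTL> ~ RLO, <LTR> ~ LRO, </RTL> and </LTR> ~ PDF.
OPEN_TAGS = {"RTL": RLO, "LTR": LRO}
OPEN_CHARS = {RLO: "RLO", LRO: "LRO", RLE: "RLE", LRE: "LRE"}

# Matches the direction tags and the five formatting characters.
TOKEN = re.compile(r"</?(?:RTL|LTR)>|[\u202A-\u202E]")


def tags_to_chars(text):
    """Replace direction tags by the formatting characters assumed above."""
    def repl(match):
        tok = match.group()
        if tok.startswith("</"):
            return PDF
        if tok.startswith("<"):
            return OPEN_TAGS[tok[1:-1]]
        return tok  # already a formatting character
    return TOKEN.sub(repl, text)


def check_mixed_levels(text):
    """Return (opener, closer) pairs where one is a tag and the other a character."""
    stack = []   # entries are (kind, label), kind is "tag" or "char"
    mixed = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok.startswith("</"):
            closer = ("tag", tok)
        elif tok.startswith("<"):
            stack.append(("tag", tok))
            continue
        elif tok == PDF:
            closer = ("char", "PDF")
        else:
            stack.append(("char", OPEN_CHARS[tok]))
            continue
        if not stack:
            continue  # unmatched closer; ignored in this sketch
        opener = stack.pop()
        if opener[0] != closer[0]:
            mixed.append((opener[1], closer[1]))
    return mixed
```

Calling check_mixed_levels on a level opened with <RTL> and
closed with a PDF character reports that pair, which is exactly
the kind of combination that can mislead a document author.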

>Appendix B. Entity names for the Hebrew Characters

We have thought about including more than the current Latin-1
characters as named entities. It might not have been too
difficult for a few well-defined sets, but overall, it
might easily have become a work without end, might
have led to some difficult decisions (esp. in the case of
conflicts) and would have expanded our draft too much.

The only thing we have done is to add named entities for
those characters in Latin-1 that were still missing.
Because Latin-1 is the default for HTML (over HTTP),
this is more a matter of filling earlier omissions than
anything else.

As you don't make any explicit proposal for the inclusion
of these entities, I assume that in this respect, what
we have done is acceptable. But then I have difficulty
understanding why you have posted this long list.

>Jonathan (Jony) Rosenne
>JR Consulting
>P O Box 33641, Tel Aviv, Israel
>e-mail: 100320.1303@CompuServe.com

I hope that in the interest of all concerned parties, we can
deal with these comments as fast as possible. The main place for
discussion, for everything that concerns the draft, is of course
the html-wg mailing list. I am cross-posting to unicode because
many things in this discussion relate to Unicode, and I hope
to get good comments from there, too.

Regards, Martin.

----
Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT