Re: Generic Tagging: A Modest Proposal

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Thu Jul 17 1997 - 08:33:17 EDT


On Wed, 16 Jul 1997, Glenn Adams wrote:

> At 12:42 PM 7/15/97 -0700, Markus G. Kuhn wrote:

> >Many systems already have their own language tagging mechanism and do
> >not need an additional one from Unicode. For instance, in HTML 4.0
> ><http://www.w3.org/TR/WD-html40/>, you can write things like
>
> While HTML and other application conventions solve the language tagging
> problem in particular domains, they do not do so in a way that satisfies
> this requirement in plain text domains. The proposed mechanism does not
> conflict with the HTML mechanism; indeed, the HTML mechanism would be
> preferred in that context. Note that a similar issue arises with respect
> to bidirectional overrides and embedding levels. 10646 encodes these
> directly as their absence would preclude minimum legibility of many bidi
> texts in the plain text context. However, when using a richer representation,
> like HTML, these should generally be replaced with markup at the higher
> level. Language information can be handled similarly.

There are some similarities between the BIDI "control" characters and
the language tags, but also some important differences:

- The need for BIDI information arises as soon as the width of
        a displayed text can no longer be guaranteed. It ensures
        that words appear in the right sequence within a line,
        and is therefore crucial for basic readability.
        Language information has various applications, but they
        all involve much more sophisticated operations
        than variable-width formatting, and basic readability
        is not an issue.

- Not surprisingly, BIDI codes have a long tradition in plain
        text, and HTML markup has been modelled after this
        tradition and existing standards. Language tags, also
        not surprisingly, don't have much of a tradition in
        plain text, and the proposals currently discussed are
        modelled after marked-up text. Please don't say that
        this is due to Unicode; if language tags were that
        seriously necessary, they would already have been
        introduced for iso-8859-1 and many other "charset"s.

- In RFC 2070 and in HTML 4.0, BIDI "control" codes are allowed
        in parallel with HTML BIDI markup, but their use is
        highly discouraged because it's very difficult to
        keep both variants in sync, and to distinguish between
        bidi information for the markup and for the final text
        when editing raw HTML. Tolerance for BIDI "control" codes
        was added at a rather late stage at the request of an
        Israeli specialist who worried about the ease of
        including existing plain text in HTML.
        With respect to the currently discussed "plain-text
        language tags", there is neither a need nor a plan
        to allow them in HTML. HTML has its own mechanism for
        providing language information, and other formats can
        choose between conversion and failure.
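
To make the "conversion" option concrete for the BIDI case, here is a
minimal Python sketch of how plain text carrying BIDI "control" codes
might be turned into RFC 2070 style markup. The function name
bidi_controls_to_markup is hypothetical, and the mapping of embeddings
to SPAN with DIR and of overrides to BDO is my own assumption about a
reasonable conversion, not something any specification prescribes. The
sketch treats its input as a single paragraph, drops an unmatched
"pop" code, and leaves other characters (including LRM and RLM)
untouched.

```python
import html

# Unicode bidi formatting characters: embeddings, overrides, and "pop".
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

# Assumed mapping: embeddings become SPAN with DIR, overrides become BDO.
OPENERS = {
    LRE: '<span dir="ltr">',
    RLE: '<span dir="rtl">',
    LRO: '<bdo dir="ltr">',
    RLO: '<bdo dir="rtl">',
}
CLOSERS = {LRE: "</span>", RLE: "</span>", LRO: "</bdo>", RLO: "</bdo>"}


def bidi_controls_to_markup(text):
    """Replace bidi control codes in one paragraph with HTML markup.

    Hypothetical helper: a sketch, not a complete converter.
    """
    out = []
    stack = []  # closing tags for the currently open embeddings/overrides
    for ch in text:
        if ch in OPENERS:
            out.append(OPENERS[ch])
            stack.append(CLOSERS[ch])
        elif ch == PDF:
            # Close the innermost open level; an unmatched PDF is dropped.
            if stack:
                out.append(stack.pop())
        else:
            # Escape &, <, > and quotes so the text is safe inside HTML.
            out.append(html.escape(ch))
    # Levels still open at the end of the paragraph are closed implicitly.
    while stack:
        out.append(stack.pop())
    return "".join(out)
```

Once the controls are replaced, only one representation of the
directional information remains in the document, which avoids the
problem of keeping two variants in sync.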

Regards, Martin.


