Re: Generic Tagging: A Modest Proposal

From: Martin J. Duerst
Date: Wed Jul 16 1997 - 09:17:07 EDT

On Tue, 15 Jul 1997, Markus G. Kuhn wrote:

> Kenneth Whistler wrote:
> > The currently
> > active proposal is called the "Plane 14" proposal,
> > and has been in active discussion between the UTC
> > and members of the IETF.
> I hope strongly that these language tags will not become directly part
> of ISO 10646-1, but will be described in a separate document, as this
> sounds clearly like a non charset issue to me.

Yes, indeed. I think this is also what Mark is worrying about.
The question is really about conformance and so on. I tried
to give examples of what conformance clauses could look like in

The conformance clauses in the current version leave a lot to
be desired, from my point of view. Tag characters are treated
in the same way as other characters, but because they work
completely differently, this is not appropriate.

What I would propose is the following:

- These tags are not for use in Unicode/ISO10646 text in general.

- Where they are used, their use has to be specified explicitly
        (which tags, what values are allowed, and what their
        intended semantics are (already for language tags, we
        have various possible semantics)).

- Protocols and mechanisms that use them have to make sure that
        they are completely stripped when they interface to the
        outside world, i.e. to anything that has not explicitly
        defined that it accepts (a certain kind of) tags
        (with the same semantics).
        [The idea here is not that protocols that use them should
        go on and define new "charset"s and so on, because this
        would lead to an unnecessary proliferation of these
        "charset"s, but that, like up to now, any protocol
        has to specify which characters and combinations are
        legal in what case, and what they mean. It would just
        mean that if a protocol defines that at some point,
        "any Unicode character" can be used, this would by
        default exclude these tag characters (in the same way
        it currently excludes 0x?FFFF,.. in each plane), so that
        there is never a danger that these characters get used
        in places they are not intended to be used (namely
        for real plain text). This is really the main reason
        for their usefulness in the first place.]

- Generic Unicode software is neither required to nor advised
        to interpret the tags in any way. It should treat each
        single tag character as a single character. If it does
        display, it may display them in the same way as unknown
        characters, or with some suitable glyph.
        [Note: This one may look like an attempt to make the
        use of these tags as difficult as possible, but it is
        actually designed as an important feature, allowing
        raw text editing and debugging of the tags, which is
        crucial for development work.]
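
To make the two points about stripping and display a bit more concrete,
here is a minimal sketch in Python. It is only an illustration, not part
of the proposal. It assumes that the tag characters occupy the code point
range U+E0000..U+E007F in Plane 14, and the function names (is_tag_char,
strip_tags, show_tags) are made up for this example.

```python
# Assumption: tag characters occupy U+E0000..U+E007F in Plane 14.
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F


def is_tag_char(ch: str) -> bool:
    """Return True if ch is a single code point in the assumed tag range."""
    return TAG_FIRST <= ord(ch) <= TAG_LAST


def strip_tags(text: str) -> str:
    """Remove every tag character, e.g. before passing text to a consumer
    that has not explicitly declared that it accepts tags."""
    return "".join(ch for ch in text if not is_tag_char(ch))


def show_tags(text: str) -> str:
    """Make each tag character visible as one placeholder, without
    interpreting it, so that tags can be inspected and debugged."""
    return "".join(
        f"<U+{ord(ch):05X}>" if is_tag_char(ch) else ch for ch in text
    )


# Hypothetical sample: U+E0001 followed by the tag counterparts of "d", "e".
tagged = "\U000E0001\U000E0064\U000E0065Dies ist Text"

print(strip_tags(tagged))
print(show_tags(tagged))
```

The stripping function is what a protocol would apply at its boundary;
the display function treats each tag character as a single character,
in the same way as an unknown character would be treated.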

Let's keep these things to where they are really necessary
(or better, probably: where some people think they are really
necessary), and not have the rest of the world bothered with them.

> Many systems have already their own language tagging mechanism and do
> not need an additional one from Unicode. For instance, in HTML 4.0
> <>, you can write things like
> <P LANG=de>Dies ist ein Absatz in Deutsch, in den wir
> etwas <Q LANG=en>english text</Q> eingebettet haben.</P>
> In this example, language information is used to switch between
> German and English hyphenation rules.
> See <> for further
> details. You can specify the language in HTML 4.0 using the LANG
> attribute in almost any HTML element. This is much more convenient
> than handling additional new Unicode control characters. I expect
> that Netscape 5.0 will allow you to select fonts per language.

Thanks for mentioning HTML 4.0. There might already be such a feature
in Netscape 4.0, esp. for CJK. But I don't know.

Anyway, as one of the members of the W3C HTML WG, I would like to
invite you to send me any remarks about things related to i18n in
the HTML 4.0 working draft that you think need improvement.

Regards, Martin.
