On Tue, 15 Jul 1997, Markus G. Kuhn wrote:
> Kenneth Whistler wrote:
> > The currently
> > active proposal is called the "Plane 14" proposal,
> > and has been in active discussion between the UTC
> > and members of the IETF.
> I hope strongly that these language tags will not become directly part
> of ISO 10646-1, but will be described in a separate document, as this
> sounds clearly like a non charset issue to me.
Yes, indeed. I think this is also what Mark is worrying about.
The question is really about conformance and so on. I tried
to give examples of what conformance clauses could look like in
The conformance clauses in the current version leave a lot to
be desired, from my point of view. Tag characters are treated
in the same way as other characters, but because they work
completely differently, this is not appropriate.
What I would propose is the following:
- These tags are not for use in Unicode/ISO10646 text in general.
- Where they are used, their use has to be specified explicitly
(which tags, what values are allowed, what is their
intended semantics (already for language tags, we have
various possible semantics)).
- Protocols and mechanisms that use them have to make sure that
they are completely stripped when they interface to the
outside world, i.e. to anything that has not explicitly
defined that it accepts (a certain kind of) tags
(with the same semantics).
[The idea here is not that protocols that use them should
go on and define new "charset"s and so on, because this
would lead to an unnecessary proliferation of these
"charset"s, but that, like up to now, any protocol
has to specify which characters and combinations are
legal in what case, and what they mean. It would just
mean that if a protocol defines that at some point,
"any Unicode character" can be used, this would by
default exclude these tag characters (in the same way
it currently excludes 0x?FFFF,.. in each plane), so that
there is never a danger that these characters get used
in places they are not intended to be used (namely
for real plain text). This is really the main reason
for their usefulness in the first place.]
- Generic Unicode software is neither required to nor advised
to interpret the tags in any way. It should treat each
single tag character as a single character. If it does
display, it may display them in the same way as unknown
characters, or with some suitable glyph.
[Note: This one may look like an attempt to make the
use of these tags as difficult as possible, but it is
actually designed as an important feature, allowing
raw text editing and debugging of the tags,
very crucial for development work.]
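
To make the points above concrete, here is a minimal sketch in Python of
what a protocol boundary and a generic editor could do. It assumes the tag
characters occupy U+E0000..U+E007F (the range later standardized as the
Tags block in Plane 14; the proposal under discussion may differ), and it
treats the last two code points of each plane (0x?FFFE and 0x?FFFF) as
excluded. All function names are hypothetical and chosen for illustration
only; this is not taken from any conformance clause.

```python
# Assumed range for the Plane 14 tag characters (an assumption, see above).
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F


def is_tag_char(ch: str) -> bool:
    """True if ch is a single tag character in the assumed range."""
    return TAG_FIRST <= ord(ch) <= TAG_LAST


def is_excluded_code_point(ch: str) -> bool:
    """True for 0x?FFFE and 0x?FFFF, the last two code points of any plane."""
    return (ord(ch) & 0xFFFE) == 0xFFFE


def strip_tags(text: str) -> str:
    """Remove every tag character before text crosses to the outside world."""
    return "".join(ch for ch in text if not is_tag_char(ch))


def is_plain_text(text: str) -> bool:
    """'Any Unicode character' in the sense above: no tags, no excluded points."""
    return not any(is_tag_char(ch) or is_excluded_code_point(ch) for ch in text)


def debug_render(text: str) -> str:
    """Show each tag character as one visible placeholder, uninterpreted."""
    return "".join(
        f"<U+{ord(ch):05X}>" if is_tag_char(ch) else ch for ch in text
    )


# Three tag characters followed by ordinary text.
tagged = "\U000E0001\U000E0065\U000E006Ehello"
print(strip_tags(tagged))
print(debug_render(tagged))
```

A protocol that explicitly accepts tags would skip strip_tags on its own
input and apply it only where it hands text on to something that has not
declared that it accepts them. debug_render never groups or interprets the
tags, so each tag character stays individually visible and editable.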
Let's keep these things to where they are really necessary
(or better, probably: where some people think they are really
necessary), and not have the rest of the world bothered with
> Many systems have already their own language tagging mechanism and do
> not need an additional one from Unicode. For instance, in HTML 4.0
> <http://www.w3.org/TR/WD-html40/>, you can write things like
> <P LANG=de>Dies ist ein Absatz in Deutsch, in den wir
> etwas <Q LANG=en>english text</Q> eingebettet haben.</P>
> In this example, language information is used to switch between
> German and English hyphenation rules.
> See <http://www.w3.org/TR/WD-html40/struct/dirlang.html> for further
> details. You can specify the language in HTML 4.0 using the LANG
> attribute in almost any HTML element. This is much more convenient
> than handling additional new Unicode control characters. I expect
> that Netscape 5.0 will allow you to select fonts per language.
Thanks for mentioning HTML 4.0. There might already be such a feature
in Netscape 4.0, esp. for CJK. But I don't know.
Anyway, as one of the members of the W3C HTML WG, I would like to
invite you to send me any remarks about things related to i18n in
the HTML 4.0 working draft that you think need improvement.