Re: Plane 14 redux (was: Same language, two locales)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 05 2000 - 15:48:15 EDT


Doug Ewell countered (towards the end of a long thread on this topic).

As one of the coauthors of Plane 14, and the only one still
standing on the field at the moment, I guess I should weigh
in a bit now.

> > But, the problems with UTR#7 making a normative reference to a
> > particular system for language identification are (a) that systems
> > get revised (RFC 1766 will become obsolete before long),
>
> This is one reason I have suggested making reference to ISO standards
> in the past, rather than RFCs. When ISO standards get revised, they
> retain the number and name of the earlier version, so documents that
> reference those standards are *automatically* updated to the new ISO
> revision. RFCs are not revised, but rather replaced, so documents that
> refer to an RFC are linked to that version forever, or at least until
> the referring document is updated.

I am *strongly* opposed to trying to redo the work being done now
for the revision of RFC 1766 in UTR #7 (or any other Unicode Technical
Report).

Just as the UTC expects other standards bodies to make
normative and/or informative references to our standard and technical
reports, I expect to be able to make references to standards
developed by other groups -- including RFCs. This is particularly
the case when the RFC constitutes a "value-add" over bare references
to the standards which it, in turn, makes normative references to.

While it is true that UTR #7 will need updating because of revision
of RFC 1766 (which will end up being RFC WHATEVER at some point),
that is actually a minor part of the revision problem looming --
see below.

> > and (b) that it's doing so in the absence of any given context or
> > application (apart from saying that it's plain text). What if, as you
> > suggest, someone in a given context would rather use ISO 639-2? The
> > Unicode Consortium shouldn't care, and that person's data shouldn't
> > be deemed non-conformant to the Unicode Standard simply because they
> > used ISO 639-2 rather than RFC 1766. The Unicode Consortium should
> > only care that *characters* get used in a particular way; it's kind
> > of like a UTR specifying that "color" must be spelled without a "u" -
> > making rules about how characters can be combined in areas that have
> > nothing to do with the properties of the characters themselves.
>
> I admit that my later comment, "UTR #7 falls into a somewhat different
> category from other Unicode mechanisms," touches on a gray area. I
> would suggest, however, that the characters we are discussing are not
> the normal ASCII alphabet from U+0020 to U+007E, but rather the special
> tag characters from U+E0020 to U+E007E. Unlike ASCII, these characters
> are for use only within tags, so it might be legitimate for UTR #7 to
> specify exactly how they are to be used.
>
> Note that "conformance" in this discussion is to UTR #7 only, *not* to
> the Unicode Standard itself. If UTR #7 were a Unicode Standard Annex
> (UAX), this would not be the case.

Well, here is the rub. UTR #7 soon will have to be *substantially*
revised, because with the advent of Unicode 3.1, incorporating the
contents of 10646-2 (when that content is firmly known, later this
fall), the Plane 14 tag characters *will* finally be formally a part
of the Unicode Standard.

The UTC hasn't decided yet exactly what to do about UTR #7 in that
context. One option would be to upgrade it to a Unicode Standard
Annex, with revisions, since the characters it discusses would then
officially be a part of the standard. Perhaps a more viable option
would be simply to supersede UTR #7, and to incorporate the still
viable information, in revised form, into whatever UAX will be used
to formally define The Unicode Standard, Version 3.1. At that
point, the revision for RFC 1766 can be added to the references,
presumably with appropriate cautionary language that points to
any successor for the revision in the future, so we don't have to
go change it again later.

That said, I am in general sympathy with Peter's opinion regarding
what should be taken as normative about the tag characters and
what not. The normative information that the Unicode Standard
must specify is how the tag characters themselves are interpreted,
and how the specific tag type(s), the cancel tag, and other tag
characters can be used to construct plain text tags. The exact
content of those tags would be beyond what would be normatively
specified this way.
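
To make the construction concrete, here is a minimal sketch in Python. It assumes only the code point layout discussed in this thread: the tag characters U+E0020 to U+E007E mirror ASCII U+0020 to U+007E, and a language tag is introduced by U+E0001 LANGUAGE TAG. The function name make_language_tag is hypothetical, chosen for illustration, and the sketch does not check the content of the tag.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG, introduces a language tag
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def make_language_tag(tag: str) -> str:
    """Spell `tag` in Plane 14 tag characters (hypothetical helper).

    Only the character-level construction is handled here; whether the
    content is a sensible language identifier is a separate question.
    """
    out = [LANGUAGE_TAG]
    for ch in tag:
        cp = ord(ch)
        # Only ASCII U+0020..U+007E has a tag-character counterpart.
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"no tag character for U+{cp:04X}")
        out.append(chr(cp + TAG_OFFSET))
    return "".join(out)


tagged = make_language_tag("en") + "Hello"
```

The split between the two concerns is deliberate: the mapping from ASCII to tag characters is the part that belongs to the character standard, while the string passed in is the part that would only be recommended.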

However, there is great benefit in making a very strong recommendation
about the content of language tags -- and making it in the context
of the Unicode Standard itself, rather than someplace else. Tying
them to RFC 1766 (or its successor) makes it possible to actually
use them and expect a general parser to be buildable. More important,
however, in my mind, is the precedent it sets for the *NON*-use of
tag characters, or rather the *NON*-misuse of tag characters. Most
of us, including those of us culpable in the definition of the
tag characters (which John Cowan pointed out were defined to head
off a worse threat to UTF-8) would prefer not to see them in
wide use, but rather the use of standard tagging mechanisms like
XML or HTML. Keeping the defined use of language tag characters
"in house" in the UTC makes it more difficult for arbitrary other
organizations to start proliferating usages of Plane 14 tag
characters that we would prefer not to see happen.
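
As a rough illustration of why a fixed syntax makes a general parser buildable, the sketch below checks only the shape of a tag. It assumes the RFC 1766 grammar as I understand it: a primary tag and any number of subtags, each 1 to 8 ASCII letters, separated by hyphens. The function name is hypothetical, and nothing here consults the ISO 639 or IANA registries, so a tag that passes may still be meaningless.

```python
import re

# Assumed RFC 1766 shape: 1-8 letters, then zero or more "-" + 1-8 letters.
_RFC1766_SHAPE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def looks_like_rfc1766_tag(tag: str) -> bool:
    """Return True if `tag` has the assumed RFC 1766 shape (syntax only)."""
    return _RFC1766_SHAPE.fullmatch(tag) is not None
```

A successor to RFC 1766 may well change this grammar, in which case only the pattern would need replacing.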

>
> > Let's understand something. Language tags composed of plane 14
> > characters are a form of markup, and I'd say that a document that
> > contains them isn't strictly speaking plain text. It's just that the
> > markup is done in a way that's different from other, more familiar
> > markup mechanisms.
>
> Arghhmmgmhmmm. Let's look at the definition of "plain text" in the
> Glossary of TUS 3.0 (p. 993):
[snip]

You're both right, of course. The Plane 14 tag characters, like
"plain text markup" schemes, constitute ways of doing markup
in band in plain text -- thereby creating marked-up text that
nevertheless can be handled as plain text by many protocols that
care not to parse out all the markup. Obvious example: the "View
Page Source" option for seeing and editing raw HTML text.

To summarize:

Plane 14 tag characters are plain text characters.

Unicode plain text that has been language-tagged by the use of
language tags constructed with Plane 14 tag characters is
marked-up text: it contains content sections (little snippets
of plain text) and metacontent sections (language tags). As
such, it is fancy text. However, because Plane 14 tag characters
are <i>plain</i> text characters (just like the "<i>" and
"</i>" tags on the previous line), it is *also* plain text,
and can, in principle, be handled by plain text editors,
viewers, and such.
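
The content/metacontent split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a conformant tag processor: it treats U+E0001 and U+E0020 to U+E007F as tag characters, groups consecutive runs of them as metacontent, and treats everything else as content. The function names are hypothetical.

```python
from itertools import groupby


def is_tag_char(ch: str) -> bool:
    """True for U+E0001 and U+E0020..U+E007F (assumed tag repertoire)."""
    cp = ord(ch)
    return cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F


def split_content_and_tags(text: str):
    """Return (content_runs, tag_runs), each in order of appearance."""
    content, tags = [], []
    for tag_run, chars in groupby(text, key=is_tag_char):
        (tags if tag_run else content).append("".join(chars))
    return content, tags


def decode_tag(run: str) -> str:
    """Map the ASCII-mirroring tag characters in a run back to ASCII.

    The tag identifier (U+E0001) and cancel tag (U+E007F) are dropped.
    """
    return "".join(
        chr(ord(c) - 0xE0000) for c in run if 0xE0020 <= ord(c) <= 0xE007E
    )


sample = "\U000E0001\U000E0065\U000E006EHello \U000E0001\U000E0066\U000E0072Bonjour"
content, tags = split_content_and_tags(sample)
```

A plain text editor that knows nothing of tags would simply carry the tag runs along as ordinary characters, which is the sense in which the tagged text is still plain text.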

--Ken


