Re: Plane 14 redux (was: Same language, two locales)

From: Doug Ewell (dewell@compuserve.com)
Date: Sun Sep 03 2000 - 11:44:13 EDT


Peter,

Thanks for your response.

> But, the problems with UTR#7 making a normative reference to a
> particular system for language identification are (a) that systems
> get revised (RFC 1766 will become obsolete before long),

This is one reason I have suggested making reference to ISO standards
in the past, rather than RFCs. When ISO standards get revised, they
retain the number and name of the earlier version, so documents that
reference those standards are *automatically* updated to the new ISO
revision. RFCs are not revised, but rather replaced, so documents that
refer to an RFC are linked to that version forever, or at least until
the referring document is updated.

> and (b) that it's doing so in the absence of any given context or
> application (apart from saying that it's plain text). What if, as you
> suggest, someone in a given context would rather use ISO 639-2? The
> Unicode Consortium shouldn't care, and that person's data shouldn't
> be deemed non-conformant to the Unicode Standard simply because they
> used ISO 639-2 rather than RFC 1766. The Unicode Consortium should
> only care that *characters* get used in a particular way; it's kind
> of like a UTR specifying that "color" must be spelled without a "u" -
> making rules about how characters can be combined in areas that have
> nothing to do with the properties of the characters themselves.

I admit that my later comment, "UTR #7 falls into a somewhat different
category from other Unicode mechanisms," touches on a gray area. I
would suggest, however, that the characters we are discussing are not
the normal ASCII alphabet from U+0020 to U+007E, but rather the special
tag characters from U+E0020 to U+E007E. Unlike ASCII, these characters
are for use only within tags, so it might be legitimate for UTR #7 to
specify exactly how they are to be used.
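
To make the parallel between the two ranges concrete, here is a minimal
sketch of how an ASCII language identifier might be mapped onto the tag
characters. The function names are hypothetical, and the use of U+E0001
(LANGUAGE TAG) as the introducer is my reading of UTR #7 rather than
something spelled out in this message; treat it as an illustration, not
a reference implementation.

```python
# Hypothetical helper names; only the code point ranges come from the
# discussion above. Each tag character is its ASCII counterpart + 0xE0000.
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # assumed introducer, per my reading of UTR #7


def to_tag_chars(ascii_text: str) -> str:
    """Map printable ASCII (U+0020..U+007E) onto U+E0020..U+E007E."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: U+{cp:04X}")
        out.append(chr(cp + TAG_OFFSET))
    return "".join(out)


def from_tag_chars(tag_text: str) -> str:
    """Inverse mapping: U+E0020..U+E007E back to printable ASCII."""
    out = []
    for ch in tag_text:
        cp = ord(ch)
        if not 0xE0020 <= cp <= 0xE007E:
            raise ValueError(f"not a tag character: U+{cp:04X}")
        out.append(chr(cp - TAG_OFFSET))
    return "".join(out)


def make_language_tag(identifier: str) -> str:
    """Build a tag: the introducer followed by the identifier in tag chars."""
    return LANGUAGE_TAG + to_tag_chars(identifier)
```

Note that nothing here checks whether the identifier is a valid RFC 1766
or ISO 639-2 value; the mapping itself is indifferent to which system
the identifier comes from, which is really the point under dispute.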

Note that "conformance" in this discussion is to UTR #7 only, *not* to
the Unicode Standard itself. If UTR #7 were a Unicode Standard Annex
(UAX), this would not be the case.

> Let's understand something. Language tags composed of plane 14
> characters are a form of markup, and I'd say that a document that
> contains them isn't strictly speaking plain text. It's just that the
> markup is done in a way that's different from other, more familiar
> markup mechanisms.

Arghhmmgmhmmm. Let's look at the definition of "plain text" in the
Glossary of TUS 3.0 (p. 993):

"Computer-encoded text that consists *only* of a sequence of code
values from a given standard, with no other formatting or structural
information. Plain text interchange is commonly used between computer
systems that do not share higher-level protocols."
(original emphasis)

Plane 14 characters are code values from the Unicode Standard (or will
be as soon as a suitable version of Unicode refers to them). They do
not employ any formatting or other mechanism external to the Unicode
Standard. If Plane 14 characters can be considered markup, then so can
directional overrides, layout controls, and even C0 controls like CR
and LF. Of course, nobody would ever consider CR and LF *not* to be
plain text, so where do we draw the line? I suggest simply observing
the line the Unicode Consortium has drawn.
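
As a rough illustration of that point, the sketch below treats a tagged
string as nothing more than a sequence of Unicode code values: it
survives a UTF-8 round trip unchanged, and a process that does not care
about tags can drop them with a simple range test, with no external
protocol involved. The function name and the sample string are
hypothetical, and the filter range U+E0000..U+E007F is my assumption
about the extent of the tag block.

```python
def strip_plane14_tags(text: str) -> str:
    """Drop every code point in the assumed tag block U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


# Hypothetical sample: a language tag spelling "en", then ordinary text.
tagged = "\U000E0001\U000E0065\U000E006E" + "hello"

# The tagged string is ordinary encoded text: it round-trips through
# UTF-8 like any other sequence of Unicode code values.
round_tripped = tagged.encode("utf-8").decode("utf-8")

# A consumer that ignores tags only needs the range test above.
visible = strip_plane14_tags(tagged)
```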

> (I'll be presenting on this topic next week - come and hear, if
> you're interested.)

I am intensely interested, but unfortunately my work schedule probably
won't permit any travel at present. I would be interested in any
transcripts, summaries, etc. from your presentation.

-Doug Ewell
 Fullerton, California


