Re: "plain text" and plane 14 lang tags

From: DougEwell2@cs.com
Date: Tue Feb 05 2002 - 03:51:03 EST


In a message dated 2002-02-04 9:07:22 Pacific Standard Time,
Peter_Constable@sil.org writes:

>> In plain text, I think that plane 14 language tags could be used
>
> It seems to me that such usage confuses the meaning of "plain text". Use
> of the plane 14 tagging characters to indicate language would be markup
> -- metadata that is separate from the content and that has some impact on
> how the content should be processed.

I'm afraid this is one place where Peter and I are forever destined to
disagree. While Plane 14 tags do perform a markup-like function -- just as
the directional overrides and variation selectors do -- they are discrete
Unicode characters, and so, by definition, they are plain text. From TUS
3.0, page 16: "The Unicode Standard encodes plain text."

> It's just a coincidence that the
> markup uses distinct characters from the content.

It's not a coincidence at all. Plane 14 in general, and the specific code
points in particular, were intentionally chosen to ensure that the tag
characters would not conflict with any other characters.

In HTML, the string "<span lang="xh">" -- a sequence of ordinary ASCII
characters -- has a special, higher-level meaning that is defined by the
markup language. In another context, that string might not have the same
meaning; another string might convey that meaning, or there might not be any
such markup available.

By contrast, the Unicode sequence U+E0001 U+E0078 U+E0068 has only one
meaning, defined by the character encoding standard as clearly as it defines
the letter A (if not more so).
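
To make that sequence concrete, here is a minimal Python sketch of how a
language code can be spelled in Plane 14 tag characters. It assumes the tag
letters mirror printable ASCII at an offset of U+E0000 and that U+E0001
LANGUAGE TAG introduces the tag. The names encode_language_tag and
code_points are hypothetical helpers for illustration, not part of any
library.

```python
TAG_BASE = 0xE0000      # assumption: tag characters mirror ASCII at this offset
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG, which introduces a language tag


def encode_language_tag(lang: str) -> str:
    """Spell an ASCII language code as a Plane 14 tag sequence."""
    # Only printable ASCII (U+0020..U+007E) has a tag-character counterpart.
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print(code_points(encode_language_tag("xh")))  # U+E0001 U+E0078 U+E0068
```

Unlike the HTML string, none of the resulting code points doubles as an
ordinary content character, so no parsing context is needed to recognize
them as a tag.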

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell@adelphia.net)


