From: Jim Allan (jallan@smrtytrek.com)
Date: Wed Feb 05 2003 - 16:47:20 EST
James Kass posted:
> The advantages of using P14 tags (...equals lang IDs mark-up) is
> that runs of text could be tagged *in a standard fashion* and
> preserved in plain-text.
But this still would not necessarily handle orthographic variations.
See Peter Constable's discussion of language classification and
orthographic classification at http://www.unicode.org/notes/tn8.
Currently, standard language tagging or orthographic tagging is
logically no more than a kludge once it tries to go beyond obviously
different languages that are unintelligible to users of other languages.
Which language tag protocol should Unicode adopt? Should it create its
own? That last seems beyond the mandate of Unicode.
There are often conflicting orthographic usages within a language.
Language tagging alone does not indicate whether German text is to be
rendered in Roman or Fraktur, whether Gaelic text is to be rendered in
Roman or Uncial, and if Uncial, a modern Uncial or more traditional
Uncial, whether English text is in Roman or Morse Code or Braille.
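To make the limitation concrete, here is a minimal Python sketch of how a Plane 14 language tag is formed: U+E0001 LANGUAGE TAG followed by tag characters, each of which is an ASCII character's code point plus 0xE0000. The function names are hypothetical, chosen only for illustration. Note that the tag carries nothing but the language identifier string; nothing in it says Roman or Fraktur.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000    # tag characters mirror ASCII 0x20..0x7E


def encode_language_tag(lang_id: str) -> str:
    """Encode an ASCII language identifier as a Plane 14 tag sequence."""
    for ch in lang_id:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not a printable ASCII character: {ch!r}")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(ch)) for ch in lang_id)


def strip_tags(text: str) -> str:
    """Drop every Plane 14 tag character, leaving the underlying plain text."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


# "de" says the run is German; it does not say how to render it.
tagged = encode_language_tag("de") + "Straße"
print([f"U+{ord(ch):04X}" for ch in tagged[:3]])
print(strip_tags(tagged))
```

Any distinction finer than the identifier itself would have to be squeezed into the identifier string, which is the kludge in question.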
Capital Eng is found in both pointed and rounded forms in Sami texts and
printed names, so far as I have read.
The pointed Eng is more common.
Does that mean it is "preferred" or only that it happens to be the more
common form in available fonts?
Perhaps the rounded Eng is actually "preferred" by most.
Perhaps most don't care at all, any more than they care whether the hook
on a _J_ descends below the baseline, whether the descender on _g_ is
open or closed, whether _a_ is rendered with an upper curl or not.
Certainly language tagging shouldn't be used to distinguish between such
forms, unless specifically requested by organizations that can show that
their request is supported by a very large proportion of the users of
the language.
But even then, do not those who disagree have the right to dissent, to
push their own desires in spelling or orthography?
Language tagging and orthography tagging are not all that is needed.
One sometimes *needs* to show emphasis, for example in a database of
books and articles one may need to catalogue titles like "Comments on
the _Tao_Te_Ching_" (see http://www.friesian.com/taote.htm).
To be correct, the book title *must* be italicized, unless the article
title appears in italicized text, in which case it should be non-italic
to contrast.
Titles of articles in mathematics or chemistry may contain superscript
and subscript characters beyond those hard-coded in Unicode.
These cannot be indexed in a database as plain text.
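A rough way to see how patchy the hard-coded superscripts are is to ask the Unicode character database which characters have a superscript compatibility form. The sketch below uses Python's standard unicodedata module; the helper names are hypothetical, the scan is limited to the BMP for speed, and the result depends on the Unicode version bundled with the Python in use.

```python
import unicodedata


def build_superscript_map(limit: int = 0x10000) -> dict:
    """Map base characters to a code point whose decomposition is <super> base."""
    mapping = {}
    for cp in range(limit):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only single-character superscripts such as '<super> 0032'.
        if len(parts) == 2 and parts[0] == "<super>":
            mapping.setdefault(chr(int(parts[1], 16)), chr(cp))
    return mapping


SUPERSCRIPTS = build_superscript_map()


def missing_superscripts(text: str) -> list:
    """Return the characters of text that have no superscript form in the map."""
    return [ch for ch in text if ch not in SUPERSCRIPTS]


print(SUPERSCRIPTS["2"])
print(missing_superscripts("2@"))
```

Whatever falls in the "missing" list can only be raised by mark-up, not by choosing a different character.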
Plain text is not adequate for *so much* normal use. But who ever
claimed it was? Plain text is only the underlying text, which is
sometimes, alone, sufficient.
At the moment XML seems to be the mark-up protocol towards which most
are moving, and there seems to be no point in duplicating its features
in Unicode, unless Unicode can somehow do it better.
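As a sketch of how mark-up already covers both needs, the following Python fragment parses a catalogue entry in which the language is carried by the standard xml:lang attribute and the italic run by an element. The element names `title` and `i`, and the choice of "zh" for the book title, are assumptions made up for this example and not any particular schema.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical catalogue record: element names are illustrative only.
record = (
    '<title xml:lang="en">Comments on the '
    '<i xml:lang="zh">Tao Te Ching</i></title>'
)

title = ET.fromstring(record)

# The underlying plain text, usable for indexing and searching.
plain_text = "".join(title.itertext())

# The italic runs, each with its own language tag.
italic_runs = [(el.get(XML_LANG), el.text) for el in title.iter("i")]

print(plain_text)
print(title.get(XML_LANG), italic_runs)
```

The plain text stays untouched underneath, and the emphasis and language information live in the mark-up layer.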
Jim Allan