I do hope the TUC will not accept the "plane 14" proposal due
to its restrictive nature: it allows "ordinary" strings in/as tags
but only allows 7-bit ASCII characters in those strings.
Add a single codepoint, TAG INDICATOR, to Unicode/10646, at U+FFFB.
It indicates that one Unicode/10646 character immediately preceding it
is (part of) a tag. For "surrogate pairs" (UTF-16 extension), the TAG
INDICATOR applies to the pair, since such a pair encodes a single character.
In a composing character sequence where the entire sequence is part of a
tag, each character must be marked by its own TAG INDICATOR, e.g. for a
decomposed Å in a tag: A+TAG INDICATOR+COMBINING RING ABOVE+TAG INDICATOR.
This is equivalent to letting the entire plane 14 be a copy
of plane 0, including "surrogate" characters.
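The marking rule above can be sketched in a few lines of Python. This is only an illustration: TAG_INDICATOR uses the U+FFFB value suggested in this post, and mark_as_tag is a hypothetical helper name. Python strings are sequences of code points, so a character that needs a surrogate pair in UTF-16 receives a single TAG INDICATOR.

```python
# Code point suggested in this post; an assumption, not an assigned character name.
TAG_INDICATOR = "\uFFFB"

def mark_as_tag(s: str) -> str:
    """Follow every code point of s with a TAG INDICATOR (hypothetical helper)."""
    return "".join(ch + TAG_INDICATOR for ch in s)

# Decomposed A-ring: A + COMBINING RING ABOVE, each marked separately.
tagged = mark_as_tag("A\u030A")
```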
When rendering the text, the pair of a Unicode/10646 character
followed by a TAG INDICATOR is normally *completely* ignored *as
characters*. Tags may, however, by their semantics affect the rendering,
e.g. by affecting automatic hyphenation. Tags may still be rendered in
special modes where tags are made visible.
TAG INDICATORs occurring at the very beginning of a string or
directly after a TAG INDICATOR are disallowed in the same way as
U+FFFF and U+FFFE are disallowed.
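A renderer that ignores tags could, under these rules, look roughly like the sketch below. The name strip_tags and the choice to raise ValueError on a misplaced TAG INDICATOR are assumptions made for illustration; the post only says such sequences are disallowed, not how software should react.

```python
TAG_INDICATOR = "\uFFFB"  # value suggested in this post (assumption)

def strip_tags(s: str) -> str:
    """Return only the untagged characters of s (hypothetical helper)."""
    visible = []
    pending = None  # last character seen, not yet known to be text or tag
    for ch in s:
        if ch == TAG_INDICATOR:
            if pending is None:
                # At the start of the string or right after another indicator.
                raise ValueError("TAG INDICATOR with no character to mark")
            pending = None  # the marked character is part of a tag: drop it
        else:
            if pending is not None:
                visible.append(pending)
            pending = ch
    if pending is not None:
        visible.append(pending)
    return "".join(visible)
```

For example, strip_tags applied to "x", a marked decomposed Å, and "y" yields "xy".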
This suggestion allows any Unicode character to be a part of
a tag via this mechanism. Even if not used immediately, there are
no restrictions here on which characters can occur in tags
marked by TAG INDICATORs. The TAG INDICATOR makes tags of any
syntax easy to ignore for software that does not process the tags
(except for keeping them), but processes the text between tags.
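As a rough illustration of software that keeps tags without interpreting them, the sketch below applies a function to the text between tags and passes the tagged characters through unchanged. The names split_runs and map_text are hypothetical, and the U+FFFB value is the one suggested in this post.

```python
TAG_INDICATOR = "\uFFFB"  # value suggested in this post (assumption)

def split_runs(s: str):
    """Split s into (is_tag, characters) runs; indicators are removed."""
    runs = []
    chars = list(s)
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == TAG_INDICATOR:
            raise ValueError("TAG INDICATOR with no character to mark")
        is_tag = i + 1 < len(chars) and chars[i + 1] == TAG_INDICATOR
        if runs and runs[-1][0] == is_tag:
            runs[-1][1].append(ch)
        else:
            runs.append([is_tag, [ch]])
        i += 2 if is_tag else 1
    return [(is_tag, "".join(cs)) for is_tag, cs in runs]

def map_text(s: str, fn) -> str:
    """Apply fn to untagged runs only; tag runs are re-marked and kept as-is."""
    parts = []
    for is_tag, text in split_runs(s):
        if is_tag:
            parts.append("".join(c + TAG_INDICATOR for c in text))
        else:
            parts.append(fn(text))
    return "".join(parts)
```

No knowledge of the tag syntax is needed here: the software only has to recognise the TAG INDICATOR itself.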
This suggestion is put forward in response to the "Plane 14"
proposal, whose marked restriction to 7-bit ASCII shows a lack of
foresight about which characters may be used in this kind of
tag in the future.
For simple tagging purposes, have a simple tagging scheme
where, for simplicity, like tags nest. There seems to be a desire
to let non-like tags not nest, however.
Syntax (described informally):
where, in the parts between « and » inclusive, every character has a
COMBINING META immediately after it.
Same keyword tags nest, different keyword tags do NOT nest.
Tags with unrecognised keywords can then be completely ignored.
The French quotes are there only to make it easier to
restore/insert the COMBINING METAs. They could be dropped
if you think the COMBINING METAs will always be there.
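How the French quotes help restore lost COMBINING METAs can be sketched as follows. No code point is given for COMBINING META in this post, so the private-use value U+E000 below is purely a stand-in, and restore_metas is a hypothetical helper name.

```python
# Stand-in only: the post does not assign a code point to COMBINING META.
COMBINING_META = "\uE000"

def restore_metas(s: str) -> str:
    """Re-insert a COMBINING META after every character from « to » inclusive."""
    out = []
    in_tag = False
    for ch in s:
        if ch == COMBINING_META:
            continue  # drop surviving metas; they are re-added uniformly below
        if ch == "«":
            in_tag = True
        out.append(ch)
        if in_tag:
            out.append(COMBINING_META)
        if ch == "»":
            in_tag = False
    return "".join(out)
```

The sketch treats every « as the start of a tag, so it would misfire on text that uses French quotes as ordinary punctuation after the COMBINING METAs have been lost.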
This simple syntax is sufficient for language tagging,
original charcode tagging, and annotation-type tagging.
This suggestion is put forward in response to the
"plane 14" proposal's syntax for tags, where some tags have
a very different syntax from other tags, and where no tags
nest, which adds complexity when handling tagged text.
Though the syntax is a little bit similar to some higher
level markup languages, the syntax is still much simpler,
and need not even be understood by processes that ignore
Introduce the keyword "lan" (or "language" or "L"; pick one),
and a set of identifiers for languages. An example string could
then be (COMBINING METAs not shown):
(ÅL = Åland) I.e.:
« c.m. l c.m. a c.m. n c.m. = c.m. s c.m. v c.m. - c.m. A c.m. c.r.a. c.m.
L c.m. » c.m. s j u t t i o s j u « c.m. § c.m. l c.m. a c.m. n c.m. » c.m.
where c.m.=COMBINING META and c.r.a.=COMBINING RING ABOVE.
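The spelled-out example above, together with the nesting rule, can be turned into a small sketch of a tag reader. It assumes the tag shape seen in this post's examples («keyword=identifier» to open, «§keyword» to close) and works on text with the COMBINING METAs already removed; annotate is a hypothetical name, and ignoring an unmatched closing tag is an assumption.

```python
import re

# Tag shape inferred from the examples in this post (assumption).
TAG = re.compile(r"«(§?)([^=»]+)(?:=([^»]*))?»")

def annotate(s: str):
    """Return [(text, {keyword: identifier})] for each run of untagged text."""
    stacks = {}  # keyword -> stack of identifiers; each keyword nests on its own

    def active():
        return {k: v[-1] for k, v in stacks.items() if v}

    result = []
    pos = 0
    for m in TAG.finditer(s):
        if m.start() > pos:
            result.append((s[pos:m.start()], active()))
        closing, keyword, ident = m.group(1), m.group(2), m.group(3)
        if closing:
            if stacks.get(keyword):
                stacks[keyword].pop()  # unmatched closers are silently ignored
        else:
            stacks.setdefault(keyword, []).append(ident)
        pos = m.end()
    if pos < len(s):
        result.append((s[pos:], active()))
    return result

# The example spelled out above, with A + COMBINING RING ABOVE for Å.
example = annotate("«lan=sv-A\u030AL»sjuttiosju«§lan»")
```

Keeping one stack per keyword is what lets same-keyword tags nest while tags with different keywords stay independent of each other.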
Since this is supposed to be a "low/medium level" tagging,
as opposed to "high level" tagging (e.g. HTML/XML/SGML/(La)TeX/...),
it should be very hard to add new keywords to the tagging scheme.
Adding new identifiers (see the informal syntax), however, should be
easier and may benefit from some variant of John Cowan's Java-inspired
suggestion (e.g. «lan=sv-SE.im_Kent_No·lhôtte»Hössen haru't?«§lan»).