From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 07 2009 - 17:24:58 CDT
Asmus responded:
> > What would be the best course of action?
> >
> > Things I have considered:
> > A) Work with the Unicode consortium to make sure that there are
> > normative properties for the purpose.
> >
> I think it would be ill advised for the UTC to create a normative
> property just for this purpose.
I concur with that assessment.
>
> However, a general informative property, that identifies which paired
> characters were encoded as a pair might be of interest.
>
> Such a property could and should be extended to other paired characters,
> like the arrows, because those are absent from BidiMirroringGlyph (as
> arrows are not mirrored) but knowing which arrow is the directional pair
> of which other one is useful nevertheless.
Of note here is that unlike the property envisioned by Joachim, which
would enable, for example, highlighting of (...) pairs and the
like in programming editors aware of the syntax rules, a
task to identify "paired characters" in Unicode would extend
to intentionally encoded pairs symmetric around other axes --
not merely to the left/right pairs relevant to bidirectional
behavior. Thus:
U+2190 LEFTWARDS ARROW
U+2192 RIGHTWARDS ARROW
but also:
U+2191 UPWARDS ARROW
U+2193 DOWNWARDS ARROW
It would also include partial symmetries:
U+2272 LESS-THAN OR EQUIVALENT TO
U+2273 GREATER-THAN OR EQUIVALENT TO
It would be extended to quad-set symmetries (which also apply
to the arrows), such as the crop characters, U+230C..U+230F.
And to other kinds of set symmetries, such as those involved
in the box drawing characters (U+2500..U+257F) or the
glyph fragment sets (U+239B..U+23B3, etc.), the trigram,
tetragram, and hexagram symbol sets, Braille pattern symbols,
etc., etc.
And rotational and reflectional symmetries amongst letters,
which usually aren't considered by people thinking about
syntactic bracket matching tasks, but which nonetheless represent
significant pairings for other purposes:
U+0074 LATIN SMALL LETTER T
U+0287 LATIN SMALL LETTER TURNED T
U+1681 OGHAM LETTER BEITH
U+1686 OGHAM LETTER UATH
U+A846 PHAGS-PA LETTER JA
U+A855 PHAGS-PA LETTER ZA
U+1489 CANADIAN SYLLABICS CE
through
U+14A0 CANADIAN SYLLABICS CWAA
etc.
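
To make the arrow and relational-operator examples concrete, here is a
minimal Python sketch. It is not a Unicode property and not anything the
UTC has defined: the NAME_SWAPS table and the name_partner helper are
assumptions made up for illustration. It guesses a partner by swapping
direction words in the character name, and it shows that the arrows are
not covered by the bidi mirroring data. It finds none of the letter
pairs listed above.

```python
import unicodedata

# Hypothetical name-word swaps; an assumption for illustration,
# not data from the Unicode Character Database.
NAME_SWAPS = [
    ("LEFTWARDS", "RIGHTWARDS"),
    ("UPWARDS", "DOWNWARDS"),
    ("LESS-THAN", "GREATER-THAN"),
]

def name_partner(ch):
    """Guess a partner character by swapping one word in the name."""
    name = unicodedata.name(ch, "")
    words = name.split()
    for a, b in NAME_SWAPS:
        for src, dst in ((a, b), (b, a)):
            if src not in words:
                continue
            try:
                return unicodedata.lookup(name.replace(src, dst))
            except KeyError:
                continue  # no character carries the swapped name
    return None

for cp in (0x2190, 0x2191, 0x2272, 0x0028):
    ch = chr(cp)
    partner = name_partner(ch)
    print(f"U+{cp:04X}",
          "mirrored" if unicodedata.mirrored(ch) else "not mirrored",
          None if partner is None else f"U+{ord(partner):04X}")
```

A name heuristic like this breaks down quickly (turned letters, Ogham,
Phags-pa, box drawing), which is part of why an explicit informative
listing would be more useful than inference from names.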
>
> It is a legitimate task for Unicode to identify such paired characters,
> especially as they are not always coded together. Such a property speaks
> primarily to the *identity* of the character, which is the primary
> concern of the UTC when encoding characters. The property would not
> primarily speak to how such characters are *used*, because, as I've
> tried to indicate, such usage rules are too varied, too context and
> language dependent, to be captured by a Unicode character property.
I concur.
> > B) Require that a programming language is nailed down to a single
> > version of Unicode. (I think Java essentially does this.)
> >
> Syntax definitions should be nailed down like that. It's not as much of
> a burden, because the kinds of characters you describe here are not
> being added in large numbers.
Furthermore, it is important, for a syntax definition, to
distinguish between a true syntax requirement versus a desire
to include more generic multilingual text as part of the
"non-syntax" (i.e., as identifiers, comments, and such), while
still being able to do parse-like operations on it. General
text is just intractable in that regard, and should, IMO, be
left to general word-processing programs, rather than attempting
to incorporate rules for it into programming languages.
As Asmus pointed out, even something as simple as parenthesis
matching is intractable in general text, if for no other reason
than that people use them asymmetrically in many contexts,
as witness your own usage:
> > C) Require that programming languages with this kind of Unicode support
    ^^^
I think that a programming language syntax definition should
limit its use of paired delimiters to a well-defined set,
whose behavior is well-defined by the syntax.
I would recommend limiting that set to paired delimiters which
already have the Unicode property Pattern_Syntax=True, which
will constrain the problem to the kinds of punctuation and
symbols that are most useful for such purposes, and which
are less entangled with script- and/or language-specific
behavior.
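
As a minimal sketch of that recommendation, the Python below matches
delimiters against a small closed table and nothing else. The table is
hand-chosen for an imaginary language and is an assumption, as is the
function name; Python's standard library does not expose Pattern_Syntax,
so the sketch does not query the property, although every character
listed does have Pattern_Syntax=True.

```python
# Hypothetical delimiter table for an imaginary language. Each opener
# maps to exactly one closer; no other character counts as a delimiter.
PAIRS = {
    "(": ")",
    "[": "]",
    "{": "}",
    "\u27E8": "\u27E9",  # MATHEMATICAL LEFT/RIGHT ANGLE BRACKET
}
CLOSERS = set(PAIRS.values())

def first_unmatched(source):
    """Return the index of the first delimiter error, or None.

    An error is a closer with no matching opener, or, at end of
    input, the most recent opener that was never closed.
    """
    stack = []
    for i, ch in enumerate(source):
        if ch in PAIRS:
            stack.append((i, ch))
        elif ch in CLOSERS:
            if not stack or PAIRS[stack[-1][1]] != ch:
                return i
            stack.pop()
    return stack[-1][0] if stack else None

print(first_unmatched("f(a[1], {b})"))
print(first_unmatched("C) Require that"))
```

Because the table is closed, the list-label parenthesis in the second
call is reported as an error instead of being guessed at, which is the
behavior a syntax definition wants and general text cannot tolerate.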
And then even within the set of Pattern_Syntax=True characters,
there are lots which should be avoided in any formal syntax
definitions. The quotation marks are obvious examples.
Quoting within formal syntax is tricky enough as it is.
Adding numerous quotation marks which are used differently
in different typographical traditions, and expecting people
to use and interpret them consistently, is just asking for
trouble. It is far safer to stick to a very small set
of well-defined quoting characters and conventions, rather
than thinking the programming language would be somehow
improved by opening it to use of everything that is
Quotation_Mark=True in the Unicode Character Database.
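
A small Python sketch of the conservative approach follows: one quoting
character is allowed, and typographic quotation marks in syntax
positions are flagged. The allowed set and the function name are
assumptions for illustration. Python's standard library does not expose
Quotation_Mark, so the sketch approximates it with the general
categories Pi and Pf, which miss some quotation marks (for example the
low-9 marks, which have category Ps).

```python
import unicodedata

# Hypothetical: the only quoting character the imaginary syntax allows.
ALLOWED_QUOTES = {'"'}

def disallowed_quotes(source):
    """List (index, char, name) for typographic quotes in source.

    Approximation: general categories Pi and Pf stand in for the
    Quotation_Mark property, which the stdlib does not provide.
    """
    hits = []
    for i, ch in enumerate(source):
        if ch in ALLOWED_QUOTES:
            continue
        if unicodedata.category(ch) in ("Pi", "Pf"):
            hits.append((i, ch, unicodedata.name(ch, "<unnamed>")))
    return hits

print(disallowed_quotes('x = "ok"'))
print(disallowed_quotes("x = \u201cbad\u201d"))
```

A real lexer would apply such a check only to syntax tokens and leave
the contents of string literals and comments alone.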
--Ken