Re: Matching opening and closing characters: How?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 07 2009 - 17:24:58 CDT

    Asmus responded:

    > > What would be the best course of action?
    > >
    > > Things I have considered:
    > > A) Work with the Unicode consortium to make sure that there are
    > > normative properties for the purpose.
    > >
    > I think it would be ill advised for the UTC to create a normative
    > property just for this purpose.

    I concur with that assessment.

    >
    > However, a general informative property, that identifies which paired
    > characters were encoded as a pair might be of interest.
    >
    > Such a property could and should be extended to other paired characters,
    > like the arrows, because those are absent from BidiMirroringGlyph (as
    > arrows are not mirrored) but knowing which arrow is the directional pair
    > of which other one is useful nevertheless.

    Note that the property Joachim envisions would enable, for
    example, highlighting of (...) pairs and the like in programming
    editors aware of the syntax rules. By contrast, a task to identify
    "paired characters" in Unicode would extend to intentionally
    encoded pairs symmetric around other axes -- not merely to the
    left/right pairs relevant to bidirectional behavior. Thus:

    U+2190 LEFTWARDS ARROW
    U+2192 RIGHTWARDS ARROW

    but also:

    U+2191 UPWARDS ARROW
    U+2193 DOWNWARDS ARROW
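
As a rough illustration of why the mirroring data does not cover these pairs, the sketch below queries the Bidi_Mirrored property through Python's standard `unicodedata` module. The helper name `is_bidi_mirrored` is made up for this example; parentheses are reported as mirrored, the arrows are not.

```python
import unicodedata

def is_bidi_mirrored(ch):
    """Return True if the character has Bidi_Mirrored=Yes in the UCD."""
    return bool(unicodedata.mirrored(ch))

# Parentheses first, then the four arrows listed above.
for ch in "()\u2190\u2192\u2191\u2193":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"mirrored={is_bidi_mirrored(ch)}")
```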

    It would also include partial symmetries:

    U+2272 LESS-THAN OR EQUIVALENT TO
    U+2273 GREATER-THAN OR EQUIVALENT TO
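
No such pairing property exists in the Unicode Character Database, so the following is only a heuristic sketch: it guesses a character's counterpart by swapping direction words in its character name. The word table `_SWAPS` and the function `counterpart` are assumptions made for illustration, and many real pairs would not be found this way.

```python
import unicodedata

# Hypothetical word pairs used to guess a counterpart from the name.
# This is NOT an official Unicode property, only a naming heuristic.
_SWAPS = {
    "LEFT": "RIGHT",
    "LEFTWARDS": "RIGHTWARDS",
    "UPWARDS": "DOWNWARDS",
    "LESS-THAN": "GREATER-THAN",
}
_SWAPS.update({v: k for k, v in list(_SWAPS.items())})

def counterpart(ch):
    """Return the character whose name differs only by swapped
    direction words, or None if no such character exists."""
    words = unicodedata.name(ch, "").split()
    swapped = [_SWAPS.get(w, w) for w in words]
    if swapped == words:
        return None  # unnamed, or no direction word in the name
    try:
        return unicodedata.lookup(" ".join(swapped))
    except KeyError:
        return None  # the swapped name is not a character name
```

For example, `counterpart("\u2190")` yields U+2192 and `counterpart("\u2272")` yields U+2273, while `counterpart("a")` yields `None`.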

    It would be extended to quad-set symmetries (which also apply
    to the arrows), such as the crop characters, U+230C..U+230F.
    And to other kinds of set symmetries, such as those involved
    in the box drawing characters (U+2500..U+257F) or the
    glyph fragment sets (U+239B..U+23B3, etc.), the trigram,
    tetragram, and hexagram symbol sets, Braille pattern symbols,
    etc., etc.
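
To make the idea of a set symmetry concrete, here is a small sketch that groups the characters of a range by what remains of their names once orientation words are removed. The word list `ORIENTATION_WORDS` and the function `symmetry_sets` are assumptions for this example; a name-based grouping is a crude stand-in for a real property.

```python
import unicodedata
from collections import defaultdict

# Hypothetical list of orientation words; an assumption for this sketch.
ORIENTATION_WORDS = {"TOP", "BOTTOM", "LEFT", "RIGHT", "UPPER", "LOWER"}

def symmetry_sets(first, last):
    """Group named characters in first..last by their name with
    orientation words stripped; keep groups of two or more."""
    groups = defaultdict(list)
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")
        if not name:
            continue  # unassigned or unnamed code point
        residue = " ".join(
            w for w in name.split() if w not in ORIENTATION_WORDS
        )
        if residue != name:
            groups[residue].append(chr(cp))
    return {key: chars for key, chars in groups.items() if len(chars) > 1}

# The four crop marks fall into a single group keyed by "CROP".
crops = symmetry_sets(0x230C, 0x230F)
```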

    And rotational and reflectional symmetries amongst letters,
    which usually aren't considered by people thinking about
    syntactic bracket matching tasks, but which nonetheless represent
    significant pairings for other purposes:

    U+0074 LATIN SMALL LETTER T
    U+0287 LATIN SMALL LETTER TURNED T

    U+1681 OGHAM LETTER BEITH
    U+1686 OGHAM LETTER UATH

    U+A846 PHAGS-PA LETTER JA
    U+A855 PHAGS-PA LETTER ZA

    U+1489 CANADIAN SYLLABICS CE
    through
    U+14A0 CANADIAN SYLLABICS CWAA

    etc.
     
    >
    > It is a legitimate task for Unicode to identify such paired characters,
    > especially as they are not always coded together. Such a property speaks
    > primarily to the *identity* of the character, which is the primary
    > concern of the UTC when encoding characters. The property would not
    > primarily speak to how such characters are *used*, because, as I've
    > tried to indicate, such usage rules are too varied, too context and
    > language dependent, to be captured by a Unicode character property.

    I concur.

    > > B) Require that a programming language is nailed down to a single
    > > version of Unicode. (I think Java essentially does this.)
    > >
    > Syntax definitions should be nailed down like that. It's not as much of
    > a burden, because the kinds of characters you describe here are not
    > being added in large numbers.
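
One way an implementation might enforce such pinning is to compare the version of the Unicode data it was built against with the version the syntax definition names. The sketch below uses Python's `unicodedata.unidata_version`; the constant `PINNED_UNICODE_VERSION` and its value are purely hypothetical.

```python
import unicodedata

# Hypothetical version named by a syntax definition (an assumption).
PINNED_UNICODE_VERSION = "5.1.0"

def unicode_version_matches(pinned, actual=unicodedata.unidata_version):
    """Return True if the runtime's Unicode data is the pinned version."""
    return pinned == actual

if not unicode_version_matches(PINNED_UNICODE_VERSION):
    print(f"warning: syntax is defined against Unicode "
          f"{PINNED_UNICODE_VERSION}, runtime data is "
          f"{unicodedata.unidata_version}")
```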

    Furthermore, it is important, for a syntax definition, to
    distinguish between a true syntax requirement versus a desire
    to include more generic multilingual text as part of the
    "non-syntax" (i.e., as identifiers, comments, and such), while
    still being able to do parse-like operations on it. General
    text is just intractable in that regard, and should, IMO, be
    left to general word-processing programs, rather than attempting
    to incorporate rules for it into programming languages.

    As Asmus pointed out, even something as simple as parenthesis
    matching is intractable in general text, if for no other reason
    than that people use them asymmetrically in many contexts,
    as witness your own usage:

    > > C) Require that programming languages with this kind of Unicode support
        ^^^
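
A minimal sketch of a conventional stack-based matcher shows the problem: it works for well-formed syntax, but flags the list label quoted above as an error. The table `PAIRS` and the function `first_unmatched` are illustrative names, not part of any proposal in this thread.

```python
# Fixed delimiter table for this sketch (an assumption).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def first_unmatched(text):
    """Return the index of the first closer that has no matching
    opener, else the index of the earliest opener left unclosed,
    else None if everything balances."""
    stack = []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((i, ch))
        elif ch in CLOSERS:
            if not stack or PAIRS[stack[-1][1]] != ch:
                return i
            stack.pop()
    return stack[0][0] if stack else None

print(first_unmatched("f(a[0], {1})"))     # balanced source text
print(first_unmatched("C) Require that"))  # list label in plain prose
```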

    I think that a programming language syntax definition should
    limit its use of paired delimiters to a well-defined set,
    whose behavior is well-defined by the syntax.

    I would recommend limiting that set to paired delimiters which
    already have the Unicode property Pattern_Syntax=True, which
    will constrain the problem to the kinds of punctuation and
    symbols that are most useful for such purposes, and which
    are less entangled with script- and/or language-specific
    behavior.
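
Python's standard `unicodedata` module does not expose Pattern_Syntax, so the sketch below cannot test that property directly. Instead it assumes a small explicit delimiter table (the ASCII brackets, which to my knowledge are all Pattern_Syntax=True) and sanity-checks it against the general categories Ps and Pe. The names `DELIMITERS` and `check_delimiter_table` are made up for this example.

```python
import unicodedata

# Hypothetical delimiter set for a syntax definition (an assumption).
DELIMITERS = {"(": ")", "[": "]", "{": "}"}

def check_delimiter_table(table):
    """Return a list of complaints about entries whose opener is not
    General_Category Ps or whose closer is not General_Category Pe.
    This is a stand-in check, not a Pattern_Syntax test."""
    problems = []
    for opener, closer in table.items():
        if unicodedata.category(opener) != "Ps":
            problems.append(f"U+{ord(opener):04X} is not category Ps")
        if unicodedata.category(closer) != "Pe":
            problems.append(f"U+{ord(closer):04X} is not category Pe")
    return problems

print(check_delimiter_table(DELIMITERS))
```

Curly quotation marks fail this check, since they fall in the categories Pi and Pf rather than Ps and Pe.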

    And then even within the set of Pattern_Syntax=True characters,
    there are lots which should be avoided in any formal syntax
    definitions. The quotation marks are obvious examples.
    Quoting within formal syntax is tricky enough as it is.
    Adding numerous quotation marks which are used differently
    in different typographical traditions, and expecting people
    to use and interpret them consistently, is just asking for
    trouble. It is far safer to stick to a very small set
    of well-defined quoting characters and conventions, rather
    than thinking the programming language would be somehow
    improved by opening it to use of everything that is
    Quotation_Mark=True in the Unicode Character Database.
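
A tool that keeps to a small quoting set might want to flag typographic quotes that slip into source text. The sketch below approximates this with the general categories Pi and Pf, because Python's `unicodedata` does not expose Quotation_Mark; that property also covers characters in other categories, so this is only an approximation. The function name `typographic_quotes` is an assumption.

```python
import unicodedata

def typographic_quotes(source):
    """Return (index, character name) for every character whose
    General_Category is Pi or Pf. An approximation only: it is not
    the same set as Quotation_Mark=True."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(source)
        if unicodedata.category(ch) in ("Pi", "Pf")
    ]

print(typographic_quotes('x = "hi"'))            # ASCII quotes only
print(typographic_quotes("x = \u201chi\u201d"))  # curly quotes
```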

    --Ken


