Re: Matching opening and closing characters: How?

From: Mark Davis ⌛ (mark@macchiato.com)
Date: Fri Aug 07 2009 - 13:01:46 CDT


    We get this question often as well.

    Mark

    On Fri, Aug 7, 2009 at 00:46, Joachim Durchholz <jo@durchholz.org> wrote:

    > Hi,
    >
    > I'm trying to make a piece of software determine if an opening and a
    > closing character match.
    >
    > E.g. ( matches ), [ matches ], « matches ».
    >
    > I'm looking for a workable approach that bases this on Unicode character
    > properties, but I see lots of problems and would appreciate any advice
    > on how to proceed.
    >
    >
    > Problem group 1: Determining which characters are in the "quotation
    > marks" set and in the "parentheses" sets, respectively.
    >
    > 1a) For quotation marks, the General Category values Initial Quote
    > Punctuation (Pi) and Final Quote Punctuation (Pf) are a good first
    > approximation, but they miss some characters (particularly the ASCII
    > single and double quotes " and ', but also e.g. U+FF02 Fullwidth
    > Quotation Mark).
    >
    > 1b) The Quotation Mark property is more complete (in particular it does
    > contain " and ' ), but it is just informative and hence not subject to a
    > stability policy. That's a no-go for a programming language - imagine
    > all strings turning into syntax errors because the Unicode consortium
    > decides to drop the Quotation Mark property from " !
    >
    > 1c) Given the problems with quote punctuation, I'm worried that General
    > Category: Start Punctuation and End Punctuation may be incomplete as
    > well. I can't check that, partly because the character set is so huge
    > and partly because I'm no expert in foreign character sets.
    >
    > 1d) There seem to be errors in the categorization of some characters.
    > U+201a ‚ SINGLE LOW-9 QUOTATION MARK strikes me as a quote (Gc: Initial
    > Quote Punctuation), but its General Category is Start Punctuation just
    > like Left Parenthesis.
    > The same goes for U+201e „ DOUBLE LOW-9 QUOTATION MARK.
    >
    >
    > Problem group 2: How to determine that two characters match?
    >
    > Assuming I have an opening and a closing character and know they're
    > either parentheses or quotes: on what criteria could I decide whether
    > they match?
    > E.g. ( would match ), but ( would not match ].
    >
    > 2a) I found only one property that even lists another character as its
    > property value, namely Bidi Mirroring Glyph. However, it is again only
    > informative.
    >
    > 2b) It does not cover vertical scripts: characters intended for use in
    > vertical contexts, such as U+23b4 Top Square Bracket, don't have a Bidi
    > Mirroring Glyph.
    >
    > 2c) It may be erroneous, too: U+fd3e Ornate Left Parenthesis is not
    > linked up with U+fd3f Ornate Right Parenthesis.
    >
    >
    > I'll want to "normalize away" compatibility characters and confusables.
    > This may take care of the problems with concrete character groups that I
    > listed as potentially erroneous above (they may get rejected or
    > normalized away anyway).
    > I haven't yet opened the big can of worms that "confusables" represent,
    > though; it's the next big thing on my reading list.
    >
    >
    > What would be the best course of action?
    >
    > Things I have considered:
    > A) Work with the Unicode consortium to make sure that there are
    > normative properties for the purpose. I'm not sure that's possible: it
    > may turn out to be too big a burden for me and/or unwelcome from the
    > side of the Unicode consortium. (I'm a private person, so a membership
    > is probably too expensive and/or too taxing on my time.)
    > B) Require that a programming language is nailed down to a single
    > version of Unicode. (I think Java essentially does this.)
    > C) Require that programming languages with this kind of Unicode support
    > start with a marker that nails down the Unicode version. (This is highly
    > undesirable as it makes copying and pasting code an inherently
    > unreliable operation: pasted code may have its semantics changed because
    > the new context assumes a different version of Unicode.)
    >
    >
    > Any insights and advice appreciated.
    >
    > Regards,
    > Jo
    >
    >
    >
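
The two checks asked about in the quoted message (which set does a character belong to, and do two characters match) can be sketched with Python's standard unicodedata module. This is only an illustration under stated assumptions, not a recommended design: ROLE_BY_CATEGORY, PAIRS, role and matches are hypothetical names, and PAIRS is a small hand-maintained sample rather than data derived from any Unicode property. The standard library exposes General_Category and the Bidi_Mirrored flag, but not Bidi_Mirroring_Glyph or Quotation_Mark, so those two properties are not used here.

```python
import unicodedata

# General_Category values that describe paired punctuation.
ROLE_BY_CATEGORY = {
    "Ps": "open",         # Start Punctuation, e.g. ( [ {
    "Pe": "close",        # End Punctuation, e.g. ) ] }
    "Pi": "open-quote",   # Initial Quote Punctuation, e.g. U+00AB, U+201C
    "Pf": "close-quote",  # Final Quote Punctuation, e.g. U+00BB, U+201D
}

# Hypothetical pair table, fixed by the language definition itself so that
# it cannot change when the Unicode data changes. Sample entries only.
PAIRS = {
    "(": ")",
    "[": "]",
    "{": "}",
    "\u00ab": "\u00bb",  # double angle quotation marks
    "\u201c": "\u201d",  # double curly quotation marks
}


def role(ch):
    """Return the role implied by General_Category, or None if there is none."""
    return ROLE_BY_CATEGORY.get(unicodedata.category(ch))


def matches(opening, closing):
    """Return True if `closing` is the registered partner of `opening`."""
    return PAIRS.get(opening) == closing


if __name__ == "__main__":
    # Unicode version of the data bundled with this interpreter.
    print("Unicode data version:", unicodedata.unidata_version)
    for ch in ["(", ")", "\u00ab", '"', "\u201a", "\uff02"]:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            role(ch),
            bool(unicodedata.mirrored(ch)),
        )
```

With this sketch, role('"') returns None because the ASCII double quote has General Category Po (point 1a), while role("\u201a") returns "open" because SINGLE LOW-9 QUOTATION MARK is classified as Start Punctuation (point 1d). Keeping the pair table inside the language definition is one way to sidestep the stability concern; unicodedata.unidata_version shows that each Python release is itself tied to a single Unicode version, comparable to option B.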


