From: Mark Davis ⌛ (mark@macchiato.com)
Date: Fri Aug 07 2009 - 13:01:46 CDT
We get this question often as well.
Mark
On Fri, Aug 7, 2009 at 00:46, Joachim Durchholz <jo@durchholz.org> wrote:
> Hi,
>
> I'm trying to make a piece of software determine if an opening and a
> closing character match.
>
> E.g. ( matches ), [ matches ], « matches ».
>
> I'm looking for a workable approach to base that on Unicode character
> properties, but I see lots of problems and would appreciate any advice
> on how to proceed.
>
>
> Problem group 1: Determining which characters are in the "quotation
> marks" set and in the "parentheses" sets, respectively.
>
> 1a) For quotation marks, General Category: Initial Quote Punctuation and
> Final Quote Punctuation is a good first approximation, but it's missing
> some characters (particularly the ASCII single and double quotes " and
> ', but also e.g. U+FF02 Fullwidth Quotation Mark).
>
> 1b) The Quotation Mark property is more complete (in particular it does
> contain " and ' ), but it is just informative and hence not subject to a
> stability policy. That's a no-go for a programming language - imagine
> all strings turning into syntax errors because the Unicode Consortium
> decides to drop the Quotation Mark property from " !
>
> 1c) Given the problems with quote punctuation, I'm worried that General
> Category: Start Punctuation and End Punctuation may be incomplete as
> well. I can't check that, partly because the character set is so huge
> and partly because I'm no expert in foreign character sets.
>
> 1d) There seem to be errors in the categorization of some characters.
> U+201a ‚ SINGLE LOW-9 QUOTATION MARK strikes me as a quote (Gc: Initial
> Quote Punctuation), but its General Category is Start Punctuation just
> like Left Parenthesis.
> The same goes for U+201e „ DOUBLE LOW-9 QUOTATION MARK.
>
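
The General Category values discussed in problem group 1 can be inspected
directly. What follows is a minimal Python sketch using the standard
unicodedata module. The helper name delimiter_role and the OPENING/CLOSING
sets are illustrative assumptions, not anything defined by Unicode. The
unicodedata module does not expose the Quotation Mark property, so only the
General Category approximation from 1a) is shown.

```python
import unicodedata

# General_Category values that suggest an opening or closing delimiter.
# These sets are an assumption for illustration, not a Unicode definition.
OPENING = {"Ps", "Pi"}  # Start Punctuation, Initial Quote Punctuation
CLOSING = {"Pe", "Pf"}  # End Punctuation, Final Quote Punctuation


def delimiter_role(ch):
    """Return 'open', 'close', or None, based only on General_Category."""
    gc = unicodedata.category(ch)
    if gc in OPENING:
        return "open"
    if gc in CLOSING:
        return "close"
    return None


if __name__ == "__main__":
    # ASCII quotes and U+FF02 are category Po, so this check misses them (1a).
    # U+201A and U+201E are reported as Ps rather than Pi (1d).
    for ch in "()\u00ab\u00bb\u201a\u201e\"'\uff02":
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {delimiter_role(ch)}")
```

Running it against the characters named above is a quick way to see which
ones fall outside the Pi/Pf and Ps/Pe categories in the Unicode version that
a given Python build ships with.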
>
> Problem group 2: How to determine that two characters match?
>
> Assuming I have an opening and a closing character and know they're
> either parentheses or quotes: on what criteria could I decide whether
> they match?
> E.g. ( would match ), but ( would not match ].
>
> 2a. I found only one property that even lists another character as its
> property value, namely Bidi Mirroring Glyph. However, it is again only
> informative.
>
> 2b. It does not cover vertical scripts: characters intended for use in
> vertical context such as U+23b4 Top Square Bracket don't have a Bidi
> Mirroring Glyph.
>
> 2c. It may be erroneous, too: U+fd3e Ornate Left Parenthesis is not
> linked up with U+fd3f Ornate Right Parenthesis.
>
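
One way to sidestep the gaps listed in problem group 2 is to keep an explicit
pair table under the language's own control instead of deriving pairs from a
property. The Python sketch below assumes such a hand-maintained table; PAIRS,
matches and is_bidi_mirrored are hypothetical names, and the table contents
are only examples taken from this thread. Python's unicodedata module exposes
only the boolean Bidi Mirrored property via mirrored(), not the Bidi Mirroring
Glyph mapping itself, so it can serve as a sanity check at most.

```python
import unicodedata

# Hypothetical, hand-maintained table of opener -> closer.
# A real language definition would fix its own list.
PAIRS = {
    "(": ")",
    "[": "]",
    "{": "}",
    "\u00ab": "\u00bb",  # LEFT/RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\ufd3e": "\ufd3f",  # ORNATE LEFT/RIGHT PARENTHESIS (see 2c)
}


def matches(opener, closer):
    """True if closer is the registered partner of opener."""
    return PAIRS.get(opener) == closer


def is_bidi_mirrored(ch):
    """Boolean Bidi_Mirrored property; says nothing about the partner."""
    return bool(unicodedata.mirrored(ch))


if __name__ == "__main__":
    print(matches("(", ")"), matches("(", "]"))
    for opener in PAIRS:
        print(f"U+{ord(opener):04X} mirrored={is_bidi_mirrored(opener)}")
```

The design trade-off is that the table never changes underneath existing
source code, at the cost of having to extend it by hand when new paired
characters are wanted.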
>
> I'll want to "normalize away" compatibility characters and confusables.
> This may take care of the problems with concrete character groups that I
> listed as potentially erroneous above (they may get rejected or
> normalized away anyway).
> I haven't opened the big can of worms that "confusables" represent yet,
> though; it's the next big thing on my reading list.
>
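
The effect of normalizing away compatibility characters can be tried out with
NFKC normalization. This is a minimal Python sketch; fold_compat is a
hypothetical helper name. It covers only compatibility mappings: confusables
are a separate data set that NFKC does not address.

```python
import unicodedata


def fold_compat(text):
    """Apply NFKC, which replaces compatibility characters by their mappings."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for ch in "\uff02\uff08\u00ab\ufd3e":
        folded = fold_compat(ch)
        codes = " ".join(f"U+{ord(c):04X}" for c in folded)
        print(f"U+{ord(ch):04X} -> {codes}")
```

For example, U+FF02 Fullwidth Quotation Mark folds to the ASCII double quote,
while characters without a compatibility mapping, such as U+00AB, come back
unchanged.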
>
> What would be the best course of action?
>
> Things I have considered:
> A) Work with the Unicode Consortium to make sure that there are
> normative properties for the purpose. I'm not sure that that's possible,
> it may turn out to be too big a burden for me and/or unwelcome from the
> side of the Unicode Consortium. (I'm a private person, so a membership
> is probably too expensive and/or too taxing on my time.)
> B) Require that a programming language is nailed down to a single
> version of Unicode. (I think Java essentially does this.)
> C) Require that programming languages with this kind of Unicode support
> start with a marker that nails down the Unicode version. (This is highly
> undesirable as it makes copying and pasting code an inherently
> unreliable operation: pasted code may have its semantics changed because
> the new context assumes a different version of Unicode.)
>
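
As an illustration of option B, Python's unicodedata module reports the
Unicode version it was built with and also ships a frozen Unicode 3.2
database (unicodedata.ucd_3_2_0). The sketch below uses that frozen database
as a stand-in for "the version the language is nailed down to"; the names
PINNED and category_changed are hypothetical.

```python
import unicodedata

# Frozen Unicode 3.2.0 database shipped with CPython, used here as a
# stand-in for a pinned version. A real language would pick its own version.
PINNED = unicodedata.ucd_3_2_0


def category_changed(ch):
    """True if General_Category differs between the pinned and current data."""
    return PINNED.category(ch) != unicodedata.category(ch)


if __name__ == "__main__":
    print("current:", unicodedata.unidata_version)
    print("pinned: ", PINNED.unidata_version)
    for ch in "(\u201a\u23b4\ufd3e":
        print(f"U+{ord(ch):04X} changed={category_changed(ch)}")
```

A check like this shows which delimiter candidates would behave differently
if the language moved from its pinned Unicode version to a newer one.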
>
> Any insights and advice appreciated.
>
> Regards,
> Jo
>
>
>