Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Mon Sep 24 2007 - 07:59:32 CDT


    >> I played around with the ability to add digraphs to "." and came up
    >> with two methods. The first would be to specifically list them using
    >> syntax such as:
    >
    > I'd just like to point out that a "[ ]" regular expression is defined to
    > match always exactly one character (if it matches at all).

    Correct. Except that a Spanish speaker would consider "ch" to be a
    single character even though you need two code points to represent it.

    > You can write "[abcdef]" as "(a|b|c|d|e|f)" if you like. You can also write
    > "(a|bb|ccc|dddd|eeeee|ffffff)", but there is no form using "[ ]" to match the
    > same thing.

    There is a mechanism (not sure of the origin) for specifying that
    a sequence of code points is to be treated as a single character
    in a character class: [\q{ch}]. [abcdef\q{ch}] matches the same
    thing as (a|b|c|d|e|f|ch).
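
    A rough Python sketch of that equivalence, for illustration only.
    Python's standard re module has no \q{...} syntax, so the helper below
    (class_to_alternation, a hypothetical name) expands the members by hand
    into an alternation. Putting longer members first is an assumption of
    this sketch: Python's engine takes the first alternative that matches,
    so with the order written above it would never reach "ch".

```python
import re


def class_to_alternation(members):
    """Build an alternation that stands in for a class with multi-code-point members.

    Hypothetical helper: it emulates [abcdef\\q{ch}] by hand, because the
    standard re module does not implement \\q{...}.
    """
    # Assumption: longest members first, so "ch" is tried before "c".
    ordered = sorted(set(members), key=lambda m: (-len(m), m))
    return "(?:" + "|".join(re.escape(m) for m in ordered) + ")"


members = list("abcdef") + ["ch"]
emulated = re.compile(class_to_alternation(members))

# The alternation exactly as written in the message, in source order.
literal = re.compile("(?:a|b|c|d|e|f|ch)")

print(emulated.findall("chad"))
print(literal.findall("chad"))
```

    With the longest-first ordering, "chad" yields "ch" as one unit; with
    the literal ordering, Python matches "c" alone and skips the "h".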

    > "[ ]" exists primarily as an optimisation, because matching 1 character
    > against a set is a fast operation, whereas checking against an unknown number of
    > alternatives of potentially varying lengths ("( | )") is expensive.

    Yes, and it is also more readable (though the \q construct lowers
    readability): [0-9] is much better than (0|1|2|3|4|5|6|7|8|9).

    > So a sequence specified like [^ ] could never match a whole message, or the
    > string "New York": it could only match a single character.

    I have always agreed with this.

    > What exactly this means in the context of Unicode is a different matter, but
    > I imagine some sort of historical consistency is desirable.

    The historical character class is based on ASCII. A Unicode version
    needs to be able to represent any character in any language, so in
    some cases, you need to specify multiple code points. This is not
    just for things like "ch" but also for a base character + combining
    mark that does not have a precomposed form in the standard.
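
    A small Python sketch of the combining-mark case, under the assumption
    that q + U+0303 (combining tilde) is a fair example of a sequence with
    no precomposed form, while n + U+0303 composes to U+00F1. It uses only
    the standard unicodedata and re modules:

```python
import re
import unicodedata

# NFC composes a base + combining mark when a precomposed character exists.
n_tilde = unicodedata.normalize("NFC", "n\u0303")  # becomes U+00F1
q_tilde = unicodedata.normalize("NFC", "q\u0303")  # stays as two code points

print(len(n_tilde), len(q_tilde))

# A bracket class matches a single code point, so it cannot cover the
# whole two-code-point sequence.
as_class = re.fullmatch("[q\u0303]", q_tilde)

# Spelling the sequence out as a group does match it.
as_group = re.fullmatch("(?:q\u0303)", q_tilde)

print(as_class is None, as_group is not None)
```

    So even after normalization, a class limited to single code points has
    no way to treat such a sequence as one member.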

    Mike


