Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Thu Sep 20 2007 - 12:11:35 CDT


    > Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
    >
    > http://www.unicode.org/reports/tr18/tr18-12.html
    >
    > This proposed update clarifies conformance requirements for "." and CRLF.
    > Public feedback is invited.

    I disagree with the MUSTs in the proposed text. In my implementation,
    whether "." matches newline sequences is independent of "multiline
    mode." Multiline mode affects only the behavior of ^ and $, not of ".".
    In single-line mode, ^ matches only at the beginning of the text and $
    matches only at the end of the text (or just before a final newline
    sequence). In multiline mode, ^ matches at the beginning of the string
    or after any newline sequence, and $ matches before any newline
    sequence or at the end of the string.
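
    The independence of the two settings can be illustrated with Python's
    standard re module, used here purely as a stand-in (it is not the
    implementation described in this message). In re, DOTALL controls
    only "." and MULTILINE controls only ^ and $. Note that re treats
    only "\n" as a newline, not the full set of newline sequences.

```python
import re

text = "one\ntwo\n"

# "." is governed by DOTALL alone.
dot_default = re.findall(r"e.t", text)             # "." stops at "\n"
dot_all = re.findall(r"e.t", text, re.DOTALL)      # "." may cross "\n"

# "$" is governed by MULTILINE alone.
# Single-line mode: end of text, or just before a final newline.
end_single = [m.start() for m in re.finditer(r"$", text)]
# Multiline mode: before every newline, and at end of text.
end_multi = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

# Turning on DOTALL does not change where "^" matches.
start_dotall = [m.start() for m in re.finditer(r"^", text, re.DOTALL)]

print(dot_default, dot_all)
print(end_single, end_multi, start_dotall)
```

    Either flag can be set without the other, which is the behavior
    argued for above.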

    You can turn on the DotMatchesNewline and MultilineMatching options
    separately. As a side note, I implemented "." to match a default
    grapheme cluster, so A + ACUTE is treated as a single entity, and
    Hangul syllables are kept together (you can also specify them using
    \L+\V+\T* if you want). There is also a DotMatchesDefective option
    (true by default) which determines whether . will match a defective
    combining character sequence (or you can look specifically for
    defective sequences using \F).
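
    As a rough illustration of what it means for "." to consume a whole
    cluster, here is a minimal Python sketch. It is a simplification, not
    the implementation described above: it only attaches combining marks
    (general categories Mn, Mc, Me) to the preceding character and ignores
    Hangul jamo and the other default grapheme cluster rules. The names
    simple_clusters, is_defective and dot_units are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # General categories Mn, Mc and Me are combining marks.
    return unicodedata.category(ch).startswith("M")

def simple_clusters(text):
    """Split text into base + combining-mark runs (simplified clusters)."""
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def is_defective(cluster):
    # A sequence with no base character: it starts with a combining mark.
    return is_mark(cluster[0])

def dot_units(text, match_defective=True):
    """Clusters that a cluster-aware "." would accept, one per match."""
    return [c for c in simple_clusters(text)
            if match_defective or not is_defective(c)]

print(simple_clusters("A\u0301b"))   # A + COMBINING ACUTE stays together
print(dot_units("\u0301x", match_defective=False))
```

    The match_defective parameter only mimics the idea of the
    DotMatchesDefective option; how the real option behaves in detail is
    not specified here.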

    > If you have comments for official UTC consideration, please post them by
    > submitting your comments through our feedback & reporting page:
    >
    > http://www.unicode.org/reporting.html

    A few months ago I reported a problem with UTS #18 using this page,
    but I never received any confirmation other than that the web server
    apparently accepted my message. The problem I reported was not
    addressed in this new update, so I don't have a lot of confidence in
    this method of reporting problems. Here is what I submitted:

    In Section 2.2 which discusses Default Grapheme Clusters, it says:

         A typical implementation of the inverse of a set containing
         literal clusters simply removes those strings, thus
         [^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ].

    I think this is bad implementation advice that leads to strange
    behavior. In the example given, the behavior will be correct, since
    each of the clusters begins with a letter that is also contained in
    the class. However, consider a character class containing only
    clusters, e.g. [^\q{ch} \q{ll} \q{rr}]. Simply removing the clusters
    leaves an empty negated class, which matches -anything-. This is
    incorrect behavior: the class should not match at the beginning of
    the word "chile", for instance.

    The way I implemented this was to create a "normal" (non-negated)
    character class containing all the listed characters and grapheme
    clusters, and then invert the result of the match operation. For
    the classes above, the normal class matches "chile" at the first
    position, so the negated class returns a "no match" result.
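
    The difference between the two strategies can be sketched in a few
    lines of Python. This is only an illustration under simple
    assumptions: class members are plain strings, a match is tested at a
    single position, and the question of how many characters a negated
    match should consume is left out. All function names are
    hypothetical.

```python
def match_class(items, text, pos):
    """Length of the longest class member matching at pos, or 0 if none."""
    best = 0
    for item in items:
        if text.startswith(item, pos) and len(item) > best:
            best = len(item)
    return best

def negated_by_removal(items, text, pos):
    # Quoted advice: drop the multi-character strings, then negate the rest.
    singles = [item for item in items if len(item) == 1]
    return pos < len(text) and match_class(singles, text, pos) == 0

def negated_by_inverting_match(items, text, pos):
    # Approach from this message: match the positive class, invert the result.
    return pos < len(text) and match_class(items, text, pos) == 0

clusters_only = ["ch", "ll", "rr"]              # [^\q{ch} \q{ll} \q{rr}]
print(negated_by_removal(clusters_only, "chile", 0))          # True
print(negated_by_inverting_match(clusters_only, "chile", 0))  # False
```

    With removal, the class becomes empty and the negation succeeds at
    the start of "chile"; with inversion, the positive class matches "ch"
    there, so the negated class fails as intended.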

    Mike


