Re: Processing of default ignorable code points

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Aug 05 2004 - 15:54:29 CDT


    At 11:11 AM 8/5/2004, Peter Kirk wrote:
    >In TUS 4.0 Section 5.3, p.111, the following is stated of default
    >ignorable code points:
    >
    >>These characters are also ignored except with respect to specific,
    >>defined processes; for example, ZERO WIDTH NON-JOINER is ignored in
    >>collation. ... For more information, see Section 5.20, Default Ignorable
    >>Code Points.
    >
    >
    >But in Section 5.20, although there is a lot about rendering default
    >ignorable code points, there is no further information about any other
    >processing of them. The implication of that section seems to be that these
    >characters are intended to be ignored in rendering but not in other
    >processes such as collation.

    You are correct in that the default (!) behavior of these characters in
    any process depends on the purpose of that process. For most
    Unicode-defined processes (even where the definitions themselves are
    'default' definitions), the behavior of every character is in fact
    defined by the combination of its relevant Unicode properties and the
    rules of the published algorithm.

    Rendering is special in that we do *not* provide a general algorithm, so if
    we intend a specific default behavior, it needs to be stated in the text.

    > Is this or the summary in Section 5.3 in fact to be taken as the
    > intention of the standard? Has the summary simply not been updated for
    > consistency with the fuller details? Or has the fuller description been
    > unintentionally restricted to rendering?

    The summary is correct.

    >Is it in fact the intention that all default ignorable characters must
    >always be ignored in collation? Or is it possible to tailor collation not
    >to ignore them? The collation algorithm seems to suggest the latter, in
    >that there seems to be no mention of these characters being obligatorily
    >ignored - although I presume they have zero weight by default (in DUCET).

    Correct. By the way, these characters are called Default_Ignorable, and
    not Must_Ignore, for a reason. You are always free to tailor things so
    that they are not ignored. Even in rendering there is such a tailoring:
    the 'show controls' mode, which makes some or all of these characters
    visible.
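
A minimal sketch of "ignored by default, but tailorable", in Python. It is not the Unicode Collation Algorithm and does not use DUCET weights; `comparison_key`, its `ignore_default_ignorables` flag, and the small `SAMPLE_DEFAULT_IGNORABLE` set are hypothetical names made up for illustration, and the set is only a partial sample of the characters carrying the property.

```python
# Assumption: a small, partial sample of default ignorable code points.
# The authoritative list is the Default_Ignorable_Code_Point property
# in the Unicode Character Database.
SAMPLE_DEFAULT_IGNORABLE = {
    0x00AD,  # SOFT HYPHEN
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x2060,  # WORD JOINER
} | set(range(0xFE00, 0xFE10))  # VARIATION SELECTOR-1..16


def comparison_key(text, ignore_default_ignorables=True):
    """Return a naive comparison key for text.

    By default the sampled ignorable characters are dropped; a caller can
    "tailor" the behavior by passing ignore_default_ignorables=False.
    """
    if ignore_default_ignorables:
        return "".join(
            ch for ch in text if ord(ch) not in SAMPLE_DEFAULT_IGNORABLE
        )
    return text


plain = "ab"
with_zwnj = "a\u200cb"

# Default behavior: the ZWNJ is ignored, so the keys compare equal.
print(comparison_key(plain) == comparison_key(with_zwnj))

# Tailored behavior: the ZWNJ is kept, so the keys differ.
print(comparison_key(plain, False) == comparison_key(with_zwnj, False))
```

The point of the flag is only that ignoring is a default a process may override, not an obligation.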

    >This has some quite serious implications for processing of texts including
    >ZW(N)J, variation selectors etc.

    How these characters are treated is important, but there isn't as much of
    an issue here as you make it out to be.

    A./

    Other relevant sources of text about these are

    UCD.html:

    For programmatic determination of default-ignorable code points. New
    characters that should be ignored in processing (unless explicitly
    supported) will be assigned in these ranges, permitting programs to
    correctly handle the default behavior of such characters when not otherwise
    supported. For more information, see UAX #29: Text Boundaries
    <http://www.unicode.org/reports/tr29/>.

    with no mention of default-ignorable in the text of that UAX. (I've just
    filed a web-report on that).
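
The range-based determination described in the UCD.html excerpt could be sketched as below, in Python. The range table is an assumption: a partial list transcribed from my understanding of recent versions of the Default_Ignorable_Code_Point data, not the table that applied to Unicode 4.0, and a real implementation should load the ranges from the Unicode Character Database. `DEFAULT_IGNORABLE_RANGES` and `is_default_ignorable` are hypothetical names.

```python
from bisect import bisect_right

# Assumption: partial, illustrative ranges (inclusive), sorted by start.
DEFAULT_IGNORABLE_RANGES = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x200B, 0x200F),    # zero-width space/joiners, directional marks
    (0x2060, 0x206F),    # WORD JOINER and neighbors, incl. reserved slots
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1..16
    (0xFEFF, 0xFEFF),    # ZERO WIDTH NO-BREAK SPACE
    (0xE0000, 0xE0FFF),  # tags, supplementary variation selectors, reserved
]
_STARTS = [lo for lo, _ in DEFAULT_IGNORABLE_RANGES]


def is_default_ignorable(code_point):
    """Return True if code_point falls in one of the sampled ranges."""
    # Find the last range whose start is <= code_point, then check its end.
    i = bisect_right(_STARTS, code_point) - 1
    return i >= 0 and code_point <= DEFAULT_IGNORABLE_RANGES[i][1]


print(is_default_ignorable(0x200C))   # ZERO WIDTH NON-JOINER
print(is_default_ignorable(0x0041))   # LATIN CAPITAL LETTER A
print(is_default_ignorable(0xE0080))  # reserved slot inside a range
```

Because whole ranges are covered, a code point that is still unassigned inside such a range is already treated as ignorable, which is what lets a program handle characters added there later without being updated.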


