From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Aug 05 2004 - 15:54:29 CDT
At 11:11 AM 8/5/2004, Peter Kirk wrote:
>In TUS 4.0 Section 5.3, p.111, the following is stated of default
>ignorable code points:
>
>>These characters are also ignored except with respect to specific,
>>defined processes; for example, ZERO WIDTH NON-JOINER is ignored in
>>collation. ... For more information, see Section 5.20, Default Ignorable
>>Code Points.
>
>
>But in Section 5.20, although there is a lot about rendering default
>ignorable code points, there is no further information about any other
>processing of them. The implication of that section seems to be that these
>characters are intended to be ignored in rendering but not in other
>processes such as collation.
You are correct in that the default (!) behavior of these characters in
any process depends on the purpose of that process. For most
Unicode-defined processes (even where the definitions themselves are
'default' definitions), the behavior of all characters is in fact defined
by the combination of their relevant Unicode properties and the rules of
the published algorithm.
Rendering is special in that we do *not* provide a general algorithm, so if
we intend a specific default behavior, it needs to be stated in the text.
> Is this or the summary in Section 5.3 in fact to be taken as the
> intention of the standard? Has the summary simply not been updated for
> consistency with the fuller details? Or has the fuller description been
> unintentionally restricted to rendering?
The summary is correct.
>Is it in fact the intention that all default ignorable characters must
>always be ignored in collation? Or is it possible to tailor collation not
>to ignore them? The collation algorithm seems to suggest the latter, in
>that there seems to be no mention of these characters being obligatorily
>ignored - although I presume they have zero weight by default (in DUCET).
Correct. By the way, these characters are called Default_Ignorable, and not
Must_Ignore, for a reason. You are always free to tailor things so that they
are not ignored. Even rendering has such a tailoring: the 'show controls'
mode, which makes some or all of these characters visible.
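
To illustrate the difference between "ignored by default" and "must be
ignored", here is a minimal Python sketch. It is not the Unicode Collation
Algorithm: sort_key is a hypothetical helper that compares raw code points,
and SAMPLE_DEFAULT_IGNORABLE is a small, incomplete set assumed for the
example. It only mimics the effect of zero weights, and of a tailoring that
turns them off.

```python
# Illustrative, incomplete subset of default-ignorable code points.
# Assumption for this sketch only; the authoritative list is the
# Default_Ignorable_Code_Point property in the Unicode Character Database.
SAMPLE_DEFAULT_IGNORABLE = frozenset(
    [0x00AD, 0x200C, 0x200D] + list(range(0xFE00, 0xFE10))
)


def sort_key(text, ignorable=SAMPLE_DEFAULT_IGNORABLE):
    """Hypothetical code-point sort key that skips 'ignorable' characters.

    Skipping a character here stands in for giving it zero weight.
    Passing a different 'ignorable' set stands in for a tailoring.
    """
    return tuple(ord(ch) for ch in text if ord(ch) not in ignorable)


words = ["ab", "a\u200ca"]  # second word contains ZERO WIDTH NON-JOINER

# Default behavior: ZWNJ is skipped, so "a<ZWNJ>a" compares like "aa".
default_order = sorted(words, key=sort_key)

# Tailored behavior: nothing is skipped, so ZWNJ takes part in the comparison.
tailored_order = sorted(words, key=lambda w: sort_key(w, ignorable=frozenset()))
```

With the default set, the word containing ZWNJ sorts as if the character
were absent. With the empty set, it sorts after "ab" because U+200C is
compared like any other code point.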
>This has some quite serious implication for processing of texts including
>ZW(N)J, variation selectors etc.
How these characters are treated is important, but there isn't as much of
an issue here as you make it out to be.
A./
Another relevant source of text about these is UCD.html, which says:

  For programmatic determination of default-ignorable code points. New
  characters that should be ignored in processing (unless explicitly
  supported) will be assigned in these ranges, permitting programs to
  correctly handle the default behavior of such characters when not otherwise
  supported. For more information, see UAX #29: Text Boundaries
  <http://www.unicode.org/reports/tr29/>.

However, the text of that UAX makes no mention of default-ignorable code
points. (I've just filed a web report on that.)
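
The "programmatic determination" mentioned in that passage can be sketched
as a range lookup. The table below is an illustrative, incomplete subset
that I am assuming for the example; a real implementation would load the
Default_Ignorable_Code_Point ranges from the Unicode Character Database for
the version it targets. The names SAMPLE_RANGES and is_default_ignorable are
hypothetical.

```python
from bisect import bisect_right

# Illustrative, incomplete subset of default-ignorable ranges (inclusive).
# Assumption for this sketch; consult the Unicode Character Database for
# the authoritative, version-specific list.
SAMPLE_RANGES = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x200B, 0x200F),    # includes ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER
    (0x2060, 0x206F),    # WORD JOINER and neighbors, including reserved slots
    (0xFE00, 0xFE0F),    # variation selectors
    (0xFEFF, 0xFEFF),    # ZERO WIDTH NO-BREAK SPACE
    (0xE0000, 0xE0FFF),  # tag characters and reserved code points
]
_STARTS = [start for start, _ in SAMPLE_RANGES]


def is_default_ignorable(cp):
    """Return True if code point cp falls in one of the sample ranges."""
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and SAMPLE_RANGES[i][0] <= cp <= SAMPLE_RANGES[i][1]
```

Because whole ranges are listed, a reserved code point inside a range is
treated as ignorable even before a character is assigned there, which is
the point of assigning such characters "in these ranges".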