From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Aug 05 2004 - 13:11:55 CDT
In TUS 4.0 Section 5.3, p.111, the following is stated of default
ignorable code points:
> These characters are also ignored except with respect to specific,
> defined processes; for example, ZERO WIDTH NON-JOINER is ignored in
> collation. ... For more information, see Section 5.20, Default
> Ignorable Code Points.
But in Section 5.20, although there is a lot about rendering default
ignorable code points, there is no further information about any other
processing of them. The implication of that section seems to be that
these characters are intended to be ignored in rendering but not in
other processes such as collation. Which is in fact to be taken as the
intention of the standard: this, or the summary in Section 5.3? Has the summary
simply not been updated for consistency with the fuller details? Or has
the fuller description been unintentionally restricted to rendering?
Is it in fact the intention that all default ignorable characters must
always be ignored in collation? Or is it possible to tailor collation
not to ignore them? The collation algorithm seems to suggest the latter,
in that there seems to be no mention of these characters being
obligatorily ignored - although I presume they have zero weight by
default (in DUCET).
This has some quite serious implications for processing of texts
including ZW(N)J, variation selectors etc.
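To make the distinction concrete, here is a rough Python sketch of the two
behaviours I am asking about. It is only a toy, not the Unicode Collation
Algorithm: the names IGNORABLES and sort_key are invented for illustration,
the set covers only ZWNJ, ZWJ and the variation selectors U+FE00..U+FE0F
rather than the full Default_Ignorable_Code_Point property, and the "key" is
just a list of code points rather than real collation weights.

```python
# Toy illustration only: this is NOT the Unicode Collation Algorithm.
# IGNORABLES is a small hand-picked subset of default ignorable code points
# (ZWNJ, ZWJ, VS1..VS16), assumed here purely for the example.
IGNORABLES = {0x200C, 0x200D} | set(range(0xFE00, 0xFE10))


def sort_key(text, ignore_ignorables=True):
    """Return a naive code-point sort key, optionally dropping ignorables."""
    if ignore_ignorables:
        text = "".join(ch for ch in text if ord(ch) not in IGNORABLES)
    return [ord(ch) for ch in text]


plain = "ab"
with_zwnj = "a\u200cb"

# Default-style behaviour: ZWNJ contributes nothing, so the keys are equal.
print(sort_key(plain) == sort_key(with_zwnj))

# "Tailored" behaviour: ZWNJ is kept, so the two strings are distinguished.
print(sort_key(plain, False) == sort_key(with_zwnj, False))
```

With ignoring switched on, the two strings get identical keys; with it
switched off they do not. The question is which of these a conformant
implementation is permitted, or required, to do.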
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/