From: Mark Davis (mark.davis@icu-project.org)
Date: Tue Aug 14 2007 - 20:20:12 CDT
The characters that cause problems are not a fixed list; they are
programmatically detected and removed when the data is derived. This is done
by looking at the NFKC version of each character that qualifies as part of
an identifier (according to that particular version of Unicode) and removing
the character from XID when the result would not be consistent. That is,
1. when the character is XID_Start, the NFKC sequence has to be
<XID_Start XID_Continue*> or the character is removed
2. when the character is XID_Continue, the NFKC sequence has to be
<XID_Continue+> or the character is removed.
Particular characters may be re-added (grandfathered) to ID to keep the
definition backwards compatible.
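
As a rough illustration of the two rules above, here is a minimal Python
sketch. It is not the actual derivation script. The helper names are made up
for this example, and the base start/continue tests are approximated from
General_Category alone (Lu, Ll, Lt, Lm, Lo, Nl for start; those plus Mn, Mc,
Nd, Pc for continue). It ignores Other_ID_Start, Other_ID_Continue, the
Pattern_Syntax and Pattern_White_Space exclusions, and any grandfathered
characters, so results can differ from the published data files.

```python
import unicodedata

# Approximation (assumption): General_Category only, no Other_ID_* additions,
# no Pattern_* exclusions, no grandfathered characters.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """Approximate 'may start an identifier' test for one code point."""
    return unicodedata.category(ch) in START_CATEGORIES


def is_id_continue(ch):
    """Approximate 'may continue an identifier' test for one code point."""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def keeps_xid_start(ch):
    """Rule 1: the NFKC form must look like <Start Continue*>."""
    if not is_id_start(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return (
        len(nfkc) > 0
        and is_id_start(nfkc[0])
        and all(is_id_continue(c) for c in nfkc[1:])
    )


def keeps_xid_continue(ch):
    """Rule 2: the NFKC form must look like <Continue+>."""
    if not is_id_continue(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return len(nfkc) > 0 and all(is_id_continue(c) for c in nfkc)


if __name__ == "__main__":
    for ch in ("A", "\u037a", "\u0e33"):
        print(
            "U+%04X" % ord(ch),
            unicodedata.name(ch),
            "start:", keeps_xid_start(ch),
            "continue:", keeps_xid_continue(ch),
        )
```

With this sketch, U+037A fails both rules because its NFKC form begins with
a space, while U+0E33 fails only rule 1 because its NFKC form begins with a
combining mark followed by a letter.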
Does that help?
Mark
On 8/14/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
>
> I'm trying to locate the precise specification for the
> XID_Start and XID_Continue properties. According to
>
> http://unicode.org/Public/UNIDATA/UCD.html
>
> they are derived properties, so there should be an
> algorithm somewhere describing how they are computed
> (given other properties). The UCD says that the
> specification is in UAX#31, which says I should
> read
>
> http://unicode.org/reports/tr31/#NFKC_Modifications
>
> However, looking at 5.1, I cannot find a precise
> specification of these properties. For example,
> 5.1.2 says "Certain characters...", but does not
> seem to provide a complete list of such characters.
> It ends with "In particular, the following four
> characters...". Again, that reads like an example -
> is it meant as a complete specification?
>
> Likewise, 5.1.3 talks about "certain Arabic presentation
> forms", without giving a complete list of precisely which
> are excluded from XID_Start and XID_Continue.
>
> Any insights appreciated,
>
> Martin
>
>