Hello,
Re #1, in the usual regex notation the ^ symbol does indeed denote a start-of-line anchor, and the corresponding rules could use sot instead.
Re #2, that was an oversight, and will be addressed in the Proposed Update of UAX #29 for Unicode 10.0.
Re #3 and #4, both were addressed before the release of Version 9.0.
For suggestions such as #1, which require review by the UTC, please remember to use the feedback reporting form.
Thank you,
L.
-----Original Message-----
From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Daniel Bünzli
Sent: Tuesday, June 21, 2016 9:02 AM
To: Unicode Public <unicode_at_unicode.org>
Subject: UAX 29 9.0.0 new emoji flag rules questions and comments
I have a few questions/comments about the new emoji segmentation rules in 9.0.0.
1. I have trouble understanding what the ^ symbol means in these rules:
http://www.unicode.org/reports/tr29/proposed.html#GB8a
http://www.unicode.org/reports/tr29/proposed.html#WB15
Does it correspond to the regexp start-of-line (SOL) symbol? If so, SOL is a bit ambiguous in that context: it could also mean that you need to match starts of lines, which is a whole different business. Couldn't it simply be replaced by sot?
2. Besides, given that the GB8* rules require you to be able to count an odd number of RI, it seems to me that the sentence "Grapheme cluster boundaries can be easily tested by looking at immediately adjacent characters." is no longer accurate.
3. There are two rules named GB8c.
4. In §1.1 the link to UTS18 is broken (#RegEx does not exist in UAX 41).
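To illustrate point 2, here is a minimal Python sketch of why the RI rules cannot be decided from the two adjacent characters alone. It is only an illustration under stated assumptions, not the reference algorithm: the helper names is_ri and ri_boundary are hypothetical, Regional_Indicator is taken to be the range U+1F1E6..U+1F1FF, and only the RI pairing rule is modelled (every other rule is ignored).

```python
def is_ri(ch: str) -> bool:
    """True if ch is a Regional_Indicator (assumed range U+1F1E6..U+1F1FF)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_boundary(text: str, i: int) -> bool:
    """Return True if there is a boundary between text[i-1] and text[i].

    Hypothetical helper: both neighbours must be RI, and only the
    RI pairing rule is considered.
    """
    if not (0 < i < len(text) and is_ri(text[i - 1]) and is_ri(text[i])):
        raise ValueError("position is not between two RI characters")

    # Count the run of RI characters ending at text[i-1]. This scan can
    # reach arbitrarily far back, which is the point being made.
    count = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):
        count += 1
        j -= 1

    # Odd run: text[i-1] opens a pair that text[i] completes, so no break.
    # Even run: the preceding RI are already paired, so break here.
    return count % 2 == 0


flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"  # RI letters F R D E
print([ri_boundary(flags, i) for i in (1, 2, 3)])
```

The same pair of adjacent RI characters gives a different answer depending on how many RI precede it, so the test has to look back over the whole run.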
Best,
Daniel
Received on Tue Jun 21 2016 - 19:32:56 CDT