From: Rick McGowan (rick@unicode.org)
Date: Wed Mar 05 2008 - 10:53:51 CST
There is a new draft of the Proposed Update to Unicode Standard Annex
#29 Unicode Text Segmentation, reflecting changes authorized in the
last UTC meeting:
* Sentence Segmentation. Revised the contents of SContinue,
characters that 'continue' a sentence.
* Word Segmentation. Added Newline, and rules WB3a and WB3b to break
words within other newline sequences.
* Grapheme Cluster Segmentation.
  * Added Prepend and rule GB9b to handle Thai and Lao.
  * Major revision of Section 3 Grapheme Cluster Boundaries.
    Includes change of name to extended grapheme cluster, clearer
    distinction from legacy grapheme clusters, and significant
    reordering and enhancement of the text.
  * Note that the GraphemeBreakTest file in the UCD now tests the
    extended grapheme clusters, since it is the recommended choice.
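As a rough illustration of the newline handling mentioned under Word
Segmentation, the sketch below marks the offsets where a break is forced
around newline characters while keeping CR LF together. It is not the
full word-break algorithm. The function name newline_boundaries and the
contents of the NEWLINE set are assumptions made for this example; the
authoritative class membership is in the data files, and the rule
comments reflect my reading of the draft (WB3: CR x LF, WB3a: break
after a newline character, WB3b: break before one).

```python
# Assumed members of the Newline class; verify against WordBreakProperty.txt.
NEWLINE = {"\u000B", "\u000C", "\u0085", "\u2028", "\u2029"}
CR, LF = "\r", "\n"


def is_newline_like(ch):
    """True for CR, LF, or a member of the assumed Newline set."""
    return ch == CR or ch == LF or ch in NEWLINE


def newline_boundaries(text):
    """Return interior offsets where a break is forced around newlines.

    Only the newline-related rules are modelled; every other word-break
    rule is ignored, so this is not a word segmenter.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == CR and cur == LF:
            continue  # WB3: never break inside CR LF
        if is_newline_like(prev) or is_newline_like(cur):
            breaks.append(i)  # WB3a (after) or WB3b (before)
    return breaks


if __name__ == "__main__":
    print(newline_boundaries("ab\r\ncd"))
```

For "ab\r\ncd" the sketch reports offsets 2 and 4: one break before the
CR LF pair and one after it, with none between CR and LF.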
The UAX document is at http://www.unicode.org/reports/tr29/tr29-12.html.
The data files are in http://www.unicode.org/Public/5.1.0/ucd/auxiliary/.
The HTML charts are at:
http://www.unicode.org/Public/5.1.0/ucd/auxiliary/GraphemeBreakTest-5.1.0d28.html
http://www.unicode.org/Public/5.1.0/ucd/auxiliary/WordBreakTest-5.1.0d26.html
http://www.unicode.org/Public/5.1.0/ucd/auxiliary/LineBreakTest-5.1.0d30.html
http://www.unicode.org/Public/5.1.0/ucd/auxiliary/SentenceBreakTest-5.1.0d26.html
(The d numbers may be updated over the next month, so if these links
don't work, go first to the directory.)
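For readers who want to check an implementation against the plain-text
test files in that directory, here is a minimal sketch of reading one
test line. It assumes the usual layout of those files: code points in
hexadecimal separated by the marks "÷" (break) and "×" (no break), with
a trailing comment introduced by "#". The function name
parse_break_test_line is hypothetical, and the layout should be
confirmed against the header of the file you download.

```python
def parse_break_test_line(line):
    """Parse one test line into (text, break_offsets).

    Offsets count code points; offset i is the position before code
    point i, and len(text) is the position at the end of the string.
    Returns None for blank or comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    tokens = data.split()
    marks = tokens[0::2]        # the break / no-break marks
    code_points = tokens[1::2]  # hex code points between the marks
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    breaks = [i for i, mark in enumerate(marks) if mark == "\u00F7"]
    return text, breaks


if __name__ == "__main__":
    sample = "\u00F7 0020 \u00D7 0308 \u00F7 0020 \u00F7\t# sample line"
    print(parse_break_test_line(sample))
```

The returned offsets can then be compared with the boundaries your own
segmenter produces for the same text.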
Unicode 5.1.0 is currently in the pre-publication phase and is due
for release at the end of March 2008. No more substantive changes
are planned, beyond those already approved by the Unicode Technical
Committee. However, if you have editorial comments on the text of
Unicode 5.1.0, including this document, please report them via the online
reporting form (http://www.unicode.org/reporting.html).