From: Mark Davis
Date: 2009-3-28
I suggest the following changes in UAX 31.
1. Fix ambiguous variables
There are suggested rules for using ZWJ and ZWNJ in
http://unicode.org/draft/reports/tr31/tr31.html#Layout_and_Format_Control_Characters
Those rules use the variable $L for two different
entities: Left Joining, and Letter (for Indic). While
they are in separate contexts, it would be much clearer
if we didn't have the overlap.
There are a few possible alternatives; I suggest:
- For the Joining specifications of ZWJ/ZWNJ, change
$L, $R to $LJ, $RJ
2. Add Default Ignorable Code Points to Table 4
(Candidate Characters for Exclusion from Identifiers)
In
http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments,
add a row:
[:Default_Ignorable_Code_Point=True:] Default Ignorable Code Points (See Section 2.3)
[Rationale: we already say that DIs should be
excluded, with certain exceptions in Section 2.3, which
has a lot of detail on the topic. This just makes that
relationship more visible.]
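To illustrate how such a row would interact with the Section 2.3 exceptions, here is a minimal Python sketch. The ranges below are only a small, hand-picked sample of Default_Ignorable_Code_Point (the full set comes from DerivedCoreProperties.txt), and the names DEFAULT_IGNORABLE_SAMPLE, CONTEXT_EXCEPTIONS and classify are hypothetical, introduced here for illustration only.

```python
# Illustrative subset only: the real Default_Ignorable_Code_Point set is
# defined in DerivedCoreProperties.txt and is considerably larger.
DEFAULT_IGNORABLE_SAMPLE = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x200B, 0x200F),  # ZWSP, ZWNJ, ZWJ, LRM, RLM
    (0x2060, 0x2064),  # WORD JOINER and invisible operators
    (0xFE00, 0xFE0F),  # variation selectors
    (0xFEFF, 0xFEFF),  # ZERO WIDTH NO-BREAK SPACE
]

# ZWNJ and ZWJ are the exceptions discussed in Section 2.3: they are
# flagged for a context check instead of being excluded outright.
CONTEXT_EXCEPTIONS = {0x200C, 0x200D}


def classify(ch: str) -> str:
    """Classify one character against the sample exclusion row."""
    cp = ord(ch)
    if cp in CONTEXT_EXCEPTIONS:
        return "needs-context-check"
    if any(lo <= cp <= hi for lo, hi in DEFAULT_IGNORABLE_SAMPLE):
        return "exclude"
    return "not-default-ignorable"
```

The actual context rules for ZWJ/ZWNJ are not modeled here; the sketch only shows that the new row and the Section 2.3 exceptions would have to be applied together.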
3. Add Unicode 5.2 Characters to Table 3/4
(Candidates for Inclusion/Exclusion)
Add to Table 4 (Exclusion) the following scripts (this
is a rough cut, so feedback is welcome):
Archaic / Historic
- Old Turkic
- Old South Arabian
- Imperial Aramaic
- Inscriptional Parthian
- Inscriptional Pahlavi
- Avestan
- Egyptian Hieroglyphs
- Javanese
Limited Use
- Samaritan
- Kaithi
- Tai Viet
- Bamum
- Lisu
Add the following to Table 5. Recommended Scripts
4. Add U+0640 ( ـ ) ARABIC TATWEEL as a candidate
character for exclusion.
We have the following tables in
http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments
- Table 3. Candidate Characters for Inclusion in
Identifiers
- Table 4. Candidate Characters for Exclusion from
Identifiers
A. I suggest adding a row to Table 4:
[\u0640] Arabic Tatweel
B. Alternatively, one could break Table 4 into two
tables:
- Table 4a. Candidate Characters Identified by Code Point
for Exclusion from Identifiers (containing only Tatweel)
- Table 4b. Candidate Characters Identified by Property
for Exclusion from Identifiers (containing the current
Table 4 contents)
(Ken favors a two-table solution; I think it is
simpler with one.)
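Whichever table layout is chosen, the effect on an implementation is the same: one extra code point is removed from the default identifier set. A minimal Python sketch of that effect follows. It assumes Python's str.isidentifier() as a stand-in for the default identifier definition; EXCLUDED_CODE_POINTS and is_candidate_identifier are hypothetical names used only for this illustration.

```python
# Hypothetical exclusion set holding the single row proposed above,
# identified by code point rather than by property.
EXCLUDED_CODE_POINTS = {0x0640}  # ARABIC TATWEEL


def is_candidate_identifier(s: str) -> bool:
    """Default identifier check, minus the excluded code points.

    str.isidentifier() stands in for the default identifier definition;
    it reflects the Unicode version bundled with the interpreter.
    """
    return s.isidentifier() and not any(
        ord(c) in EXCLUDED_CODE_POINTS for c in s
    )


# BEH + BEH is accepted; BEH + TATWEEL + BEH is rejected by the filter.
print(is_candidate_identifier("\u0628\u0628"))
print(is_candidate_identifier("\u0628\u0640\u0628"))
```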
5. Add Characters from IDNA Tables Document
The IDNA tables document (draft) contains certain
exceptions that we should review, in
http://tools.ietf.org/html/draft-ietf-idnabis-tables#section-2.6.
The following characters are not in the Unicode
identifier definition XID_Continue (after subtracting
characters that are affected by case folding and NFKC),
nor are they in the Candidates for Inclusion:
Greek And Coptic - Numeral signs
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
Arabic - Signs for Sindhi
U+06FD ( ۽ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE ( ۾ ) ARABIC SIGN SINDHI POSTPOSITION MEN
Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT
Of them, I'd recommend that we add U+30FB ( ・ ) KATAKANA
MIDDLE DOT to Table 3 (Candidate Characters for Inclusion
in Identifiers), since it serves a function somewhat like
an underbar. The others have gotten into the IDNA
specification (draft), but there doesn't seem to be any
compelling rationale for that. However, others may know
more about them and may present good reasons for their
inclusion in UAX#31.
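For anyone who wants to check the characters listed above against XID_Continue, here is a rough Python sketch. It relies on str.isidentifier(), which tests XID_Start/XID_Continue using the Unicode data bundled with the interpreter. That data is newer than the version discussed here, so results for individual characters may differ from the state described in this note. The helper name is_xid_continue and the CANDIDATES table are assumptions made for the example.

```python
def is_xid_continue(ch: str) -> bool:
    # "a" is a valid identifier start, so the outcome depends only on
    # whether ch is allowed to continue an identifier.
    return ("a" + ch).isidentifier()


# The five characters from the IDNA exceptions listed above.
CANDIDATES = {
    0x0375: "GREEK LOWER NUMERAL SIGN",
    0x06FD: "ARABIC SIGN SINDHI AMPERSAND",
    0x06FE: "ARABIC SIGN SINDHI POSTPOSITION MEN",
    0x0F0B: "TIBETAN MARK INTERSYLLABIC TSHEG",
    0x30FB: "KATAKANA MIDDLE DOT",
}

for cp, name in CANDIDATES.items():
    print(f"U+{cp:04X} {name}: XID_Continue={is_xid_continue(chr(cp))}")
```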
Note that the following is part of Pattern_Syntax, and
thus not part of XID_Continue. Pattern_Syntax is
immutable, and required to be disjoint from identifiers,
and yet this character was added in that range, which
was probably a mistake.
Supplemental Punctuation - Medievalist punctuation
U+2E2F ( ⸯ ) VERTICAL TILDE
Of the characters that Unicode has, and IDNA
doesn't, I don't see any need to make any changes. Some
of them are principled differences, like the omission of
connector punctuation, and others are not, like the
omission of Hangul Jamo.
5.1 Background
For completeness, the following lists the exceptions in
the 05 version of that document, organized by type.
*PVALID: // would otherwise have been DISALLOWED
00DF; PVALID # LATIN SMALL LETTER SHARP S
03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
*CONTEXTO: // would otherwise have been DISALLOWED
00B7; CONTEXTO # MIDDLE DOT
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
05F3; CONTEXTO # HEBREW PUNCTUATION GERESH
05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM
30FB; CONTEXTO # KATAKANA MIDDLE DOT
*CONTEXTO: // would otherwise have been PVALID
002D; CONTEXTO # HYPHEN-MINUS
02B9; CONTEXTO # MODIFIER LETTER PRIME
0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO
0661; CONTEXTO # ARABIC-INDIC DIGIT ONE
0662; CONTEXTO # ARABIC-INDIC DIGIT TWO
0663; CONTEXTO # ARABIC-INDIC DIGIT THREE
0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR
0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE
0666; CONTEXTO # ARABIC-INDIC DIGIT SIX
0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN
0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT
0669; CONTEXTO # ARABIC-INDIC DIGIT NINE
06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO
06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE
06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO
06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE
06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR
06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE
06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX
06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE
0483; CONTEXTO # COMBINING CYRILLIC TITLO
3005; CONTEXTO # IDEOGRAPHIC ITERATION MARK
303B; CONTEXTO # VERTICAL IDEOGRAPHIC ITERATION MARK
*DISALLOWED: // would otherwise have been PVALID
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
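The listing above has a simple "code; status # name" shape, so it is easy to load for comparison against other sets. A minimal sketch, assuming exactly that line format and using only a few sample lines copied from the listing (parse_exceptions and SAMPLE are hypothetical names):

```python
from collections import defaultdict

# A few lines copied from the listing above, including one group header.
SAMPLE = """\
*PVALID: // would otherwise have been DISALLOWED
00DF; PVALID # LATIN SMALL LETTER SHARP S
00B7; CONTEXTO # MIDDLE DOT
002D; CONTEXTO # HYPHEN-MINUS
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
"""


def parse_exceptions(text: str) -> dict:
    """Group exception lines by status: {status: {code_point: name}}."""
    by_status = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and the "*STATUS: // ..." headers
        fields, _, name = line.partition("#")
        code, status = (f.strip() for f in fields.split(";"))
        by_status[status][int(code, 16)] = name.strip()
    return dict(by_status)


exceptions = parse_exceptions(SAMPLE)
for status, entries in exceptions.items():
    print(status, [f"U+{cp:04X}" for cp in sorted(entries)])
```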
5.2 Characters in IDNA draft
Here is the current set, as of the current draft and
Unicode 5.1. You can paste it into http://unicode.org/cldr/utility/list-unicodeset.jsp
to explore it, or to compare it against XID_Continue.
[\-0-9a-z·ß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžƀƃƅƈƌƍƒƕƙ
-ƛƞơƣƥƨƪƫƭưƴƶƹ-ƻƽ-ǃǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ
-ʯʹ-ˁˆ-ˑˬˮ̀-̿͂͆-͎͐-ͯͱͳ͵ͷͻ-ͽΐά-ώϗϙϛϝϟϡϣϥϧϩϫϭϯϳϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁ҃-҇ҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣՙա
-ֆ֑-ׇֽֿׁׂׅׄא-תװ-״ؐ-ؚء-ٞ٠-٩ٮ-ٴٹ-ۓە-ۜ۟-۪ۨ-ۿܐ-݊ݍ-ޱ߀-ߵߺँ-ह़-्ॐ-॔ॠ-ॣ०-९ॱॲॻ-ॿঁ-
ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗৠ-ৣ০-ৱਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਵਸਹ਼ਾ-ੂੇੈੋ-੍ੑੜ੦-ੵઁ-ઃઅ-ઍએ-ઑઓ
-નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍ୖୗୟ-ୣ୦-୯ୱஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந
-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఁ-ఃఅ-ఌఎ-ఐఒ-నప-ళవ-హఽ-ౄె-ైొ-్ౕౖౘౙౠ-ౣ౦-౯ಂಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼
-ೄೆ-ೈೊ-್ೕೖೞೠ-ೣ೦-೯ംഃഅ-ഌഎ-ഐഒ-നപ-ഹഽ-ൄെ-ൈൊ-്ൗൠ-ൣ൦-൯ൺ-ൿංඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟෲෳ
ก-าิ-ฺเ-๎๐-๙ກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-າິ-ູົ-ຽເ-ໄໆ່-ໍ໐-໙ༀ་༘༙༠-༩༹༵༷༾-གང-ཇཉ-ཌཎ-དན
-བམ-ཛཝ-ཨཪ-ཬཱིེུ-ྀྂ-྄྆-ྋྐ-ྒྔ-ྗྙ-ྜྞ-ྡྣ-ྦྨ-ྫྭ-ྸྺ-ྼ࿆က-၉ၐ-႙ა-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ
-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፟ᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-ឳា-៓ៗៜ៝០
-៩᠐-᠙ᠠ-ᡷᢀ-ᢪᤀ-ᤜᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦩᦰ-ᧉ᧐-᧙ᨀ-ᨛᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᮪ᮮ-᮹ᰀ-᰷᱀-᱉ᱍ-ᱽᴀ-ᴫᴯᴻᵎᵫ-ᵷᵹ-
ᶚ᷀-᷿ᷦ᷾ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ
-ẙẜẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰὲὴὶὸὺὼᾰᾱᾶῆῐ
-ῒῖῗῠ-ῢῤ-ῧῶⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⴀ
-ⴥⴰ-ⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿⸯ々-〇〪-〭〱-〵〻〼ぁ-ゖ゙゚ゝゞァ-ヾㄅ-ㄭㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘫꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙣꙥꙧꙩꙫꙭ-꙯꙼꙽ꙿꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꜗ-ꜟꜣꜥꜧꜩꜫꜭꜯ-
ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞈꞌꟻ-ꠧꡀ-ꡳꢀ-꣄꣐-꣙꤀-꤭ꤰ-꥓ꨀ-ꨶꩀ-ꩍ꩐-꩙가
-힣﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧-﨩ﬞ︠-︦ﹳ𐀁-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐇽𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌰
-𐍀𐍂-𐍉𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐨-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿𐤀-𐤕𐤠-𐤹𐨀-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨳𐨸-𐨿𐨺𒀀-𒍮𠀀-𪛖]