UAX 31 Changes

L2/09-109

From: Mark Davis
Date: 2009-3-28

I suggest the following changes in UAX 31.

1. Fix ambiguous variables

There are suggested rules for using ZWJ and ZWNJ in http://unicode.org/draft/reports/tr31/tr31.html#Layout_and_Format_Control_Characters

In those rules, the variable $L is used for two different entities: Left Joining, and Letter (for Indic). Although the two uses occur in separate contexts, it would be much clearer to avoid the overlap. There are a few possible alternatives; I suggest:

For the Joining specifications of ZWJ/ZWNJ, change $L, $R to $LJ, $RJ
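
To illustrate how the renamed variables read in practice, here is a minimal Python sketch. It assumes the joining rule for ZWNJ has the shape $LJ $T* ZWNJ $T* $RJ, where $T stands for transparent characters. The three-entry JOINING_TYPE table and the helper names (char_class, zwnj_allowed) are hypothetical and hand-picked for the example; a real implementation would load the full Joining_Type property from the UCD.

```python
import re

# Hand-picked sample of Joining_Type values (assumption: illustration only,
# not the full property data).
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF, Right_Joining
    "\u064E": "T",  # ARABIC FATHA, Transparent
}


def char_class(*types):
    """Build a regex character class from the sample table."""
    chars = [c for c, t in JOINING_TYPE.items() if t in types]
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# With the proposed names there is no clash with $L = Letter (Indic rules).
LJ = char_class("D", "L")  # $LJ: Dual_Joining or Left_Joining
RJ = char_class("D", "R")  # $RJ: Dual_Joining or Right_Joining
T = char_class("T")        # $T: Transparent
ZWNJ = "\u200C"

ZWNJ_JOINING = re.compile(f"{LJ}{T}*{ZWNJ}{T}*{RJ}")


def zwnj_allowed(text):
    """True if text contains a ZWNJ in the assumed joining context."""
    return ZWNJ_JOINING.search(text) is not None
```

The point of the sketch is only the naming: $LJ and $RJ can sit next to an Indic rule that uses $L for Letter without any ambiguity.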

2. Add Default Ignorable Code Points to Table 4 Candidate Characters for Exclusion from Identifiers

In http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments,

add a row:

[:Default_Ignorable_Code_Point=True:] Default Ignorable Code Points (See Section 2.3)

[Rationale: we already say that DIs should be excluded, with certain exceptions in Section 2.3, which has a lot of detail on the topic. This just makes that relationship more visible.]

3. Add Unicode 5.2 Characters to Table 3/4 (Candidates for Inclusion/Exclusion)

Add to Table 4 (Exclusion) the following scripts (this is a rough cut, so feedback is welcome):

Archaic / Historic

Old Turkic
Old South Arabian
Imperial Aramaic
Inscriptional Parthian
Inscriptional Pahlavi
Avestan
Egyptian Hieroglyphs
Javanese

Limited Use

Samaritan
Kaithi
Tai Viet
Bamum
Lisu

Add the following to Table 5. Recommended Scripts

Meetei Mayek
Tai Tham

4. Add U+0640 ( ـ ) ARABIC TATWEEL as a candidate character for exclusion.

We have the following tables in http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments

Table 3. Candidate Characters for Inclusion in Identifiers
Table 4. Candidate Characters for Exclusion from Identifiers

A. I suggest adding the following row to Table 4:

[\u0640] Arabic Tatweel

B. Alternatively, one could break Table 4 into two tables:

Table 4a. Candidate Characters Identified by Code Point for Exclusion from Identifiers

Containing only Tatweel

Table 4b. Candidate Characters Identified by Property for Exclusion from Identifiers

Containing the current Table 4 contents

(Ken favors a two table solution; I think it is simpler with one.)
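
As a rough illustration of what an exclusion by code point means for an implementation, the following Python sketch layers an exclusion set on top of the default identifier test. It relies on str.isidentifier(), which checks XID_Start/XID_Continue (plus the underscore); the names EXCLUDED_CODE_POINTS and is_restricted_identifier are hypothetical, and the set contains only the Tatweel proposed here.

```python
# Candidate characters for exclusion, identified by code point.
# Assumption: only Tatweel, as proposed in this section.
EXCLUDED_CODE_POINTS = {"\u0640"}  # ARABIC TATWEEL


def is_restricted_identifier(s):
    """True if s is a default identifier and uses no excluded character."""
    return s.isidentifier() and not (set(s) & EXCLUDED_CODE_POINTS)


# BEH + TATWEEL + BEH is a default identifier, but is rejected once
# Tatweel is excluded; BEH + BEH is accepted either way.
with_tatweel = "\u0628\u0640\u0628"
without_tatweel = "\u0628\u0628"
print(with_tatweel.isidentifier(), is_restricted_identifier(with_tatweel))
print(without_tatweel.isidentifier(), is_restricted_identifier(without_tatweel))
```

Whether Tatweel sits in Table 4 or in a separate Table 4a, the check is the same; the one-table versus two-table question only affects how the specification is presented.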

5. Add Characters from IDNA Tables Document

The IDNA tables document (draft) contains certain exceptions that we should review, in http://tools.ietf.org/html/draft-ietf-idnabis-tables#section-2.6.

The following characters are not in the Unicode identifier definition XID_Continue (after subtracting characters that are affected by case folding and NFKC), nor are they in the Candidates for Inclusion.

Greek And Coptic - Numeral signs
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN

Arabic - Signs for Sindhi
U+06FD ( ‎۽‎ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE ( ‎۾‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN

Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of them, I'd recommend that we add U+30FB ( ・ ) KATAKANA MIDDLE DOT to Table 3. Candidate Characters for Inclusion in Identifiers, since it serves a function somewhat like an underbar. The others have gotten into the IDNA specification (draft), but there doesn't seem to be any compelling rationale for that. However, others may know more about them and present good reasons for inclusion into UAX#31.

Note that the following is part of Pattern_Syntax, and thus not part of XID_Continue. Pattern_Syntax is immutable, and required to be disjoint from identifiers, and yet this character was added in that range, which was probably a mistake.

Supplemental Punctuation - Medievalist punctuation
U+2E2F ( ⸯ ) VERTICAL TILDE
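
For reviewers who want to inspect the characters listed in this section, here is a small Python sketch that prints the General_Category and name of each. General_Category is only a proxy: punctuation and symbol categories are not part of the default identifier definition, while U+2E2F is a modifier letter that is excluded because of Pattern_Syntax. The values come from whatever Unicode version the Python runtime ships, which will be later than the version discussed here; the helper name describe is hypothetical.

```python
import unicodedata

# Code points discussed in this section.
REVIEWED = [0x0375, 0x06FD, 0x06FE, 0x0F0B, 0x30FB, 0x2E2F]


def describe(cp):
    """Return (U+XXXX, General_Category, character name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


for cp in REVIEWED:
    print(*describe(cp))
```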

Of the characters that Unicode has, and IDNA doesn't, I don't see any need to make any changes. Some of them are principled differences, like the omission of connector punctuation, and others are not, like the omission of Hangul Jamo.

5.1 Background

For completeness, the following lists the exceptions in the 05 version of that document, organized by type.

*PVALID: // would otherwise have been DISALLOWED

   00DF; PVALID     # LATIN SMALL LETTER SHARP S
   03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
   06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND
   06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN
   0F0B; PVALID     # TIBETAN MARK INTERSYLLABIC TSHEG
   3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

*CONTEXTO: // would otherwise have been DISALLOWED
   00B7; CONTEXTO   # MIDDLE DOT
   0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)
   05F3; CONTEXTO   # HEBREW PUNCTUATION GERESH
   05F4; CONTEXTO   # HEBREW PUNCTUATION GERSHAYIM
   30FB; CONTEXTO   # KATAKANA MIDDLE DOT

*CONTEXTO: // would otherwise have been PVALID
   002D; CONTEXTO   # HYPHEN-MINUS
   02B9; CONTEXTO   # MODIFIER LETTER PRIME
   0660; CONTEXTO   # ARABIC-INDIC DIGIT ZERO
   0661; CONTEXTO   # ARABIC-INDIC DIGIT ONE
   0662; CONTEXTO   # ARABIC-INDIC DIGIT TWO
   0663; CONTEXTO   # ARABIC-INDIC DIGIT THREE
   0664; CONTEXTO   # ARABIC-INDIC DIGIT FOUR
   0665; CONTEXTO   # ARABIC-INDIC DIGIT FIVE
   0666; CONTEXTO   # ARABIC-INDIC DIGIT SIX
   0667; CONTEXTO   # ARABIC-INDIC DIGIT SEVEN
   0668; CONTEXTO   # ARABIC-INDIC DIGIT EIGHT
   0669; CONTEXTO   # ARABIC-INDIC DIGIT NINE
   06F0; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ZERO
   06F1; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ONE
   06F2; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT TWO
   06F3; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT THREE
   06F4; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FOUR
   06F5; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FIVE
   06F6; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SIX
   06F7; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SEVEN
   06F8; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT EIGHT
   06F9; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT NINE
   0483; CONTEXTO   # COMBINING CYRILLIC TITLO
   3005; CONTEXTO   # IDEOGRAPHIC ITERATION MARK
   303B; CONTEXTO   # VERTICAL IDEOGRAPHIC ITERATION MARK

*DISALLOWED: // would otherwise have been PVALID
   302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
   302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
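
The listing above has a regular shape (code point; status # name), so it is easy to load for comparison against other sets. The following Python sketch parses a three-line sample into a mapping from status to code points. The function name parse_exceptions and the SAMPLE excerpt are hypothetical; the sketch assumes every data line has exactly one semicolon and one # comment, as in the listing.

```python
from collections import defaultdict

# Short excerpt in the same format as the listing above.
SAMPLE = """\
*PVALID: // would otherwise have been DISALLOWED
   00DF; PVALID     # LATIN SMALL LETTER SHARP S
   00B7; CONTEXTO   # MIDDLE DOT
   302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
"""


def parse_exceptions(text):
    """Map each status to {code point: character name}."""
    by_status = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "*STATUS: // ..." group headers.
        if not line or line.startswith("*"):
            continue
        fields, _, name = line.partition("#")
        code, status = (f.strip() for f in fields.split(";"))
        by_status[status][int(code, 16)] = name.strip()
    return dict(by_status)


exceptions = parse_exceptions(SAMPLE)
print(sorted(exceptions))
```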

5.2 Characters in IDNA draft

Here is the current set, as of the current draft and Unicode 5.1. You can paste into http://unicode.org/cldr/utility/list-unicodeset.jsp to explore, or compare against XID_Continue.

[\-0-9a-z·ß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžƀƃƅƈƌƍƒƕƙ -ƛƞơƣƥƨƪƫƭưƴƶƹ-ƻƽ-ǃǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ -ʯʹ-ˁˆ-ˑˬˮ̀-̿͂͆-͎͐-ͯͱͳ͵ͷͻ-ͽΐά-ώϗϙϛϝϟϡϣϥϧϩϫϭϯϳϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁ҃-҇ҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣՙա -ֆ֑-ׇֽֿׁׂׅׄא-תװ-״ؐ-ؚء-ٞ٠-٩ٮ-ٴٹ-ۓە-ۜ۟-۪ۨ-ۿܐ-݊ݍ-ޱ߀-ߵߺँ-ह़-्ॐ-॔ॠ-ॣ०-९ॱॲॻ-ॿঁ- ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗৠ-ৣ০-ৱਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਵਸਹ਼ਾ-ੂੇੈੋ-੍ੑੜ੦-ੵઁ-ઃઅ-ઍએ-ઑઓ -નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍ୖୗୟ-ୣ୦-୯ୱஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந -பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఁ-ఃఅ-ఌఎ-ఐఒ-నప-ళవ-హఽ-ౄె-ైొ-్ౕౖౘౙౠ-ౣ౦-౯ಂಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼ -ೄೆ-ೈೊ-್ೕೖೞೠ-ೣ೦-೯ംഃഅ-ഌഎ-ഐഒ-നപ-ഹഽ-ൄെ-ൈൊ-്ൗൠ-ൣ൦-൯ൺ-ൿංඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟෲෳ ก-าิ-ฺเ-๎๐-๙ກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-າິ-ູົ-ຽເ-ໄໆ່-ໍ໐-໙ༀ་༘༙༠-༩༹༵༷༾-གང-ཇཉ-ཌཎ-དན -བམ-ཛཝ-ཨཪ-ཬཱིེུ-ྀྂ-྄྆-ྋྐ-ྒྔ-ྗྙ-ྜྞ-ྡྣ-ྦྨ-ྫྭ-ྸྺ-ྼ࿆က-၉ၐ-႙ა-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ -ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፟ᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-ឳា-៓ៗៜ៝០ -៩᠐-᠙ᠠ-ᡷᢀ-ᢪᤀ-ᤜᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦩᦰ-ᧉ᧐-᧙ᨀ-ᨛᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᮪ᮮ-᮹ᰀ-᰷᱀-᱉ᱍ-ᱽᴀ-ᴫᴯᴻᵎᵫ-ᵷᵹ- ᶚ᷀-᷿ᷦ᷾ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ -ẙẜẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰὲὴὶὸὺὼᾰᾱᾶῆῐ -ῒῖῗῠ-ῢῤ-ῧῶ‌‍ⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⴀ -ⴥⴰ-ⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿⸯ々-〇〪-〭〱-〵〻〼ぁ-ゖ゙゚ゝゞァ-ヾㄅ-ㄭㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘫꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙣꙥꙧꙩꙫꙭ-꙯꙼꙽ꙿꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꜗ-ꜟꜣꜥꜧꜩꜫꜭꜯ- ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞈꞌꟻ-ꠧꡀ-ꡳꢀ-꣄꣐-꣙꤀-꤭ꤰ-꥓ꨀ-ꨶꩀ-ꩍ꩐-꩙가 -힣﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧-﨩ﬞ︠-︦ﹳ𐀁-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐇽𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌰 -𐍀𐍂-𐍉𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐨-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿𐤀-𐤕𐤠-𐤹𐨀-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨳𐨸-𐨿𐨺𒀀-𒍮𠀀-𪛖]