L2/09-232R
From: Mark Davis
Date: 2009-08-14 (revised)
Proposal
- In U5.2, add a section to UAX#29 describing that
the following characters are common candidates for
tailoring to add to MidLetter.
- [\-\u058A\u1806\u2010\u2011\u2E17\u30A0\uFE63\uFF0D][\u058A\u0F0B\u30A0\u30FB]
U+002D ( - ) HYPHEN-MINUS
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2011 ( ‑ ) NON-BREAKING HYPHEN
U+2E17 ( ⸗ ) DOUBLE OBLIQUE HYPHEN
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS
- In U5.2, add a section to UAX#29 discussing name
validation characters, and giving guidelines for
adding characters to the word characters
allowed above.
- In U5.2, add pointers in UAX#31 and in the
word-break text of UAX#29 indicating the relationship
between identifiers and words (and that the
character sets are not the same).
- Add a test for consistency between the WB
properties and Table 3 (with the known exceptions)
to the invariant tests.
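The effect of the MidLetter tailoring in the first item can be sketched in Python. This is a deliberately simplified stand-in for the UAX#29 word-break rules, not an implementation of them: it only models the idea that a MidLetter character does not cause a break when it sits between two letters. The names `split_words` and `HYPHEN_CANDIDATES`, and the regex-based notion of "letter", are assumptions made for illustration.

```python
import re

# The candidate characters listed in the proposal (hyphens, tsheg, middle dot).
HYPHEN_CANDIDATES = (
    "\u002D\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D"
)

# Rough approximation of "letter": any \w character that is not a digit or "_".
# Combining marks are NOT covered, so this is only suitable for simple examples.
LETTER = r"[^\W\d_]"


def split_words(text, mid_letters=""):
    """Return word-like runs; characters in mid_letters join adjacent letters."""
    if mid_letters:
        mid = "[" + re.escape(mid_letters) + "]"
        # letters, optionally followed by (one mid-letter + letters) groups
        pattern = f"{LETTER}+(?:{mid}{LETTER}+)*"
    else:
        pattern = f"{LETTER}+"
    return re.findall(pattern, text)


if __name__ == "__main__":
    sample = "James Smith-Faley"
    print(split_words(sample))                     # untailored
    print(split_words(sample, HYPHEN_CANDIDATES))  # tailored
```

Without the tailoring, "Smith-Faley" is split at the hyphen; with it, the hyphen is kept inside the word only when letters appear on both sides, so a trailing or leading hyphen still ends the word.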
====
For entry field validation, implementations sometimes
need to know which characters can occur in personal
names. While it is a bit fuzzy exactly what this means,
they want to distinguish between characters like those
in "James Smith-Faley, Jr." and those in "!#@♥≠". Note
that it is important to be reasonably lenient: it is
extremely annoying for people not to be able to add
legitimate names, like "di Silva", because those names
have characters like space.
Typically, these personal name validations should not be
language-specific; I might be using a website in a
language other than the one for my name, for example.
While a more sophisticated validation might use context
among characters, a basic validation just wants to know
"what characters can be part of names?". The text should
explain that:
- It is only a guideline, and may need tailoring
for different environments
- It is a lenient, non-language-specific set - for
language-specific characters one should see CLDR.
- Mention characters:
- [,.[:whitespace:]]
U+002C ( , ) COMMA
U+002E ( . ) FULL STOP
[:whitespace:]
- It includes characters that may not be
appropriate for identifiers, and those that would
not be parts of words.
- It does not include contextual tests
- Additional tests may be needed in cases where
security is at issue.
- The set can be narrowed if name fields are split
out. For example, "," may not be necessary if titles
are split out; if titles are not allowed, "." may
not be necessary.
- The word characters contain some characters that
may be part of words in a broad sense, such as "c:a"
in Swedish or a dictionary word containing
hyphenation points, that might not normally be part
of names.
- Explain the use of NFKC in name validation
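A basic, non-contextual check along these lines might look like the following Python sketch. It is an approximation under stated assumptions, not the proposed set itself: the standard library has no Word_Break property data, so letters and marks (General_Category L* and M*) stand in for the word characters, and `ALLOWED_PUNCTUATION` and `is_plausible_name` are hypothetical names. NFKC is applied first as one possible use of normalization here, so that compatibility variants such as fullwidth letters are folded before the test.

```python
import unicodedata

# Comma and full stop come from the guideline above; the apostrophes and
# hyphens are an assumption standing in for the (tailored) word characters.
ALLOWED_PUNCTUATION = set(",.'\u2019-\u2010\u2011\u058A")


def is_plausible_name(text):
    """Lenient, language-independent check; no contextual or security tests."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized.strip():
        return False  # empty or whitespace-only input
    for ch in normalized:
        if ch.isspace() or ch in ALLOWED_PUNCTUATION:
            continue
        # Accept letters (L*) and combining marks (M*).
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        return False
    return True


if __name__ == "__main__":
    for candidate in ["James Smith-Faley, Jr.", "di Silva", "!#@\u2665\u2260"]:
        print(repr(candidate), is_plausible_name(candidate))
```

The set can be narrowed as described above, for example by dropping "," and "." from `ALLOWED_PUNCTUATION` when titles are collected in a separate field.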
Background Information
Characters added by Word Boundaries
Basic Latin - ASCII punctuation and symbols
Latin 1 Supplement - Latin-1 punctuation and symbols
Greek And Coptic - Punctuation
Hebrew - Additional punctuation
U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
General Punctuation - General punctuation
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
U+203F ( ‿ ) UNDERTIE
U+2040 ( ⁀ ) CHARACTER TIE
U+2054 ( ⁔ ) INVERTED UNDERTIE
Vertical Forms - Glyphs for vertical variants
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
CJK Compatibility Forms - Glyphs for vertical variants
U+FE33 ( ︳ ) PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ( ︴ ) PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
CJK Compatibility Forms - Overscores and underscores
Small Form Variants - Small form variants
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
U+FF3F ( ＿ ) FULLWIDTH LOW LINE
[\p{alpha}\p{WB=Extend}\p{WB=FO}\p{WB=LE}\p{WB=ML}\p{WB=MB}\p{WB=EX}-\p{cf}]
[[:L:][:Nl:][:Mn:][:Mc:][\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3
\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB][:Pc:]
-[:Pattern_Syntax:]
-[:Pattern_White_Space:]]]
Here are Word characters minus Identifier characters.
Basic Latin - ASCII punctuation and symbols
Greek And Coptic - Punctuation
Cyrillic - Historic miscellaneous
U+0488 ( ҈ ) COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
U+0489 ( ҉ ) COMBINING CYRILLIC MILLIONS SIGN
Arabic - Koranic annotation signs
U+06DE ( ۞ ) ARABIC START OF RUB EL HIZB
General Punctuation - General punctuation
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
Combining Diacritical Marks For Symbols - Enclosing diacritics
U+20DD ( ⃝ ) COMBINING ENCLOSING CIRCLE
U+20DE ( ⃞ ) COMBINING ENCLOSING SQUARE
U+20DF ( ⃟ ) COMBINING ENCLOSING DIAMOND
U+20E0 ( ⃠ ) COMBINING ENCLOSING CIRCLE BACKSLASH
Combining Diacritical Marks For Symbols - Additional enclosing diacritics
U+20E2 ( ⃢ ) COMBINING ENCLOSING SCREEN
U+20E3 ( ⃣ ) COMBINING ENCLOSING KEYCAP
U+20E4 ( ⃤ ) COMBINING ENCLOSING UPWARD POINTING TRIANGLE
Enclosed Alphanumerics - Circled Latin letters
U+24B6 ( Ⓐ ) CIRCLED LATIN CAPITAL LETTER A
..
U+24E9 ( ⓩ ) CIRCLED LATIN SMALL LETTER Z
Supplemental Punctuation - Medievalist punctuation
Cyrillic Extended B - Combining numeric signs
U+A670 ( ꙰ ) COMBINING CYRILLIC TEN MILLIONS SIGN
U+A671 ( ꙱ ) COMBINING CYRILLIC HUNDRED MILLIONS SIGN
U+A672 ( ꙲ ) COMBINING CYRILLIC THOUSAND MILLIONS SIGN
Vertical Forms - Glyphs for vertical variants
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
Small Form Variants - Small form variants
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
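Several entries in the list above are enclosing marks or symbols, which fall outside the general categories named in the identifier expression ([:L:], [:Nl:], [:Mn:], [:Mc:], [:Pc:]). The following Python sketch checks that for a few of them with the standard `unicodedata` module. The helper `in_identifier_categories` is a hypothetical name, and the check deliberately ignores the explicit character list and the Pattern_Syntax and Pattern_White_Space subtractions, which the standard library does not expose.

```python
import unicodedata

# General categories that appear in the identifier expression:
# all letter categories (L), plus Nl, Mn, Mc and Pc.
IDENTIFIER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Pc"}


def in_identifier_categories(ch):
    """True if the character's General_Category is one of those listed above."""
    return unicodedata.category(ch) in IDENTIFIER_CATEGORIES


if __name__ == "__main__":
    # A few code points taken from the list of word-only characters.
    for cp in (0x0488, 0x20DD, 0x24B6, 0xA670):
        ch = chr(cp)
        print(f"U+{cp:04X}", unicodedata.category(ch), in_identifier_categories(ch))
```

This only accounts for the category-based part of the difference; punctuation entries such as U+2024 depend on the Word_Break property values, which would need the Unicode Character Database files or a library that exposes them.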
And the Identifier Characters minus the Word Characters
Armenian - Punctuation
Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - Katakana punctuation
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT