Public Review Issue #12

The Unicode Technical Committee is looking for feedback on the common usage of certain punctuation characters, especially from those familiar with non-Latin writing systems such as Arabic, Armenian, Syriac, Devanagari, and Myanmar.

In Unicode 4.0.1, the new property Sentence_Terminal is being added. This property will be used in the default sentence boundaries in UAX #29 (Text Boundaries), replacing the list given in the body of that document (under the heading "Term").

The set of characters with the property Sentence_Terminal should include all those characters that typically terminate a sentence. The boundaries of sentences may vary between languages, and even the notion of a "sentence" may not apply well to particular writing systems. However, because this notion is used in various query languages, it is important to have a default across all of Unicode that is as reasonable as possible. Feedback is therefore sought on the composition of this set: whether any characters are missing from it, and whether it contains any characters that should not be there.
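To make the intended use concrete, here is a minimal sketch in Python of a splitter that breaks text after each run of Sentence_Terminal characters. It is not the UAX #29 algorithm and makes no attempt to reproduce its rules. The function name `naive_sentences` and the constant `SENTENCE_TERMINAL` are hypothetical, and the set holds only a small sample of the characters listed in this document.

```python
# Hypothetical sample of Sentence_Terminal characters, copied from the
# lists in this document. This is NOT the complete property value.
SENTENCE_TERMINAL = {
    "\u0021",  # EXCLAMATION MARK
    "\u002E",  # FULL STOP
    "\u003F",  # QUESTION MARK
    "\u0589",  # ARMENIAN FULL STOP
    "\u061F",  # ARABIC QUESTION MARK
    "\u06D4",  # ARABIC FULL STOP
    "\u0964",  # DEVANAGARI DANDA
    "\u3002",  # IDEOGRAPHIC FULL STOP
}


def naive_sentences(text):
    """Split text after each run of Sentence_Terminal characters.

    A deliberately simplistic illustration: it ignores abbreviations,
    decimal points, closing quotation marks, and every other refinement
    a real sentence boundary algorithm would need.
    """
    sentences = []
    current = []
    for i, ch in enumerate(text):
        current.append(ch)
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Break only at the end of a run, so "!!" stays in one sentence.
        if ch in SENTENCE_TERMINAL and next_ch not in SENTENCE_TERMINAL:
            piece = "".join(current).strip()
            if piece:
                sentences.append(piece)
            current = []
    # Keep any trailing text that has no terminator.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive_sentences("Is it? Yes!! Done."))
```

Because the splitter consults nothing but the set, adding or removing a character from Sentence_Terminal directly changes where such a default would break text, which is why the composition of the set matters.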

Related to this new property are two other properties in Unicode: Terminal_Punctuation and Punctuation_Other.

The UTC would like feedback about these two sets as well, especially to see if there are characters in Punctuation_Other that should also be in Terminal_Punctuation, or characters in Terminal_Punctuation that should also be in Sentence_Terminal.

The differences between these sets are listed below for your reference:

In Sentence_Terminal and in Terminal_Punctuation:

U+0021 # EXCLAMATION MARK
U+002E # FULL STOP
U+003F # QUESTION MARK
U+0589 # ARMENIAN FULL STOP
U+061F # ARABIC QUESTION MARK
U+06D4 # ARABIC FULL STOP
U+0700 # SYRIAC END OF PARAGRAPH
U+0701 # SYRIAC SUPRALINEAR FULL STOP
U+0702 # SYRIAC SUBLINEAR FULL STOP
U+0964 # DEVANAGARI DANDA
U+0965 # DEVANAGARI DOUBLE DANDA
U+104A # MYANMAR SIGN LITTLE SECTION
U+104B # MYANMAR SIGN SECTION
U+1362 # ETHIOPIC FULL STOP
U+1367 # ETHIOPIC QUESTION MARK
U+1368 # ETHIOPIC PARAGRAPH SEPARATOR
U+166E # CANADIAN SYLLABICS FULL STOP
U+1803 # MONGOLIAN FULL STOP
U+1809 # MONGOLIAN MANCHU FULL STOP
U+1944 # LIMBU EXCLAMATION MARK
U+1945 # LIMBU QUESTION MARK
U+203C # DOUBLE EXCLAMATION MARK
U+203D # INTERROBANG
U+2047 # DOUBLE QUESTION MARK
U+2048 # QUESTION EXCLAMATION MARK
U+2049 # EXCLAMATION QUESTION MARK
U+3002 # IDEOGRAPHIC FULL STOP
U+FE52 # SMALL FULL STOP
U+FE56 # SMALL QUESTION MARK
U+FE57 # SMALL EXCLAMATION MARK
U+FF01 # FULLWIDTH EXCLAMATION MARK
U+FF0E # FULLWIDTH FULL STOP
U+FF1F # FULLWIDTH QUESTION MARK
U+FF61 # HALFWIDTH IDEOGRAPHIC FULL STOP
Total: 34

In Sentence_Terminal, but not in Terminal_Punctuation:

U+055C # ARMENIAN EXCLAMATION MARK
U+055E # ARMENIAN QUESTION MARK
Total: 2

In Terminal_Punctuation, but not in Sentence_Terminal:

U+002C # COMMA
U+003A # COLON
U+003B # SEMICOLON
U+037E # GREEK QUESTION MARK
U+0387 # GREEK ANO TELEIA
U+060C # ARABIC COMMA
U+061B # ARABIC SEMICOLON
U+0703 # SYRIAC SUPRALINEAR COLON
U+0704 # SYRIAC SUBLINEAR COLON
U+0705 # SYRIAC HORIZONTAL COLON
U+0706 # SYRIAC COLON SKEWED LEFT
U+0707 # SYRIAC COLON SKEWED RIGHT
U+0708 # SYRIAC SUPRALINEAR COLON SKEWED LEFT
U+0709 # SYRIAC SUBLINEAR COLON SKEWED RIGHT
U+070A # SYRIAC CONTRACTION
U+070C # SYRIAC HARKLEAN METOBELUS
U+0E5A # THAI CHARACTER ANGKHANKHU
U+0E5B # THAI CHARACTER KHOMUT
U+1361 # ETHIOPIC WORDSPACE
U+1363 # ETHIOPIC COMMA
U+1364 # ETHIOPIC SEMICOLON
U+1365 # ETHIOPIC COLON
U+1366 # ETHIOPIC PREFACE COLON
U+166D # CANADIAN SYLLABICS CHI SIGN
U+16EB # RUNIC SINGLE PUNCTUATION
U+16EC # RUNIC MULTIPLE PUNCTUATION
U+16ED # RUNIC CROSS PUNCTUATION
U+17D4 # KHMER SIGN KHAN
U+17D5 # KHMER SIGN BARIYOOSAN
U+17D6 # KHMER SIGN CAMNUC PII KUUH
U+17DA # KHMER SIGN KOOMUUT
U+1802 # MONGOLIAN COMMA
U+1804 # MONGOLIAN COLON
U+1805 # MONGOLIAN FOUR DOTS
U+1808 # MONGOLIAN MANCHU COMMA
U+3001 # IDEOGRAPHIC COMMA
U+FE50 # SMALL COMMA
U+FE51 # SMALL IDEOGRAPHIC COMMA
U+FE54 # SMALL SEMICOLON
U+FE55 # SMALL COLON
U+FF0C # FULLWIDTH COMMA
U+FF1A # FULLWIDTH COLON
U+FF1B # FULLWIDTH SEMICOLON
U+FF64 # HALFWIDTH IDEOGRAPHIC COMMA
Total: 44
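The three lists above are the intersection of Sentence_Terminal and Terminal_Punctuation and the two differences between them. As a rough sketch, the same comparison can be reproduced with Python set operations. The two sets below are small samples built from code points in the lists above, not the full property values, and the helper `fmt` is a hypothetical name.

```python
# Sample code points only, taken from the lists in this document.
SENTENCE_TERMINAL = {0x0021, 0x002E, 0x003F, 0x055C, 0x055E, 0x0589, 0x0964}
TERMINAL_PUNCTUATION = {
    0x0021, 0x002C, 0x002E, 0x003A, 0x003B, 0x003F, 0x037E, 0x0589, 0x0964,
}


def fmt(code_points):
    """Render a set of code points as sorted U+XXXX strings."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


in_both = SENTENCE_TERMINAL & TERMINAL_PUNCTUATION
st_not_tp = SENTENCE_TERMINAL - TERMINAL_PUNCTUATION
tp_not_st = TERMINAL_PUNCTUATION - SENTENCE_TERMINAL

print("In both:", fmt(in_both))
print("Sentence_Terminal only:", fmt(st_not_tp))
print("Terminal_Punctuation only:", fmt(tp_not_st))
```

With these samples, the "Sentence_Terminal only" result contains the two Armenian marks, U+055C and U+055E, matching the second list.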

In Punctuation_Other, but not in Terminal_Punctuation:

U+0022 # QUOTATION MARK
U+0023 # NUMBER SIGN
U+0025 # PERCENT SIGN
U+0026 # AMPERSAND
U+0027 # APOSTROPHE
U+002A # ASTERISK
U+002F # SOLIDUS
U+0040 # COMMERCIAL AT
U+005C # REVERSE SOLIDUS
U+00A1 # INVERTED EXCLAMATION MARK
U+00B7 # MIDDLE DOT
U+00BF # INVERTED QUESTION MARK
U+055A # ARMENIAN APOSTROPHE
U+055B # ARMENIAN EMPHASIS MARK
U+055C # ARMENIAN EXCLAMATION MARK
U+055D # ARMENIAN COMMA
U+055E # ARMENIAN QUESTION MARK
U+055F # ARMENIAN ABBREVIATION MARK
U+05BE # HEBREW PUNCTUATION MAQAF
U+05C0 # HEBREW PUNCTUATION PASEQ
U+05C3 # HEBREW PUNCTUATION SOF PASUQ
U+05F3 # HEBREW PUNCTUATION GERESH
U+05F4 # HEBREW PUNCTUATION GERSHAYIM
U+060D # ARABIC DATE SEPARATOR
U+066A # ARABIC PERCENT SIGN
U+066B # ARABIC DECIMAL SEPARATOR
U+066C # ARABIC THOUSANDS SEPARATOR
U+066D # ARABIC FIVE POINTED STAR
U+070B # SYRIAC HARKLEAN OBELUS
U+070D # SYRIAC HARKLEAN ASTERISCUS
U+0970 # DEVANAGARI ABBREVIATION SIGN
U+0DF4 # SINHALA PUNCTUATION KUNDDALIYA
U+0E4F # THAI CHARACTER FONGMAN
U+0F04 # TIBETAN MARK INITIAL YIG MGO MDUN MA
U+0F05 # TIBETAN MARK CLOSING YIG MGO SGAB MA
U+0F06 # TIBETAN MARK CARET YIG MGO PHUR SHAD MA
U+0F07 # TIBETAN MARK YIG MGO TSHEG SHAD MA
U+0F08 # TIBETAN MARK SBRUL SHAD
U+0F09 # TIBETAN MARK BSKUR YIG MGO
U+0F0A # TIBETAN MARK BKA- SHOG YIG MGO
U+0F0B # TIBETAN MARK INTERSYLLABIC TSHEG
U+0F0C # TIBETAN MARK DELIMITER TSHEG BSTAR
U+0F0D # TIBETAN MARK SHAD
U+0F0E # TIBETAN MARK NYIS SHAD
U+0F0F # TIBETAN MARK TSHEG SHAD
U+0F10 # TIBETAN MARK NYIS TSHEG SHAD
U+0F11 # TIBETAN MARK RIN CHEN SPUNGS SHAD
U+0F12 # TIBETAN MARK RGYA GRAM SHAD
U+0F85 # TIBETAN MARK PALUTA
U+104C # MYANMAR SYMBOL LOCATIVE
U+104D # MYANMAR SYMBOL COMPLETED
U+104E # MYANMAR SYMBOL AFOREMENTIONED
U+104F # MYANMAR SYMBOL GENITIVE
U+10FB # GEORGIAN PARAGRAPH SEPARATOR
U+1735 # PHILIPPINE SINGLE PUNCTUATION
U+1736 # PHILIPPINE DOUBLE PUNCTUATION
U+17D8 # KHMER SIGN BEYYAL
U+17D9 # KHMER SIGN PHNAEK MUAN
U+1800 # MONGOLIAN BIRGA
U+1801 # MONGOLIAN ELLIPSIS
U+1807 # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
U+180A # MONGOLIAN NIRUGU
U+2016 # DOUBLE VERTICAL LINE
U+2017 # DOUBLE LOW LINE
U+2020 # DAGGER
U+2021 # DOUBLE DAGGER
U+2022 # BULLET
U+2023 # TRIANGULAR BULLET
U+2024 # ONE DOT LEADER
U+2025 # TWO DOT LEADER
U+2026 # HORIZONTAL ELLIPSIS
U+2027 # HYPHENATION POINT
U+2030 # PER MILLE SIGN
U+2031 # PER TEN THOUSAND SIGN
U+2032 # PRIME
U+2033 # DOUBLE PRIME
U+2034 # TRIPLE PRIME
U+2035 # REVERSED PRIME
U+2036 # REVERSED DOUBLE PRIME
U+2037 # REVERSED TRIPLE PRIME
U+2038 # CARET
U+203B # REFERENCE MARK
U+203E # OVERLINE
U+2041 # CARET INSERTION POINT
U+2042 # ASTERISM
U+2043 # HYPHEN BULLET
U+204A # TIRONIAN SIGN ET
U+204B # REVERSED PILCROW SIGN
U+204C # BLACK LEFTWARDS BULLET
U+204D # BLACK RIGHTWARDS BULLET
U+204E # LOW ASTERISK
U+204F # REVERSED SEMICOLON
U+2050 # CLOSE UP
U+2051 # TWO ASTERISKS ALIGNED VERTICALLY
U+2053 # SWUNG DASH
U+2057 # QUADRUPLE PRIME
U+23B6 # BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET
U+3003 # DITTO MARK
U+303D # PART ALTERNATION MARK
U+FE30 # PRESENTATION FORM FOR VERTICAL TWO DOT LEADER
U+FE45 # SESAME DOT
U+FE46 # WHITE SESAME DOT
U+FE49 # DASHED OVERLINE
U+FE4A # CENTRELINE OVERLINE
U+FE4B # WAVY OVERLINE
U+FE4C # DOUBLE WAVY OVERLINE
U+FE5F # SMALL NUMBER SIGN
U+FE60 # SMALL AMPERSAND
U+FE61 # SMALL ASTERISK
U+FE68 # SMALL REVERSE SOLIDUS
U+FE6A # SMALL PERCENT SIGN
U+FE6B # SMALL COMMERCIAL AT
U+FF02 # FULLWIDTH QUOTATION MARK
U+FF03 # FULLWIDTH NUMBER SIGN
U+FF05 # FULLWIDTH PERCENT SIGN
U+FF06 # FULLWIDTH AMPERSAND
U+FF07 # FULLWIDTH APOSTROPHE
U+FF0A # FULLWIDTH ASTERISK
U+FF0F # FULLWIDTH SOLIDUS
U+FF20 # FULLWIDTH COMMERCIAL AT
U+FF3C # FULLWIDTH REVERSE SOLIDUS
U+10100 # AEGEAN WORD SEPARATOR LINE
U+10101 # AEGEAN WORD SEPARATOR DOT
U+1039F # UGARITIC WORD DIVIDER
Total: 124
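Reviewers who want to spot-check entries in the last list can query a Unicode character database. The sketch below uses Python's standard `unicodedata` module and assumes that Punctuation_Other corresponds to the General_Category value Po. The module reflects whatever Unicode version ships with the Python build, which is later than 4.0.1, so results for some characters may differ from the list above. The names `SAMPLE` and `is_other_punctuation` are hypothetical.

```python
import unicodedata

# A few ASCII and Latin-1 entries from the list above.
SAMPLE = [
    0x0022,  # QUOTATION MARK
    0x0023,  # NUMBER SIGN
    0x0025,  # PERCENT SIGN
    0x0026,  # AMPERSAND
    0x0027,  # APOSTROPHE
    0x002A,  # ASTERISK
    0x002F,  # SOLIDUS
    0x0040,  # COMMERCIAL AT
    0x005C,  # REVERSE SOLIDUS
    0x00A1,  # INVERTED EXCLAMATION MARK
    0x00BF,  # INVERTED QUESTION MARK
]


def is_other_punctuation(code_point):
    """Return True if the code point has General_Category Po
    (assumed here to be what this document calls Punctuation_Other)."""
    return unicodedata.category(chr(code_point)) == "Po"


for cp in SAMPLE:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), is_other_punctuation(cp))
```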