L2/03-145

Re: Proposed Sentence_Terminal
From: Mark Davis
Date: 2003-04-18

We received a request to have the Term boundary class in UAX #29's Sentence Boundaries be a regular property. Currently it is a proper subset of the property Terminal_Punctuation, but it is the only boundary class in UAX #29 that is just a big ol' list: it is not derived from other properties (possibly with a small list of exceptions). So the proposal for discussion is:

  1. Make a new property in Prop_List called Sentence_Terminal (STerm), with the contents listed in (A) below.
  2. Modify UAX #29 so that Term = STerm - ATerm.
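
To make step 2 concrete, here is a minimal sketch of the set difference in Python. STERM holds the code points from list (A); the contents of ATERM are an assumption made for illustration (FULL STOP only), so consult UAX #29 for the actual class.

```python
# Sentence_Terminal (STerm): the code points proposed in list (A).
STERM = frozenset({
    0x0021, 0x002E, 0x003F, 0x0589, 0x061F, 0x06D4, 0x0700, 0x0701,
    0x0702, 0x0964, 0x104A, 0x104B, 0x1362, 0x1367, 0x1368, 0x166E,
    0x1803, 0x1809, 0x203C, 0x203D, 0x2047, 0x2048, 0x2049, 0x3002,
    0xFE52, 0xFE57, 0xFF01, 0xFF0E, 0xFF1F, 0xFF61,
})

# ASSUMPTION: ATerm is taken to contain only FULL STOP for this sketch.
# The real contents are defined by UAX #29, not by this memo.
ATERM = frozenset({0x002E})

# Step 2 of the proposal: Term = STerm - ATerm (set difference).
TERM = STERM - ATERM


def is_term(cp: int) -> bool:
    """Return True if the code point falls in the derived Term class."""
    return cp in TERM
```

With this derivation, Term no longer needs its own hand-maintained list: any change to STerm or ATerm carries through automatically.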

The proposed list (A) is followed by two comparison lists:

(B) characters in Terminal_Punctuation that are not in STerm, and
(C) characters in General_Category = Other_Punctuation (Po) that are not in Terminal_Punctuation.
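
The way the three lists relate can be sketched with plain set operations. The sample sets below are small excerpts chosen for illustration, not the full properties, and the variable names are hypothetical.

```python
# Small excerpts only; the full sets are the lists in this document.
STERM_SAMPLE = {0x0021, 0x002E, 0x003F}                          # ! . ?
TERMINAL_PUNCTUATION_SAMPLE = STERM_SAMPLE | {0x002C, 0x003A, 0x003B}  # , : ;
PO_SAMPLE = TERMINAL_PUNCTUATION_SAMPLE | {0x0022, 0x0023, 0x0025}     # " # %

# (A) is STerm itself.
list_a = set(STERM_SAMPLE)
# (B) is Terminal_Punctuation minus STerm.
list_b = TERMINAL_PUNCTUATION_SAMPLE - STERM_SAMPLE
# (C) is Po minus Terminal_Punctuation.
list_c = PO_SAMPLE - TERMINAL_PUNCTUATION_SAMPLE


def fmt(code_points):
    """Format code points in the U+XXXX style used by the lists."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]
```

By construction the three lists are pairwise disjoint, so moving a character "up" means adding it to the larger property as well as removing it from the lower list.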

By the way, a casual look at these lists suggests that some of the characters should be moved up from C to B and from B to A; for example, if COMMA qualifies for Terminal_Punctuation, it seems that AEGEAN WORD SEPARATOR DOT should also.

A. Sentence_Terminal (STerm):
U+0021 # (!) Po EXCLAMATION MARK
U+002E # (.) Po FULL STOP
U+003F # (?) Po QUESTION MARK
U+0589 # (։) Po ARMENIAN FULL STOP
U+061F # (؟) Po ARABIC QUESTION MARK
U+06D4 # (۔) Po ARABIC FULL STOP
U+0700 # (܀) Po SYRIAC END OF PARAGRAPH
U+0701 # (܁) Po SYRIAC SUPRALINEAR FULL STOP
U+0702 # (܂) Po SYRIAC SUBLINEAR FULL STOP
U+0964 # (।) Po DEVANAGARI DANDA
U+104A # (၊) Po MYANMAR SIGN LITTLE SECTION
U+104B # (။) Po MYANMAR SIGN SECTION
U+1362 # (።) Po ETHIOPIC FULL STOP
U+1367 # (፧) Po ETHIOPIC QUESTION MARK
U+1368 # (፨) Po ETHIOPIC PARAGRAPH SEPARATOR
U+166E # (᙮) Po CANADIAN SYLLABICS FULL STOP
U+1803 # (᠃) Po MONGOLIAN FULL STOP
U+1809 # (᠉) Po MONGOLIAN MANCHU FULL STOP
U+203C # (‼) Po DOUBLE EXCLAMATION MARK
U+203D # (‽) Po INTERROBANG
U+2047 # (⁇) Po DOUBLE QUESTION MARK
U+2048 # (⁈) Po QUESTION EXCLAMATION MARK
U+2049 # (⁉) Po EXCLAMATION QUESTION MARK
U+3002 # (。) Po IDEOGRAPHIC FULL STOP
U+FE52 # (﹒) Po SMALL FULL STOP
U+FE57 # (﹗) Po SMALL EXCLAMATION MARK
U+FF01 # (！) Po FULLWIDTH EXCLAMATION MARK
U+FF0E # (．) Po FULLWIDTH FULL STOP
U+FF1F # (？) Po FULLWIDTH QUESTION MARK
U+FF61 # (｡) Po HALFWIDTH IDEOGRAPHIC FULL STOP

B. Terminal_Punctuation but not in Sentence_Terminal:

U+002C # (,) Po COMMA
U+003A # (:) Po COLON
U+003B # (;) Po SEMICOLON
U+037E # (;) Po GREEK QUESTION MARK
U+0387 # (·) Po GREEK ANO TELEIA
U+060C # (،) Po ARABIC COMMA
U+061B # (؛) Po ARABIC SEMICOLON
U+0703 # (܃) Po SYRIAC SUPRALINEAR COLON
U+0704 # (܄) Po SYRIAC SUBLINEAR COLON
U+0705 # (܅) Po SYRIAC HORIZONTAL COLON
U+0706 # (܆) Po SYRIAC COLON SKEWED LEFT
U+0707 # (܇) Po SYRIAC COLON SKEWED RIGHT
U+0708 # (܈) Po SYRIAC SUPRALINEAR COLON SKEWED LEFT
U+0709 # (܉) Po SYRIAC SUBLINEAR COLON SKEWED RIGHT
U+070A # (܊) Po SYRIAC CONTRACTION
U+070C # (܌) Po SYRIAC HARKLEAN METOBELUS
U+0965 # (॥) Po DEVANAGARI DOUBLE DANDA
U+0E5A # (๚) Po THAI CHARACTER ANGKHANKHU
U+0E5B # (๛) Po THAI CHARACTER KHOMUT
U+1361 # (፡) Po ETHIOPIC WORDSPACE
U+1363 # (፣) Po ETHIOPIC COMMA
U+1364 # (፤) Po ETHIOPIC SEMICOLON
U+1365 # (፥) Po ETHIOPIC COLON
U+1366 # (፦) Po ETHIOPIC PREFACE COLON
U+166D # (᙭) Po CANADIAN SYLLABICS CHI SIGN
U+16EB # (᛫) Po RUNIC SINGLE PUNCTUATION
U+16EC # (᛬) Po RUNIC MULTIPLE PUNCTUATION
U+16ED # (᛭) Po RUNIC CROSS PUNCTUATION
U+17D4 # (។) Po KHMER SIGN KHAN
U+17D5 # (៕) Po KHMER SIGN BARIYOOSAN
U+17D6 # (៖) Po KHMER SIGN CAMNUC PII KUUH
U+17DA # (៚) Po KHMER SIGN KOOMUUT
U+1802 # (᠂) Po MONGOLIAN COMMA
U+1804 # (᠄) Po MONGOLIAN COLON
U+1805 # (᠅) Po MONGOLIAN FOUR DOTS
U+1808 # (᠈) Po MONGOLIAN MANCHU COMMA
U+1944 # (᥄) Po LIMBU EXCLAMATION MARK
U+1945 # (᥅) Po LIMBU QUESTION MARK
U+3001 # (、) Po IDEOGRAPHIC COMMA
U+FE50 # (﹐) Po SMALL COMMA
U+FE51 # (﹑) Po SMALL IDEOGRAPHIC COMMA
U+FE54 # (﹔) Po SMALL SEMICOLON
U+FE55 # (﹕) Po SMALL COLON
U+FE56 # (﹖) Po SMALL QUESTION MARK
U+FF0C # (，) Po FULLWIDTH COMMA
U+FF1A # (：) Po FULLWIDTH COLON
U+FF1B # (；) Po FULLWIDTH SEMICOLON
U+FF64 # (､) Po HALFWIDTH IDEOGRAPHIC COMMA

C. In Po (Other Punctuation), but not in Terminal_Punctuation:

U+0022 # (") Po QUOTATION MARK
U+0023 # (#) Po NUMBER SIGN
U+0025 # (%) Po PERCENT SIGN
U+0026 # (&) Po AMPERSAND
U+0027 # (') Po APOSTROPHE
U+002A # (*) Po ASTERISK
U+002F # (/) Po SOLIDUS
U+0040 # (@) Po COMMERCIAL AT
U+005C # (\) Po REVERSE SOLIDUS
U+00A1 # (¡) Po INVERTED EXCLAMATION MARK
U+00B7 # (·) Po MIDDLE DOT
U+00BF # (¿) Po INVERTED QUESTION MARK
U+055A # (՚) Po ARMENIAN APOSTROPHE
U+055B # (՛) Po ARMENIAN EMPHASIS MARK
U+055C # (՜) Po ARMENIAN EXCLAMATION MARK
U+055D # (՝) Po ARMENIAN COMMA
U+055E # (՞) Po ARMENIAN QUESTION MARK
U+055F # (՟) Po ARMENIAN ABBREVIATION MARK
U+05BE # (־) Po HEBREW PUNCTUATION MAQAF
U+05C0 # (׀) Po HEBREW PUNCTUATION PASEQ
U+05C3 # (׃) Po HEBREW PUNCTUATION SOF PASUQ
U+05F3 # (׳) Po HEBREW PUNCTUATION GERESH
U+05F4 # (״) Po HEBREW PUNCTUATION GERSHAYIM
U+060D # (؍) Po ARABIC DATE SEPARATOR
U+066A # (٪) Po ARABIC PERCENT SIGN
U+066B # (٫) Po ARABIC DECIMAL SEPARATOR
U+066C # (٬) Po ARABIC THOUSANDS SEPARATOR
U+066D # (٭) Po ARABIC FIVE POINTED STAR
U+070B # (܋) Po SYRIAC HARKLEAN OBELUS
U+070D # (܍) Po SYRIAC HARKLEAN ASTERISCUS
U+0970 # (॰) Po DEVANAGARI ABBREVIATION SIGN
U+0DF4 # (෴) Po SINHALA PUNCTUATION KUNDDALIYA
U+0E4F # (๏) Po THAI CHARACTER FONGMAN
U+0F04 # (༄) Po TIBETAN MARK INITIAL YIG MGO MDUN MA
U+0F05 # (༅) Po TIBETAN MARK CLOSING YIG MGO SGAB MA
U+0F06 # (༆) Po TIBETAN MARK CARET YIG MGO PHUR SHAD MA
U+0F07 # (༇) Po TIBETAN MARK YIG MGO TSHEG SHAD MA
U+0F08 # (༈) Po TIBETAN MARK SBRUL SHAD
U+0F09 # (༉) Po TIBETAN MARK BSKUR YIG MGO
U+0F0A # (༊) Po TIBETAN MARK BKA- SHOG YIG MGO
U+0F0B # (་) Po TIBETAN MARK INTERSYLLABIC TSHEG
U+0F0C # (༌) Po TIBETAN MARK DELIMITER TSHEG BSTAR
U+0F0D # (།) Po TIBETAN MARK SHAD
U+0F0E # (༎) Po TIBETAN MARK NYIS SHAD
U+0F0F # (༏) Po TIBETAN MARK TSHEG SHAD
U+0F10 # (༐) Po TIBETAN MARK NYIS TSHEG SHAD
U+0F11 # (༑) Po TIBETAN MARK RIN CHEN SPUNGS SHAD
U+0F12 # (༒) Po TIBETAN MARK RGYA GRAM SHAD
U+0F85 # (྅) Po TIBETAN MARK PALUTA
U+104C # (၌) Po MYANMAR SYMBOL LOCATIVE
U+104D # (၍) Po MYANMAR SYMBOL COMPLETED
U+104E # (၎) Po MYANMAR SYMBOL AFOREMENTIONED
U+104F # (၏) Po MYANMAR SYMBOL GENITIVE
U+10FB # (჻) Po GEORGIAN PARAGRAPH SEPARATOR
U+1735 # (᜵) Po PHILIPPINE SINGLE PUNCTUATION
U+1736 # (᜶) Po PHILIPPINE DOUBLE PUNCTUATION
U+17D8 # (៘) Po KHMER SIGN BEYYAL
U+17D9 # (៙) Po KHMER SIGN PHNAEK MUAN
U+1800 # (᠀) Po MONGOLIAN BIRGA
U+1801 # (᠁) Po MONGOLIAN ELLIPSIS
U+1807 # (᠇) Po MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
U+180A # (᠊) Po MONGOLIAN NIRUGU
U+2016 # (‖) Po DOUBLE VERTICAL LINE
U+2017 # (‗) Po DOUBLE LOW LINE
U+2020 # (†) Po DAGGER
U+2021 # (‡) Po DOUBLE DAGGER
U+2022 # (•) Po BULLET
U+2023 # (‣) Po TRIANGULAR BULLET
U+2024 # (․) Po ONE DOT LEADER
U+2025 # (‥) Po TWO DOT LEADER
U+2026 # (…) Po HORIZONTAL ELLIPSIS
U+2027 # (‧) Po HYPHENATION POINT
U+2030 # (‰) Po PER MILLE SIGN
U+2031 # (‱) Po PER TEN THOUSAND SIGN
U+2032 # (′) Po PRIME
U+2033 # (″) Po DOUBLE PRIME
U+2034 # (‴) Po TRIPLE PRIME
U+2035 # (‵) Po REVERSED PRIME
U+2036 # (‶) Po REVERSED DOUBLE PRIME
U+2037 # (‷) Po REVERSED TRIPLE PRIME
U+2038 # (‸) Po CARET
U+203B # (※) Po REFERENCE MARK
U+203E # (‾) Po OVERLINE
U+2041 # (⁁) Po CARET INSERTION POINT
U+2042 # (⁂) Po ASTERISM
U+2043 # (⁃) Po HYPHEN BULLET
U+204A # (⁊) Po TIRONIAN SIGN ET
U+204B # (⁋) Po REVERSED PILCROW SIGN
U+204C # (⁌) Po BLACK LEFTWARDS BULLET
U+204D # (⁍) Po BLACK RIGHTWARDS BULLET
U+204E # (⁎) Po LOW ASTERISK
U+204F # (⁏) Po REVERSED SEMICOLON
U+2050 # (⁐) Po CLOSE UP
U+2051 # (⁑) Po TWO ASTERISKS ALIGNED VERTICALLY
U+2053 # (⁓) Po SWUNG DASH
U+2057 # (⁗) Po QUADRUPLE PRIME
U+23B6 # (⎶) Po BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET
U+3003 # (〃) Po DITTO MARK
U+303D # (〽) Po PART ALTERNATION MARK
U+FE30 # (︰) Po PRESENTATION FORM FOR VERTICAL TWO DOT LEADER
U+FE45 # (﹅) Po SESAME DOT
U+FE46 # (﹆) Po WHITE SESAME DOT
U+FE49 # (﹉) Po DASHED OVERLINE
U+FE4A # (﹊) Po CENTRELINE OVERLINE
U+FE4B # (﹋) Po WAVY OVERLINE
U+FE4C # (﹌) Po DOUBLE WAVY OVERLINE
U+FE5F # (﹟) Po SMALL NUMBER SIGN
U+FE60 # (﹠) Po SMALL AMPERSAND
U+FE61 # (﹡) Po SMALL ASTERISK
U+FE68 # (﹨) Po SMALL REVERSE SOLIDUS
U+FE6A # (﹪) Po SMALL PERCENT SIGN
U+FE6B # (﹫) Po SMALL COMMERCIAL AT
U+FF02 # (＂) Po FULLWIDTH QUOTATION MARK
U+FF03 # (＃) Po FULLWIDTH NUMBER SIGN
U+FF05 # (％) Po FULLWIDTH PERCENT SIGN
U+FF06 # (＆) Po FULLWIDTH AMPERSAND
U+FF07 # (＇) Po FULLWIDTH APOSTROPHE
U+FF0A # (＊) Po FULLWIDTH ASTERISK
U+FF0F # (／) Po FULLWIDTH SOLIDUS
U+FF20 # (＠) Po FULLWIDTH COMMERCIAL AT
U+FF3C # (＼) Po FULLWIDTH REVERSE SOLIDUS
U+10100 # (𐄀) Po AEGEAN WORD SEPARATOR LINE
U+10101 # (𐄁) Po AEGEAN WORD SEPARATOR DOT
U+1039F # (𐎟) Po UGARITIC WORD DIVIDER
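
The entries in lists (A) through (C) share one layout: code point, glyph in parentheses, General_Category, then the character name. As a rough sketch, assuming that layout holds for every line, they could be read into tuples like this. The ENTRY pattern and parse_entry function are hypothetical helpers, not part of any Unicode tooling.

```python
import re

# Matches lines such as "U+0021 # (!) Po EXCLAMATION MARK".
# ASSUMPTION: every entry follows this exact layout.
ENTRY = re.compile(r"^U\+([0-9A-F]{4,6}) # \((.*)\) ([A-Z][a-z]) (.+)$")


def parse_entry(line):
    """Return (code_point, glyph, category, name), or None if no match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    code_point, glyph, category, name = match.groups()
    return int(code_point, 16), glyph, category, name


def parse_list(text):
    """Collect the code points of all well-formed entries in a block of text."""
    parsed = (parse_entry(line) for line in text.splitlines())
    return {entry[0] for entry in parsed if entry is not None}
```

Headings and blank lines do not match the pattern, so parse_list skips them.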