L2/02-071

To: UTC
Re: Default Sentence Boundary Definition
From: Mark Davis
Date: 2002-02-08

After looking at the default word boundaries (L2/02-047), I also took a look at the default sentence boundaries. They also need some work. Unlike the word boundaries, I don't think we can settle this one in the meeting, but we can start work on it.

The current definition is in Table 5-6 of <http://www.unicode.org/unicode/uni2book/ch05.pdf>. The basic problems are that it:

Here is a first -- very rough -- pass at a fix:

Table 5-6. Default Sentence Boundaries


Character Classes

CR Carriage Return
LF Linefeed
Sep CR | LF | NEL | LS | PS
Sp Whitespace - Sep
Term Terminal_Punctuation1 | Terminal_Punctuation2
ATerm Terminal_Punctuation2
Lower Lowercase | General_Category = Letter | General_Category = Modifier_Symbol
Upper Uppercase | General_Category = Titlecase_Letter
Open General_Category = Open_Punctuation

Rules

Don't break CRLF; otherwise break after paragraph separators

CR × LF
Sep ÷
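
As a rough illustration of these two rules, here is a minimal Python sketch. The function name paragraph_breaks is hypothetical, and the code points used for NEL, LS, and PS (U+0085, U+2028, U+2029) are the standard ones rather than anything stated in this memo.

```python
# Sep = CR | LF | NEL | LS | PS, following the class table above.
SEP = {"\r", "\n", "\u0085", "\u2028", "\u2029"}

def paragraph_breaks(text):
    """Return the offsets at which a break follows a paragraph separator."""
    breaks = []
    for i, ch in enumerate(text):
        if ch not in SEP:
            continue
        # CR × LF: never break between a CR and an immediately following LF.
        if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            continue
        # Sep ÷: break after the separator.
        breaks.append(i + 1)
    return breaks
```

For the string "a\r\nb\nc" this reports breaks at offsets 3 and 5, so the CRLF pair stays together.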

Each Ignorable is treated as if it were the type of the previous letter.

X Ignorable => X X
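
One way to read this rewrite rule is as a preprocessing pass over a sequence of class names. The sketch below assumes the classes have already been assigned by some other step; the name resolve_ignorables and the string labels are hypothetical, and the memo does not say which characters are Ignorable.

```python
def resolve_ignorables(classes):
    """Apply 'X Ignorable => X X' left to right over a list of class names."""
    resolved = []
    for cls in classes:
        if cls == "Ignorable" and resolved:
            # Take on the (already resolved) class of the preceding element.
            resolved.append(resolved[-1])
        else:
            # A leading Ignorable has nothing before it and is left alone.
            resolved.append(cls)
    return resolved
```

Because the pass works on the resolved list, a run of several Ignorables all inherit the class of the character before the run.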

Don't break after ambiguous terminators like period if the first following letter is lowercase, or if the preceding word contains an uppercase letter. For example, a period may be an abbreviation or numeric period, and not mark the end of a sentence.

ATerm Close* Sp* {Sep} × ( ¬Letter )* Lowercase
 Upper Lower* ATerm Close* Sp* {Sep} ×
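
The two checks above might be sketched as follows. This is only an approximation under stated assumptions: Python's str.isalpha, str.islower, str.isupper, and str.istitle stand in for the Letter, Lower, and Upper classes, which they do not match exactly, and all three function names are hypothetical.

```python
def next_letter_is_lowercase(text, pos):
    """True if the first letter at or after pos is lowercase."""
    for ch in text[pos:]:
        if ch.isalpha():  # everything skipped before this is ( ¬Letter )*
            return ch.islower()
    return False

def follows_capitalized_word(text, aterm_index):
    """True if the text just before the terminator matches Upper Lower*."""
    i = aterm_index - 1
    while i >= 0 and text[i].islower():
        i -= 1
    return i >= 0 and (text[i].isupper() or text[i].istitle())

def suppress_break(text, aterm_index, break_pos):
    """True if either rule forbids a break at break_pos.

    aterm_index is the offset of the ambiguous terminator; break_pos is the
    offset just past any closing punctuation, spaces, and separator.
    """
    return (next_letter_is_lowercase(text, break_pos)
            or follows_capitalized_word(text, aterm_index))
```

With these approximations, suppress_break("Mr. Smith", 2, 4) is true because of the second rule, and suppress_break("etc. and so on", 3, 5) is true because of the first.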

Break after sentence terminators, but include closing punctuation, trailing spaces, and (optionally) a paragraph separator.

Term Close* Sp* {Sep} ÷
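
This rule maps fairly directly onto a regular expression. The sketch below is illustrative only: it uses a three-character stand-in for Term, and the Close and Sp sets are assumptions, since this memo does not spell them out. The name candidate_breaks is hypothetical.

```python
import re

TERM = "[.!?]"                             # small stand-in for the Term class
CLOSE = "[\"')\\]}\u2019\u201d]"           # assumed closing punctuation
SP = "[ \t\u00a0]"                         # assumed spaces (Whitespace - Sep)
SEP = "(?:\r\n|[\r\n\u0085\u2028\u2029])"  # CRLF is kept as a single unit

# Term Close* Sp* {Sep} ÷
PATTERN = re.compile(TERM + CLOSE + "*" + SP + "*" + SEP + "?")

def candidate_breaks(text):
    """Return offsets just past each terminator and its trailing material."""
    return [m.end() for m in PATTERN.finditer(text)]
```

These offsets are only candidates: a fuller implementation would first apply the "don't break" rules for ambiguous terminators before accepting any of them.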

Remember that these are, like the others, default boundaries, and may be tailored. We cannot easily, without much more sophisticated analysis, distinguish between cases like:

He said, "Are you going?" Mr. Smith shook his head.
"Are you going?" Mr. Smith asked.

Now, what are these numbered Terminal_Punctuations? Looking at the UCD, we have a class called Terminal_Punctuation that is a guideline to what we need, but includes three kinds of characters that we have to separate out for this. (And we may find more!)

Here is my first, very rough, cut:

Terminal_Punctuation1: characters that definitely end a sentence.
0021          ; Terminal_Punctuation # Po       EXCLAMATION MARK
003F          ; Terminal_Punctuation # Po       QUESTION MARK
037E          ; Terminal_Punctuation # Po       GREEK QUESTION MARK
061F          ; Terminal_Punctuation # Po       ARABIC QUESTION MARK
06D4          ; Terminal_Punctuation # Po       ARABIC FULL STOP
203C..203D    ; Terminal_Punctuation # Po   [2] DOUBLE EXCLAMATION MARK..INTERROBANG
2048..2049    ; Terminal_Punctuation # Po   [2] QUESTION EXCLAMATION MARK..EXCLAMATION QUESTION MARK
3002          ; Terminal_Punctuation # Po       IDEOGRAPHIC FULL STOP

Terminal_Punctuation2: characters that could be part of an abbreviation or end a sentence.
002E          ; Terminal_Punctuation # Po       FULL STOP
0589          ; Terminal_Punctuation # Po       ARMENIAN FULL STOP
3001          ; Terminal_Punctuation # Po       IDEOGRAPHIC COMMA

Terminal_Punctuation3: characters irrelevant to sentence boundaries
002C          ; Terminal_Punctuation # Po       COMMA
003A..003B    ; Terminal_Punctuation # Po   [2] COLON..SEMICOLON
0387          ; Terminal_Punctuation # Po       GREEK ANO TELEIA
060C          ; Terminal_Punctuation # Po       ARABIC COMMA
061B          ; Terminal_Punctuation # Po       ARABIC SEMICOLON
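
For experimenting with the three lists above, they can be transcribed into Python sets. This is a direct copy of the code points as listed here, not an official property; the helper expand and the constant names are hypothetical. Term and ATerm are then derived as in the character class table.

```python
def expand(*ranges):
    """Turn inclusive (first, last) code point pairs into a set of characters."""
    return {chr(cp) for first, last in ranges for cp in range(first, last + 1)}

# Characters that definitely end a sentence.
TERMINAL_PUNCTUATION_1 = expand(
    (0x0021, 0x0021), (0x003F, 0x003F), (0x037E, 0x037E), (0x061F, 0x061F),
    (0x06D4, 0x06D4), (0x203C, 0x203D), (0x2048, 0x2049), (0x3002, 0x3002),
)
# Characters that could be part of an abbreviation or end a sentence.
TERMINAL_PUNCTUATION_2 = expand((0x002E, 0x002E), (0x0589, 0x0589), (0x3001, 0x3001))
# Characters irrelevant to sentence boundaries.
TERMINAL_PUNCTUATION_3 = expand(
    (0x002C, 0x002C), (0x003A, 0x003B), (0x0387, 0x0387),
    (0x060C, 0x060C), (0x061B, 0x061B),
)

TERM = TERMINAL_PUNCTUATION_1 | TERMINAL_PUNCTUATION_2
ATERM = TERMINAL_PUNCTUATION_2
```

Keeping the third group as its own set makes it easy to check that nothing in it leaks into Term as the lists are revised.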

Oh coony day! (leveraging my French)
0700..070A    ; Terminal_Punctuation # Po  [11] SYRIAC END OF PARAGRAPH..SYRIAC CONTRACTION
070C          ; Terminal_Punctuation # Po       SYRIAC HARKLEAN METOBELUS
0964..0965    ; Terminal_Punctuation # Po   [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
0E5A..0E5B    ; Terminal_Punctuation # Po   [2] THAI CHARACTER ANGKHANKHU..THAI CHARACTER KHOMUT
104A..104B    ; Terminal_Punctuation # Po   [2] MYANMAR SIGN LITTLE SECTION..MYANMAR SIGN SECTION
1361..1368    ; Terminal_Punctuation # Po   [8] ETHIOPIC WORDSPACE..ETHIOPIC PARAGRAPH SEPARATOR
166D..166E    ; Terminal_Punctuation # Po   [2] CANADIAN SYLLABICS CHI SIGN..CANADIAN SYLLABICS FULL STOP
16EB..16ED    ; Terminal_Punctuation # Po   [3] RUNIC SINGLE PUNCTUATION..RUNIC CROSS PUNCTUATION
17D4..17D6    ; Terminal_Punctuation # Po   [3] KHMER SIGN KHAN..KHMER SIGN CAMNUC PII KUUH
17DA          ; Terminal_Punctuation # Po       KHMER SIGN KOOMUUT
1802..1805    ; Terminal_Punctuation # Po   [4] MONGOLIAN COMMA..MONGOLIAN FOUR DOTS
1808..1809    ; Terminal_Punctuation # Po   [2] MONGOLIAN MANCHU COMMA..MONGOLIAN MANCHU FULL STOP