L2/02-071
To: | UTC |
Re: | Default Sentence Boundary Definition |
From: | Mark Davis |
Date: | 2002-02-08 |
After looking at the default word boundaries (L2/02-047), I also took a look at the default sentence boundaries. They also need some work. Unlike the word boundaries, I don't think we can settle this one in the meeting, but we can start work on it.
The current definition is in Table 5-6 of <http://www.unicode.org/unicode/uni2book/ch05.pdf>. The basic problems are that it:
is only defined for Latin punctuation
doesn't precisely reference properties in the UCD (predating many of them)
doesn't explicitly handle combining marks (although in practice this is not really a problem for sentence boundaries).
Here is a first -- very rough -- pass at a fix:
Table 5-6. Default Sentence Boundaries
CR | Carriage Return |
LF | Linefeed |
Sep | CR OR LF OR NEL OR LS OR PS |
Sp | Whitespace - Sep |
Term | Terminal_Punctuation1 OR Terminal_Punctuation2 |
ATerm | Terminal_Punctuation2 |
Lower | Lowercase OR General_Category = Letter OR General_Category = Modifier_Symbol |
Upper | Uppercase OR General_Category = Titlecase_Letter |
Open | General_Category = Open_Punctuation |
Don't break CRLF; otherwise break after paragraph separators |
|||
CR | × | LF | |
Sep | ÷ | ||
Each Ignorable is treated as if it were the type of the previous letter. |
|||
X Ignorable => X X |
|||
Don't break after ambiguous terminators like period if the first following letter is lowercase, or if the preceding word starts with an uppercase letter followed only by lowercase letters. For example, a period may be an abbreviation or numeric period, and not mark the end of a sentence. |
|||
ATerm Close* Sp* {Sep} | × | ( ¬Letter )* Lowercase | |
Upper Lower* ATerm Close* Sp* {Sep} | × | ||
Break after sentence terminators, but include closing punctuation, trailing spaces, and (optionally) a paragraph separator. |
|||
Term Close* Sp* {Sep} | ÷ |
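The rules in the table can be sketched roughly in Python as follows. This is only an illustration under several assumptions, not a reference implementation: `Close` is not defined in the draft table, so the hypothetical `is_close` helper treats ASCII quotes and General_Category Pe/Pf as closing punctuation; `Lower` and `Upper` are approximated with Python's `str.islower` and `str.isupper`; the Ignorable rule is omitted; and the `Term`/`ATerm` sets use only the code points proposed in this memo. Earlier rules take precedence, so a paragraph separator always forces a break.

```python
import unicodedata

# Assumed character classes, approximating the draft table.
SEP = {"\r", "\n", "\x85", "\u2028", "\u2029"}      # CR, LF, NEL, LS, PS
ATERM = {".", "\u0589", "\u3001"}                    # ambiguous terminators
TERM = ATERM | {"!", "?", "\u037e", "\u061f", "\u06d4",
                "\u203c", "\u203d", "\u3002", "\u2048", "\u2049"}

def is_sp(ch):
    # Sp = Whitespace - Sep
    return ch.isspace() and ch not in SEP

def is_close(ch):
    # Assumption: the draft does not define Close.
    return ch in "\"'" or unicodedata.category(ch) in ("Pe", "Pf")

def is_upper(ch):
    # Upper = Uppercase OR General_Category = Titlecase_Letter
    return ch.isupper() or unicodedata.category(ch) == "Lt"

def skip_sep(text, j):
    # Step over one optional Sep at j, keeping CR LF together.
    if j < len(text) and text[j] in SEP:
        return j + 2 if text.startswith("\r\n", j) else j + 1
    return j

def suppressed(text, term_idx, after):
    # ATerm Close* Sp* {Sep} x (not Letter)* Lowercase
    k = after
    while k < len(text) and not text[k].isalpha():
        k += 1
    if k < len(text) and text[k].islower():
        return True
    # Upper Lower* ATerm Close* Sp* {Sep} x
    k = term_idx - 1
    while k >= 0 and text[k].islower():
        k -= 1
    return k >= 0 and is_upper(text[k])

def sentence_breaks(text):
    """Return the offsets at which a sentence break follows."""
    breaks, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch in SEP:
            # Don't break CRLF; otherwise break after separators.
            i = skip_sep(text, i)
            breaks.append(i)
        elif ch in TERM:
            j = i + 1
            while j < n and is_close(text[j]):
                j += 1
            while j < n and is_sp(text[j]):
                j += 1
            if ch in ATERM and suppressed(text, i, j):
                i = j              # no break; a following Sep still breaks
            else:
                i = skip_sep(text, j)
                breaks.append(i)
        else:
            i += 1
    if n and (not breaks or breaks[-1] != n):
        breaks.append(n)           # end of text closes the last sentence
    return breaks

def split_sentences(text):
    out, start = [], 0
    for b in sentence_breaks(text):
        out.append(text[start:b])
        start = b
    return out
```

For instance, `split_sentences('He said, "Are you going?" Mr. Smith shook his head.')` yields two pieces, with the break falling after the closing quote and its trailing space, while the period in "Mr." is not treated as a sentence end. Note that the `Upper Lower*` rule is deliberately crude: it also suppresses a break after any capitalized word directly followed by a period.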
Remember that these are, like the others, default boundaries, and may be tailored. We cannot easily, without much more sophisticated analysis, distinguish between cases like:
He said, "Are you going?" Mr. Smith shook his head.
"Are you going?" Mr. Smith asked.
Now, what are these numbered Terminal_Punctuations? Looking at the UCD, we have a class called Terminal_Punctuation that is a guideline to what we need, but includes three kinds of characters that we have to separate out for this. (And we may find more!)
Here is my first, very rough, cut:
Terminal_Punctuation1: characters that definitely end a sentence.
0021 ; Terminal_Punctuation # Po EXCLAMATION MARK
003F ; Terminal_Punctuation # Po QUESTION MARK
037E ; Terminal_Punctuation # Po GREEK QUESTION MARK
061F ; Terminal_Punctuation # Po ARABIC QUESTION MARK
06D4 ; Terminal_Punctuation # Po ARABIC FULL STOP
203C..203D ; Terminal_Punctuation # Po [2] DOUBLE EXCLAMATION MARK..INTERROBANG
3002 ; Terminal_Punctuation # Po IDEOGRAPHIC FULL STOP
2048..2049 ; Terminal_Punctuation # Po [2] QUESTION EXCLAMATION MARK..EXCLAMATION QUESTION MARK
Terminal_Punctuation2: characters that could be part of an abbreviation or end a sentence.
002E ; Terminal_Punctuation # Po FULL STOP
0589 ; Terminal_Punctuation # Po ARMENIAN FULL STOP
3001 ; Terminal_Punctuation # Po IDEOGRAPHIC COMMA
Terminal_Punctuation3: characters irrelevant to sentence boundaries.
002C ; Terminal_Punctuation # Po COMMA
003A..003B ; Terminal_Punctuation # Po [2] COLON..SEMICOLON
0387 ; Terminal_Punctuation # Po GREEK ANO TELEIA
060C ; Terminal_Punctuation # Po ARABIC COMMA
061B ; Terminal_Punctuation # Po ARABIC SEMICOLON
Oh coony day! (leveraging my French)
0700..070A ; Terminal_Punctuation # Po [11] SYRIAC END OF PARAGRAPH..SYRIAC CONTRACTION
070C ; Terminal_Punctuation # Po SYRIAC HARKLEAN METOBELUS
0964..0965 ; Terminal_Punctuation # Po [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
0E5A..0E5B ; Terminal_Punctuation # Po [2] THAI CHARACTER ANGKHANKHU..THAI CHARACTER KHOMUT
104A..104B ; Terminal_Punctuation # Po [2] MYANMAR SIGN LITTLE SECTION..MYANMAR SIGN SECTION
1361..1368 ; Terminal_Punctuation # Po [8] ETHIOPIC WORDSPACE..ETHIOPIC PARAGRAPH SEPARATOR
166D..166E ; Terminal_Punctuation # Po [2] CANADIAN SYLLABICS CHI SIGN..CANADIAN SYLLABICS FULL STOP
16EB..16ED ; Terminal_Punctuation # Po [3] RUNIC SINGLE PUNCTUATION..RUNIC CROSS PUNCTUATION
17D4..17D6 ; Terminal_Punctuation # Po [3] KHMER SIGN KHAN..KHMER SIGN CAMNUC PII KUUH
17DA ; Terminal_Punctuation # Po KHMER SIGN KOOMUUT
1802..1805 ; Terminal_Punctuation # Po [4] MONGOLIAN COMMA..MONGOLIAN FOUR DOTS
1808..1809 ; Terminal_Punctuation # Po [2] MONGOLIAN MANCHU COMMA..MONGOLIAN MANCHU FULL STOP
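As a rough illustration of how the Term and ATerm classes could be derived from lists in this UCD-style notation, the Python sketch below parses `code ; property # comment` lines into sets of code points. The function name `parse_ucd_lines` and the variables `TP1`, `TP2`, `TERM`, and `ATERM` are hypothetical, chosen only for this example; the data lines are the first two groups from this memo, and the still-unclassified characters are left out.

```python
def parse_ucd_lines(lines):
    """Collect code points from 'XXXX[..YYYY] ; Property # comment' lines."""
    cps = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        field = data.split(";")[0].strip()     # 'XXXX' or 'XXXX..YYYY'
        lo, _, hi = field.partition("..")
        cps.update(range(int(lo, 16), int(hi or lo, 16) + 1))
    return cps

# Terminal_Punctuation1: definitely ends a sentence (per this memo).
TP1 = parse_ucd_lines("""
0021 ; Terminal_Punctuation # Po EXCLAMATION MARK
003F ; Terminal_Punctuation # Po QUESTION MARK
037E ; Terminal_Punctuation # Po GREEK QUESTION MARK
061F ; Terminal_Punctuation # Po ARABIC QUESTION MARK
06D4 ; Terminal_Punctuation # Po ARABIC FULL STOP
203C..203D ; Terminal_Punctuation # Po [2] DOUBLE EXCLAMATION MARK..INTERROBANG
3002 ; Terminal_Punctuation # Po IDEOGRAPHIC FULL STOP
2048..2049 ; Terminal_Punctuation # Po [2] QUESTION EXCLAMATION MARK..EXCLAMATION QUESTION MARK
""".splitlines())

# Terminal_Punctuation2: abbreviation or sentence end (per this memo).
TP2 = parse_ucd_lines("""
002E ; Terminal_Punctuation # Po FULL STOP
0589 ; Terminal_Punctuation # Po ARMENIAN FULL STOP
3001 ; Terminal_Punctuation # Po IDEOGRAPHIC COMMA
""".splitlines())

TERM = TP1 | TP2    # Term  = Terminal_Punctuation1 OR Terminal_Punctuation2
ATERM = TP2         # ATerm = Terminal_Punctuation2
```

With these sets, a check such as `ord(".") in ATERM` or `ord("!") in TERM` gives the class membership used by the rules in the table, while a comma or a Devanagari danda falls in neither set.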