L2/19-042

 

Linebreaks inside numbers

Eric Muller, Amazon

January 10, 2019

 

 

From the responses to PRI 322, Proposed Update UAX #14, Unicode Line Breaking Algorithm:

Date/Time: Wed Apr 20 15:19:33 CDT 2016

Name: Andy Heninger

Report Type: Error Report

Opt Subject: UAX 14 feedback, PRI #322

The UAX-14 line breaking of numbers beginning with a decimal point can be bad. Consider the string "start .789 end".

With the default rules there will only be one break, "start .789 |end". Rule LB13, "x IS" will prevent a break before the number.

With the tailoring of numbers from example 7 of section 8.2 there will be an unexpected break after the full stop, yielding "start .|789 |end", because the regular expression for numbers does not allow a character of class IS to precede the first digit.

How this might be fixed will require some thought.

This problem was originally reported by Bernhard Fey in an ICU bug report, http://bugs.icu-project.org/trac/ticket/12017

First a piece of background: the class IS contains:

13 Code Points

Basic Latin — ASCII punctuation and symbols
 , 	U+002C	COMMA
 . 	U+002E	FULL STOP
 : 	U+003A	COLON
 ; 	U+003B	SEMICOLON
Greek And Coptic — Punctuation
 ; 	U+037E	GREEK QUESTION MARK
Armenian — Punctuation
 ։ 	U+0589	ARMENIAN FULL STOP
Arabic — Punctuation
 ، 	U+060C	ARABIC COMMA
 ‎؍‎ 	U+060D	ARABIC DATE SEPARATOR
NKo — Punctuation
 ߸ 	U+07F8	NKO COMMA
General Punctuation — General punctuation
 ⁄ 	U+2044	FRACTION SLASH
Vertical Forms — Glyphs for vertical variants
 ︐ 	U+FE10	PRESENTATION FORM FOR VERTICAL COMMA
 ︓ 	U+FE13	PRESENTATION FORM FOR VERTICAL COLON
 ︔ 	U+FE14	PRESENTATION FORM FOR VERTICAL SEMICOLON
 

There are two separate issues here.

The first is the presence of an undesirable break opportunity between the "." and the digit "7". This break opportunity does not exist with LB25 but does exist with its tailoring. The issue is that in the tailoring, numbers must start with a digit: /NU (NU | SY | IS)*/. This can be fixed by recognizing numbers as any sequence of digits with SY or IS before, in between or after, i.e. replacing the fragment above with: /(SY | IS)* NU (NU | SY | IS)*/.
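The effect of the proposed fragment can be illustrated with a small sketch. This is not an implementation of UAX #14: the class mapping below is a hypothetical, heavily simplified one (ASCII digits as NU, four ASCII characters as IS, "/" as SY, everything else lumped together), and the names `classes`, `TAILORED`, `PROPOSED` and `number_spans` are made up for the illustration. Only the two fragments discussed above are modeled, not the full regular expression of example 7.

```python
import re

def classes(text):
    """Map each character to a one-letter stand-in for its line break class.

    Assumption: a toy mapping covering only what this example needs.
    N = NU, I = IS, S = SY, x = anything else.
    """
    out = []
    for ch in text:
        if ch in "0123456789":
            out.append("N")
        elif ch in ",.:;":
            out.append("I")
        elif ch == "/":
            out.append("S")
        else:
            out.append("x")
    return "".join(out)

# Fragment in the current tailoring: NU (NU | SY | IS)*
TAILORED = re.compile(r"N[NSI]*")
# Fragment proposed above: (SY | IS)* NU (NU | SY | IS)*
PROPOSED = re.compile(r"[SI]*N[NSI]*")

def number_spans(text, pattern):
    """Return the substrings of text that the fragment recognizes as numbers."""
    # The class string has one letter per input character, so match
    # offsets can be applied directly to the original text.
    return [text[m.start():m.end()] for m in pattern.finditer(classes(text))]

print(number_spans("start .789 end", TAILORED))  # ['789']
print(number_spans("start .789 end", PROPOSED))  # ['.789']
```

With the current fragment the "." falls outside the recognized number, which is what leaves a break opportunity between "." and "7"; with the proposed fragment the "." is part of the number.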

The second is the absence of a break opportunity before the ".", caused by this part of LB13: /× IS/.

Removing that part of LB13 altogether is probably too drastic: in particular, the class IS contains U+003A COLON and U+003B SEMICOLON, and disallowing breaks in "foo :" is desirable (at least for French).
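The trade-off can be seen in a toy model. The sketch below is an assumption-laden simplification, not UAX #14: it models only "break after spaces" together with the /× IS/ part of LB13, uses just the four ASCII members of IS, and the function name `mark_breaks` and its `keep_x_is` flag are hypothetical.

```python
IS_CHARS = set(",.:;")  # ASCII subset of class IS, enough for this example

def mark_breaks(text, keep_x_is=True):
    """Insert '|' at break opportunities under two simplified rules:
    a break is allowed after a space, unless keep_x_is is set and the
    next character is in IS (the "x IS" part of LB13)."""
    out = []
    for i, ch in enumerate(text):
        after_space = i > 0 and text[i - 1] == " " and ch != " "
        if after_space and not (keep_x_is and ch in IS_CHARS):
            out.append("|")
        out.append(ch)
    return "".join(out)

print(mark_breaks("start .789 end"))                   # start .789 |end
print(mark_breaks("start .789 end", keep_x_is=False))  # start |.789 |end
print(mark_breaks("foo :", keep_x_is=False))           # foo |:
```

Dropping the rule yields the desired break before ".789", but it also allows a break before the colon in "foo :".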

Another possibility is to disallow breaks before some subset of the IS class, i.e. to split IS in two, with "." in one subset and ":" and ";" in the other.

Another possibility is to deal with numbers before LB13. A period is ambiguous: it can be part of a number or, more generally, punctuation. The idea is to deal with IS in the specific context of a number first, then to deal with IS in the more general case. The organization would have to be something like:

All three solutions listed here have rather deep implications, and the best course of action is not straightforward (not to mention that there could be other solutions).

I would like to stress that the first problem, being a false positive (incorrect break opportunity), is much more important to solve than the second problem, which is a false negative (undetected break opportunity). Also, the first problem seems to have a simple solution. I would like to encourage the UTC to take action on the first problem, even if no action can be decided for the second one.