L2/07-028
From: Asmus Freytag
Date: 2007-01-26
Re: Proposal for an update to the line break algorithm in UAX#14
Proposal for an update to the line break algorithm in UAX#14
for discussion at UTC#110.
Proposal
Change the line break rule for nonbreaking characters.
Existing:
LB12 Do not break before or after NBSP and related
characters.
[^SP] × GL
GL ×
Proposed:
LB12a Do not break after NBSP and related
characters.
GL ×
LB12b Do not break before NBSP and related
characters.
[^SP, BA] × GL
Additionally, move rule 12b from the non-tailorable
part of the line break rules to the tailorable part of the line break
rules.
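The difference between the existing and proposed rules can be sketched as two predicates over pairs of line break classes. This is only an illustration, not part of the proposal: the function names are hypothetical, classes are passed as plain strings such as "SP", "BA", "GL" and "AL", and a result of False means only that this rule does not apply, so other rules of the algorithm still decide whether a break is actually allowed.

```python
def existing_lb12_prohibits(before: str, after: str) -> bool:
    """Existing LB12: GL ×  and  [^SP] × GL."""
    if before == "GL":
        return True               # no break after GL
    if after == "GL" and before != "SP":
        return True               # no break before GL, except after a space
    return False                  # rule does not apply


def proposed_lb12_prohibits(before: str, after: str) -> bool:
    """Proposed LB12a (GL ×) and LB12b ([^SP, BA] × GL)."""
    if before == "GL":
        return True               # LB12a: no break after GL
    if after == "GL" and before not in ("SP", "BA"):
        return True               # LB12b: no break before GL, except after SP or BA
    return False                  # rule does not apply


if __name__ == "__main__":
    for pair in [("BA", "GL"), ("SP", "GL"), ("AL", "GL"), ("GL", "AL")]:
        print(pair, existing_lb12_prohibits(*pair), proposed_lb12_prohibits(*pair))
```

The only pair on which the two predicates differ is a BA followed by a GL, for example SHY followed by NBHY.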
Rationale
Making this limited change will allow existing, long-standing practice
to be conformant with UAX#14. Making these implementations non-compliant,
even with tailoring, was never the intent.
While the proposal allows hyphens (class BA) to override the effect of
a following GL (non-breaking) character, it retains the concept that
the class GL represents characters with important, normative
properties, i.e. that of being non-breaking.
Because allowing a break after hyphen, SHY, etc. in front of NBHY etc.
is useful for some languages in its own right, the proposal also
recommends that the default rule be changed to recognize not only SP
but also BA as overriding the non-breaking nature of a following GL
character. (See Background).
WJ can be used in a context <BA, WJ, GL> where true non-breaking
behavior following a BA is required. Additionally, moving rule 12b to
the tailorable part of the rules allows implementations to adjust this
behavior further. It also allows Unicode 5.0.0 compliant
implementations to retain compliance by declaring a tailoring
that doesn't require changes in their code.
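The effect of inserting WJ can be sketched by checking each boundary in a sequence of line break classes against the proposed rules 12a and 12b together with the word joiner rule of UAX#14 (do not break before or after WJ). The function name is hypothetical and the sketch ignores all other rules, so False means only that none of these three rules prohibits the break.

```python
from typing import List


def prohibited_boundaries(classes: List[str]) -> List[bool]:
    """For each boundary between adjacent classes, report whether the
    WJ rule or proposed LB12a/LB12b prohibits a break there."""
    result = []
    for before, after in zip(classes, classes[1:]):
        if before == "WJ" or after == "WJ":
            result.append(True)    # × WJ and WJ ×
        elif before == "GL":
            result.append(True)    # LB12a: GL ×
        elif after == "GL" and before not in ("SP", "BA"):
            result.append(True)    # LB12b: [^SP, BA] × GL
        else:
            result.append(False)   # none of these rules applies
    return result


if __name__ == "__main__":
    print(prohibited_boundaries(["AL", "BA", "GL", "AL"]))
    print(prohibited_boundaries(["AL", "BA", "WJ", "GL", "AL"]))
```

In the first sequence the boundary between BA and GL is left open by these rules; in the second, the WJ closes it on both sides.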
No changes in assigned properties are proposed.
Background
There are linebreaking conventions that
modify the appearance of a line break when the line break opportunity
is based on an explicit hyphen. In Polish, explicit hyphens are always
promoted to the next line if a line break occurs at that location in
the text. For example, given the sentence "Tam wisi
czerwono-niebieska flaga" ("There hangs a red-blue flag"), if the optimal
line break occurs at the location of the explicit hyphen, an additional
hyphen is displayed at the beginning of the next line, like this:
Tam wisi czerwono-
-niebieska flaga.
The same convention is used in Portuguese,
where the use of hyphens is common, because it is mandatory for verb
forms that include a pronoun. There are examples where homographs or
ambiguity may arise if hyphens are treated incorrectly: "disparate"
means "folly" while "dispara-te" means "fire yourself" (or "fires onto
you"). Therefore the former needs to be line broken as
dispara-
te
and the latter as
dispara-
-te.
The practice of typing <SHY, NBHY>
instead of <HYPHEN> to achieve promotion of the hyphen to the
next line is reportedly common and is supported by several major text
layout applications and at least one major browser.
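The visible result of the <SHY, NBHY> practice can be sketched as follows. This is an illustration only: the function name is hypothetical, it handles a single <SHY, NBHY> pair, and it displays both hyphens as ASCII "-" for readability rather than modeling real glyph selection.

```python
from typing import List

SHY = "\u00AD"    # SOFT HYPHEN, visible only when a break is taken at it
NBHY = "\u2011"   # NON-BREAKING HYPHEN, always visible


def render_lines(text: str, break_at_shy: bool) -> List[str]:
    """Render text containing one <SHY, NBHY> pair, with or without
    a line break taken after the SHY."""
    head, pair, tail = text.partition(SHY + NBHY)
    if not pair:
        return [text]                     # no <SHY, NBHY> pair present
    if break_at_shy:
        # SHY shows as a hyphen at line end; NBHY opens the next line.
        return [head + "-", "-" + tail]
    # No break: SHY stays invisible, NBHY shows as the single hyphen.
    return [head + "-" + tail]


if __name__ == "__main__":
    word = "czerwono" + SHY + NBHY + "niebieska"
    print(render_lines(word, break_at_shy=False))
    print(render_lines(word, break_at_shy=True))
```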
However, this is not supported by
the algorithm as specified in version 5.0.0 of UAX#14, and tailoring of
the properties of NBHY is not permitted.
The software investigated also
supports breaking in the cases of <HYPHEN, NBSP>, <HYPHEN,
NBHY> etc.; therefore this behavior is not limited to contexts
involving SHY and cannot be addressed by a more narrowly tailored
proposal.