From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Dec 01 2007 - 02:45:25 CST
Asmus Freytag wrote:
> The *default* line breaking algorithm in UAX#14 tries to meet several
> constraints.
>
> 1) to be compatible with Kinsoku rules
> 2) to be language neutral
> 3) to be compatible with generic Western rules
I don't see why any of that requires odd-looking rules involving
punctuation marks and special characters, especially when they are in
conflict with item 3.
> The class QU means "either a closing or an opening quotation mark" and
> reflects the lack of knowledge about actual usage. (By the way, some
> languages use the *same* quotation mark as both opening and closing).
Indeed. And if we don't really know a character, we shouldn't mess around
with it.
General line breaking rules, to the extent they are needed and can be
meaningfully formulated, should be limited to allowing a break at any
space, allowing breaks in a string of script-specific characters if allowed
by the rules of that script, obeying explicit line break prohibitions
and permissions, and disallowing other breaks. In particular, a space
should be treated as breaking, since it is much more natural to treat
special cases (where a break is not permitted) using either no-break
space or higher-level protocol tools than to work against the artificial
line-break prohibitions.
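
A minimal sketch of the conservative rule set described above, under
stated assumptions: the function name and the choice of characters are
illustrative only. ZERO WIDTH SPACE stands in for an "explicit
permission" and WORD JOINER for an "explicit prohibition"; NO-BREAK
SPACE never yields a break simply because it is not in the permitting
set. Script-specific rules are left out.

```python
SPACE = "\u0020"
ZWSP = "\u200b"  # ZERO WIDTH SPACE: explicit break permission
WJ = "\u2060"    # WORD JOINER: explicit break prohibition

BREAK_AFTER = {SPACE, ZWSP}
NO_BREAK_BEFORE = {SPACE, WJ}  # keep runs of spaces together


def break_opportunities(text):
    """Return every index i where a break is allowed before text[i].

    Illustrative only: a break is allowed after a space or ZWSP,
    unless the next character is another space or a WORD JOINER.
    All other positions are non-breaking.
    """
    result = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev in BREAK_AFTER and cur not in NO_BREAK_BEFORE:
            result.append(i)
    return result


print(break_opportunities('say "hello world" -1'))  # [4, 11, 18]
print(break_opportunities("a\u00a0b c"))            # [4]
```

Under this sketch, punctuation and special characters need no rules of
their own: preventing a break at a space takes an explicit no-break
space, and nothing else is split.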
> If the character was an opening mark, you really don't want to have a
> line break after it. The enclosed quotation might start with a space.
When did you last see such a case?
It must be a _very_ rare situation. Why would you include a leading
space in a quotation? If you were thinking of French spacing, then it's a
special issue that needs special attention, not this kind of treatment in
a very rare case. (The French spacing after an opening quotation mark
should really be a narrow no-break space.)
> To fix this, an implementation needs to tailor the assignment of
> linebreak classes to supply additional information. In other words, if
> IE encounters a " and, by some rule not defined in UAX#14, decides that
> one of them is in fact an OP and the other is a CL then
... then cows will fly. It is unrealistic to expect that a sophisticated
linguistic analysis will be applied to make a decision in overriding a
line break prohibition. (It is not sufficient to know the language of the
text, and even the language is generally not known. Language markup is
currently rarely used and far too often plain wrong to be trusted. Doing
language-guessing on an entire document is feasible, though not very
reliable, but here it would have to be done at the phrase level.)
The line breaking rules often appear to be based on the consideration of
_some_ special cases (and perhaps _very_ special cases), where they
might help to avoid some problems. But the question is whether they
cause more trouble in other cases and whether the problems could be
solved in simpler ways.
> Using the untailored default algorithm is intended for situations where
> the necessary information is lacking that would allow an implementer to
> select a specific tailoring. Doing so, results in a better average
> performance (for global text) than implementing ASCII line break
> (break at space and hyphen only), which fails abysmally for
> non-European text.
The issue of breaking normal text - written using letters, syllabic
characters, or ideograms - is quite separate from the issue of
artificial rules that involve punctuation and special characters.
ASCII hyphen, i.e. HYPHEN-MINUS, shouldn't really be treated as allowing
a break, due to its semantic ambiguity and variation in usage. Breaking
after it might be allowed by language- or application-specific rules,
rather than being allowed by default and disallowed by special rules.
The _general_ rules should be simple and conservative, trying to
minimize bad breaks rather than to find as many break opportunities as
possible. When you disallow a break that could be allowed, you may get
suboptimal typography. When you allow a break that should not be allowed, you
may distort data, e.g. effectively changing "-1" to "- 1" or "directory
/foo" to "directory / foo". On the other hand, breaking at spaces should
not be restricted by the general rules, since it is reasonable to expect
that spaces are treated as breaking, so that special measures need to be
taken to prevent it.
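
To make the "-1" example concrete, here is a small sketch of a greedy
line wrapper that is parameterized by the set of characters after which
a break is allowed. The function names and the width are made up for
illustration; this is not how any particular browser or library wraps
text.

```python
def chunks(text, break_after):
    """Split text into unbreakable pieces, each ending at an allowed break."""
    out, cur = [], ""
    for ch in text:
        cur += ch
        if ch in break_after:
            out.append(cur)
            cur = ""
    if cur:
        out.append(cur)
    return out


def wrap(text, width, break_after):
    """Greedy wrap: start a new line when the next piece would not fit."""
    lines, line = [], ""
    for chunk in chunks(text, break_after):
        if line and len((line + chunk).rstrip(" ")) > width:
            lines.append(line.rstrip(" "))
            line = chunk
        else:
            line += chunk
    if line:
        lines.append(line.rstrip(" "))
    return lines


text = "the value is -1"
# Conservative: break at spaces only. "-1" stays in one piece.
print(wrap(text, 14, {" "}))       # ['the value is', '-1']
# Permissive: also break after HYPHEN-MINUS. The sign is cut off.
print(wrap(text, 14, {" ", "-"}))  # ['the value is -', '1']
```

The conservative policy gives a slightly shorter first line, which is
the "suboptimal typography" cost. The permissive policy leaves "-" at
the end of one line and "1" at the start of the next, which a reader
will take as "- 1".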
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/