From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Nov 30 2007 - 16:42:24 CST
On 11/30/2007 8:47 AM, Jukka K. Korpela wrote:
> Arnt Richard Johansen wrote:
>
>
>> In UAX #14, rule LB15 states "Do not break within '"[', even with
>> intervening spaces." This is formalised as
>>
>> QU SP* × OP
>>
>> What is the rationale behind this rule?
>>
>
> Beats me. Whatever the rationale might be, the rule is harmful more
> often than useful. I'm afraid the line breaking rules as a whole just
> try too much: they define detailed rules for combinations, based on the
> consideration of some _possible_ scenarios where the combinations might
> appear.
>
>
This *default* rule produces sub-standard results for *English* because
the use of quotation marks is language dependent. As a result, you
cannot tell from the character code alone whether a quotation mark is
an opening or a closing quotation mark.
The *default* line breaking algorithm in UAX#14 tries to meet several
constraints:
1) to be compatible with Kinsoku rules
2) to be language neutral
3) to be compatible with generic Western rules
This occasionally requires some compromise, but even without that
aspect, different publishers, languages, etc. already show substantial
variation in the details. For all of those reasons, the algorithm is not
fixed, but specifically allows customization, so that each
implementation can (and should) be tailored to meet the specific needs
of its users.
Even so, the case under discussion here is a special one.
The class QU means "either a closing or an opening quotation mark" and
reflects the lack of knowledge about actual usage. (By the way, some
languages use the *same* quotation mark as both opening and closing).
If the character is an opening mark, you really don't want to have a
line break after it, even if the enclosed quotation starts with a space.
If the character is a closing mark, breaking (even without a space)
would be fine. The problem is that if all you know is "it's a QU", then
you don't know which it is.
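To make the loss of information concrete, here is a minimal sketch in
Python of a toy classifier. It is not the UAX#14 property lookup; the
function names and the six-class table are assumptions made only for
this illustration, and ")" is given class CL as in this message.

```python
# Toy stand-in for the UAX#14 line break property lookup (hypothetical,
# heavily reduced: only the classes used in this message are modelled).
def line_break_class(ch):
    if ch == '"':
        return "QU"   # ambiguous: may be an opening or a closing mark
    if ch == "(":
        return "OP"
    if ch == ")":
        return "CL"
    if ch == " ":
        return "SP"
    if ch.isdigit():
        return "NU"
    return "AL"       # simplification: everything else counts as alphabetic


def classify(text):
    """Map each character of text to its (toy) line break class."""
    return [line_break_class(ch) for ch in text]


print(" ".join(classify('"The Wire" (2005)')))
```

Both quotation marks come out as QU: the character code alone does not
say which one opens and which one closes.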
To fix this, an implementation needs to tailor the assignment of
line break classes to supply additional information. In other words, if
IE encounters a pair of " marks and, by some rule not defined in UAX#14,
decides that one of them is in fact an OP and the other is a CL, then
the line breaking of
"The Wire" (2005) turns from
QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL
to
OP AL AL AL SP AL AL AL AL CL SP OP NU NU NU NU CL
and the point of interest becomes CL SP OP, which allows a break between
the CL and the OP (after the space).
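The effect on the two class sequences above can be checked with a small
Python sketch. It models only a handful of rules (LB7, LB13, LB14, LB15,
LB18) and treats every other pair as non-breaking, which is a
simplification that happens to be enough for this example; the function
name is made up for the illustration.

```python
def break_opportunities(classes):
    """Return the indices i where a break is allowed just before classes[i]."""
    breaks = []
    for i in range(1, len(classes)):
        cur = classes[i]
        # Find the last non-space class before position i; this models
        # the "SP*" part of LB14 and LB15.
        j = i - 1
        while j > 0 and classes[j] == "SP":
            j -= 1
        prev = classes[j]
        if cur in ("SP", "CL"):
            continue              # LB7 / LB13: no break before SP or CL
        if prev == "OP":
            continue              # LB14: OP SP* ×
        if prev == "QU" and cur == "OP":
            continue              # LB15: QU SP* × OP
        if classes[i - 1] == "SP":
            breaks.append(i)      # LB18: SP ÷
        # Simplification: all remaining pairs are treated as non-breaking.
    return breaks


default = "QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL".split()
tailored = "OP AL AL AL SP AL AL AL AL CL SP OP NU NU NU NU CL".split()

print(break_opportunities(default))   # [5]
print(break_opportunities(tailored))  # [5, 11]
```

Index 5 is the "W" of "Wire" and index 11 is the "(". With the default
classes LB15 suppresses the break before the parenthesis; with the
tailored classes it becomes available.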
So, the issue is not with UAX#14, which has no way of knowing which
quotation marks are opening and which are closing in a given context,
but with the fact that the implementers did not provide *tailoring*.
Rule LB15 as written errs on the side of preventing a break. A tailoring
that takes the opposite approach, allowing a break in this case unless
it is definitely known that the QU is opening, is equally valid.
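As a rough illustration of such a tailoring, the Python sketch below
reassigns QU to OP or CL before the line break rules run. The heuristic
(look at the neighbouring classes) is purely hypothetical and
English-oriented; it is not defined anywhere in UAX#14. The `fallback`
parameter, also an assumption of this sketch, chooses what happens to
marks that stay ambiguous: "QU" keeps the default behaviour that
prevents the break, while "CL" gives the opposite, permissive tailoring.

```python
def resolve_quotes(classes, fallback="CL"):
    """Reassign QU to OP or CL using a hypothetical neighbour heuristic."""
    out = []
    n = len(classes)
    for i, cls in enumerate(classes):
        if cls != "QU":
            out.append(cls)
            continue
        before = classes[i - 1] if i > 0 else None
        after = classes[i + 1] if i + 1 < n else None
        # Looks opening: starts the text or follows SP/OP, and is
        # directly followed by content.
        opens = before in (None, "SP", "OP") and after not in (None, "SP")
        # Looks closing: ends the text or precedes SP/CL, and directly
        # follows content.
        closes = after in (None, "SP", "CL") and before not in (None, "SP")
        if opens and not closes:
            out.append("OP")
        elif closes and not opens:
            out.append("CL")
        else:
            out.append(fallback)  # still ambiguous
    return out


default = "QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL".split()
print(" ".join(resolve_quotes(default)))

# A mark with spaces on both sides stays ambiguous.
print(resolve_quotes(["AL", "SP", "QU", "SP", "OP"], fallback="QU"))
print(resolve_quotes(["AL", "SP", "QU", "SP", "OP"], fallback="CL"))
```

A real implementation would more likely use language information or
paired-quote tracking than this neighbour test.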
Jukka wrote:
"Line breaking rules are strongly language- and context-dependent, and
they shouldn't really be part of the Unicode Standard, except for some
very basic principles like the special controls for line break. The UAX
#14 rules are probably based on _some_ rational considerations but
oriented towards some largely unspecified situations. There is probably
a lot of language and context dependency hidden in them. And I don't [think] the
rules have generally been implemented, but they have _partly_ been
implemented in various programs"
This is throwing out the baby with the bathwater. First, the rules in
UAX#14 are not binding, except for the kind of special characters he
mentions. But despite all the variability, there is a lot of common
functionality, so it makes sense to publish a *default* algorithm.
Optimizing it for a particular use requires tailoring, and that is
explicitly allowed.
The untailored default algorithm is intended for situations where the
information that would allow an implementer to select a specific
tailoring is lacking. Using it results in better average performance
(for global text) than implementing ASCII line breaking (break at space
and hyphen only), which fails abysmally for non-European text.
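For comparison, ASCII line breaking as described here can be sketched
in a few lines of Python. The function name and the sample strings are
made up for this illustration.

```python
def ascii_break_opportunities(text):
    """Indices where a break is allowed: right after a space or hyphen."""
    # The last character is skipped: there is nothing to break before
    # at the end of the text.
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch in " -"]


print(ascii_break_opportunities("well-known rule"))      # [5, 11]
print(ascii_break_opportunities("日本語のテキストです"))  # []
```

The Japanese sample contains no spaces or hyphens, so this approach
finds no break opportunity at all and the whole string would have to
stay on one line.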
A./