Re: UAX #14: no line breaks between OP and QU, even if there are intervening spaces

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Nov 30 2007 - 16:42:24 CST


    On 11/30/2007 8:47 AM, Jukka K. Korpela wrote:
    > Arnt Richard Johansen wrote:
    >
    >
    >> In UAX #14, rule LB15 states "Do not break within '"[', even with
    >> intervening spaces." This is formalised as
    >>
    >> QU SP* × OP
    >>
    >> What is the rationale behind this rule?
    >>
    >
    > Beats me. Whatever the rationale might be, the rule is harmful more
    > often than useful. I'm afraid the line breaking rules as a whole just
    > try too much: they define detailed rules for combinations, based on the
    > consideration of some _possible_ scenarios where the combinations might
    > appear.
    >
    >
    The reason this *default* rule produces sub-standard results for
    *English* is that the use of quotation marks is language dependent.
    As a result, you cannot tell from the character code alone whether a
    quotation mark is an opening or a closing quotation mark.

    The *default* line breaking algorithm in UAX#14 tries to meet several
    constraints.

    1) to be compatible with Kinsoku rules
    2) to be language neutral
    3) to be compatible with generic Western rules

    This occasionally requires some compromise, but even without that
    aspect, different publishers, languages, etc. already show substantial
    variation in the details. For all of those reasons, the algorithm is not
    fixed, but specifically allows customization, so that each
    implementation can (and should) be tailored to meet the specific needs
    of its users.

    Even so, the case under discussion here is a special one.

    The class QU means "either a closing or an opening quotation mark" and
    reflects the lack of knowledge about actual usage. (By the way, some
    languages use the *same* quotation mark as both opening and closing).

    If the character is an opening mark, you really don't want to have a
    line break after it, since the enclosed quotation might start with a
    space. If the character is a closing mark, breaking after it (even
    without a space) would be fine. The problem is that if all you know is
    "it's a QU", then you don't know which it is.

    To fix this, an implementation needs to tailor the assignment of
    line break classes to supply additional information. In other words, if
    IE encounters a pair of " marks and, by some rule not defined in UAX#14,
    decides that one of them is in fact an OP and the other a CL, then the
    line breaking of

    "The Wire" (2005) turns from

    QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL

    to

    OP AL AL AL SP AL AL AL AL CL SP OP NU NU NU NU CL

    and the point of interest becomes CL SP OP which breaks just fine after
    the CL.
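
    As a rough sketch of that re-assignment: the code below classifies the
    example string and then resolves each QU to OP or CL. The alternating
    open/close heuristic, the tiny class table, and the function names are
    all assumptions made for illustration; none of them is defined in
    UAX#14, and a real implementation would use the full line break
    property data and a language-aware rule.

```python
# Hypothetical tailoring sketch; the names and the heuristic are illustrative only.
DEFAULT_CLASS = {'"': "QU", " ": "SP", "(": "OP", ")": "CL"}


def default_classes(text):
    """Assign a (heavily simplified) line break class to each character."""
    classes = []
    for ch in text:
        if ch in DEFAULT_CLASS:
            classes.append(DEFAULT_CLASS[ch])
        elif ch.isdigit():
            classes.append("NU")
        else:
            classes.append("AL")
    return classes


def resolve_quotes(classes):
    """Assumed heuristic: ambiguous quotes alternate opening, closing, ..."""
    resolved = []
    opening = True
    for cls in classes:
        if cls == "QU":
            resolved.append("OP" if opening else "CL")
            opening = not opening
        else:
            resolved.append(cls)
    return resolved


text = '"The Wire" (2005)'
before = default_classes(text)
after = resolve_quotes(before)
print(" ".join(before))  # QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL
print(" ".join(after))   # OP AL AL AL SP AL AL AL AL CL SP OP NU NU NU NU CL
```

    The alternating heuristic breaks down for nested or unbalanced quotes,
    which is exactly why the choice of rule is left to the implementation.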

    So, the issue is not with UAX#14, which has no way of knowing which
    quotation marks are opening and which are closing in a given context,
    but with the fact that the implementers did not provide *tailoring*.

    Rule LB15 as written errs on the side of preventing a break. A tailoring
    that takes the opposite approach, allowing breaks in this case unless
    QU is definitely known to be opening, is equally valid.
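
    To make the two choices concrete, here is a minimal sketch that models
    only two things: the LB15 pattern (QU SP* × OP) and the general "break
    after spaces" opportunity. The function name and the apply_lb15 switch
    are hypothetical, and every other rule of the algorithm (for example
    the one forbidding a break after OP) is deliberately left out.

```python
def break_allowed_before(classes, i, apply_lb15=True):
    """Return True if this sketch allows a break before classes[i].

    Only positions that directly follow a space are considered.
    """
    if i <= 0 or i >= len(classes):
        return False
    if classes[i] == "SP" or classes[i - 1] != "SP":
        return False
    # Walk back over the run of spaces to find the class that precedes it.
    j = i - 1
    while j >= 0 and classes[j] == "SP":
        j -= 1
    if apply_lb15 and j >= 0 and classes[j] == "QU" and classes[i] == "OP":
        return False  # LB15: QU SP* × OP
    return True  # ordinary break opportunity after spaces


untailored = "QU AL AL AL SP AL AL AL AL QU SP OP NU NU NU NU CL".split()
resolved = "OP AL AL AL SP AL AL AL AL CL SP OP NU NU NU NU CL".split()
paren = 11  # index of the OP for "(" in both sequences

print(break_allowed_before(untailored, paren))                    # False
print(break_allowed_before(resolved, paren))                      # True
print(break_allowed_before(untailored, paren, apply_lb15=False))  # True
```

    The last call corresponds to the opposite tailoring: the QU is left
    unresolved, but the break before the parenthesis is allowed anyway.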

    Jukka wrote:

    "Line breaking rules are strongly language- and context-dependent, and
    they shouldn't really be part of the Unicode Standard, except for some
    very basic principles like the special controls for line break. The UAX
    #14 rules are probably based on _some_ rational considerations but
    oriented towards some largely unspecified situations. There is probably
    a lot of language and context dependency hidden in them. And I don't think the
    rules have generally been implemented, but they have _partly_ been
    implemented in various programs"

    This is throwing out the baby with the bathwater. First, the rules in
    UAX#14 are not binding, except for the kind of special characters he
    mentions. But, despite all the variability, there is enough common
    functionality that it makes sense to publish a *default* algorithm.
    Optimizing it requires tailoring, and that is explicitly allowed.

    The untailored default algorithm is intended for situations where the
    information needed to select a specific tailoring is lacking. Using it
    results in better average performance (for global text) than
    implementing ASCII line breaking (break at space and hyphen only),
    which fails abysmally for non-European text.
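
    For comparison, a sketch of the ASCII-only approach. The function name
    and the sample strings are made up for illustration. It finds
    opportunities in English text, but none at all in Japanese text
    written without spaces, so such a line could never be wrapped.

```python
def ascii_break_offsets(text):
    """Offsets where a naive breaker may wrap: right after a space or hyphen."""
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch in " -"]


english = "well-known line breaking"
japanese = "これは日本語の文章です"

print(ascii_break_offsets(english))   # [5, 11, 16]
print(ascii_break_offsets(japanese))  # []
```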

    A./


