Re: UAX #14: no line breaks between OP and QU, even if there are intervening spaces

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Fri Nov 30 2007 - 15:29:37 CST


    Kenneth Whistler wrote:

    > Please see the clarification under the "QU" section in
    > the proposed update to UAX #14:
    >
    > http://www.unicode.org/reports/tr14/tr14-21.html

    The idea is somewhat implicit there, but it seems to say that a line
    break is not allowed between "foo" and (bar) because there might be "
    (foo) ". Since a quotation mark could be an opening one, a break between
    this and an opening parenthesis is not allowed even if a space
    intervenes. How many problems could this rule prevent, and how many
    problems does it cause?
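
    To make the rule under discussion concrete, here is a minimal sketch of
    the check as I read it: no break before an opening parenthesis when a
    quotation mark precedes it, even with spaces in between. The character
    sets QU and OP are small stand-ins for the real line breaking classes,
    and the function name is made up for illustration; this is not a full
    implementation of the annex.

```python
# Simplified stand-ins for the line breaking classes (assumption: the real
# QU and OP classes in UAX #14 contain many more characters).
QU = set("\"'\u00ab\u00bb\u2018\u2019\u201c\u201d")
OP = set("([{")


def break_allowed_before(text: str, i: int) -> bool:
    """Return False if a break before text[i] is forbidden by the rule
    "no break between QU and OP, even across spaces".

    Only this one rule is modelled; every other position returns True.
    """
    if i <= 0 or i >= len(text) or text[i] not in OP:
        return True  # rule does not apply here
    j = i - 1
    while j >= 0 and text[j] == " ":
        j -= 1  # skip the intervening spaces
    return not (j >= 0 and text[j] in QU)


sample = '"foo" (bar)'
print(break_allowed_before(sample, sample.index("(")))  # forbidden by the rule
```

    Under this rule the quoted word and the parenthetic remark always stay
    on the same line, which is exactly the case I find problematic.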

    I don't think I've ever seen " (foo) ". I might imagine such usage
    (using guillemets) in French, but French quotation usage is problematic
    in a much wider sense, and this artificial rule would address just
    a minuscule part thereof. In contrast, a parenthetic remark after an
    inline quotation is rather normal, and I have actually struggled with
    them on some of my web pages. I would rather expect that people who
    might write " (foo) " would use no-break spaces there, for example.

    > Perhaps Asmus will wade in here with a fuller justification,
    > but the consensus in the UTC has been that it is better to
    > write out an explicit *default* line breaking specification
    > that implementers can (and should) then tailor for specific
    > situations and languages, rather than simply letting a thousand
    > flowers bloom with no recommendations whatsoever -- which could
    > only lead to more unexplained interoperability problems.

    I don't think line breaking is really an interoperability problem. It's
    something that is performed, or should be performed, when preparing
    digital text data for visual presentation, for a particular medium, in a
    particular situation. The result is presented to a user, rather than fed
    as data into some program.

    Programs that do such things vary greatly, even in their line breaking
    behavior, ranging from very trivial to highly complicated, often
    involving a hyphenator and some routines that optimize the division into
    lines at a paragraph level. Thousands of flowers do bloom, and need to
    bloom.

    > Perhaps. But I think people may not be reading the UAX scoping
    > carefully enough:
    >
    > "... This annex provides more detailed information about
    > default line breaking behavior reflecting best practices for
    > ^^^^^^^
    > the support of multilingual texts."
    >                ^^^^^^^^^^^^

    I hear you, but it seems that software developers don't. Unicode
    line-breaking rules have been thrown into their routines, without
    considering the consequences well enough. It's a bad idea to _start_
    from those rules, rather than using them as fallback. And even as a
    fallback, they are questionable.

    I think a good criterion for their usefulness is how they work in a
    program that only works by them, with no language sensitivity (and
    consequently no hyphenation). I would say that at least for scripts that
    normally use spaces between words, the results are generally _worse_
    than those of a very simplistic algorithm that only breaks at spaces.
    Some technical texts that make heavy use of special characters might
    look like an exception, but I doubt that: the rules break at wrong
    points all too often (and forbid some reasonable breaks).
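
    For comparison, the "very simplistic algorithm" I have in mind could be
    sketched as follows: greedy filling that breaks only at U+0020 SPACE.
    The function name and the width parameter are illustrative assumptions,
    not taken from any particular program.

```python
def wrap_at_spaces(text: str, width: int) -> list[str]:
    """Greedy line filling that breaks only at U+0020 SPACE.

    A word longer than `width` is left on a line of its own (no
    hyphenation). No-break spaces are not break opportunities, because
    only the plain space character is used as a separator.
    """
    lines: list[str] = []
    current = ""
    for word in text.split(" "):
        if not word:
            continue  # collapse runs of spaces
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


for line in wrap_at_spaces('He said "foo" (bar) today', 13):
    print(line)
```

    With a width of 13 this breaks between "foo" and (bar), the break that
    the default rule forbids; an author who wants to keep the two together
    can use a no-break space there.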

    In any case, the line breaking rules - except for the control characters
    and a few other issues - would best be treated as distinct from the
    Unicode Standard, since they belong to a different protocol level. And
    they might benefit from simplification. I'd say they are too complex
    because they try to handle things that cannot be adequately handled at
    the general, language-independent level and without higher-level
    protocol tools, such as formatting commands, markup, or program options.

    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/


