From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Fri Nov 30 2007 - 15:29:37 CST
Kenneth Whistler wrote:
> Please see the clarification under the "QU" section in
> the proposed update to UAX #14:
>
> http://www.unicode.org/reports/tr14/tr14-21.html
The idea is somewhat implicit there, but it seems to say that a line
break is not allowed between "foo" and (bar) because there might be "
(foo) ". Since a quotation mark could be an opening one, a break between
this and an opening parenthesis is not allowed even if a space
intervenes. How many problems could this rule prevent, and how many
problems does it cause?
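
To make the disputed rule concrete, here is a minimal Python sketch of the behavior as it is described above: a break after a run of spaces is suppressed when the character before the spaces is a quotation mark and the character after them is an opening parenthesis. The character sets and the function name are illustrative assumptions. They are not the real UAX #14 classes or pair table, which are much larger.

```python
import re

# Toy stand-ins for the quotation and opening-punctuation classes (assumed, incomplete).
QUOTES = {'"', "'", "\u00ab", "\u00bb"}
OPENERS = {"(", "[", "{"}

def space_break_opportunities(text, apply_quote_opener_rule):
    """Return the indices at which a line may start after a run of spaces.

    With apply_quote_opener_rule=True, a break is forbidden between a
    quotation mark and an opening bracket even if spaces intervene.
    """
    breaks = []
    for match in re.finditer(r" +", text):
        start, end = match.span()
        if start == 0 or end == len(text):
            continue  # leading or trailing spaces offer no useful break
        before, after = text[start - 1], text[end]
        if apply_quote_opener_rule and before in QUOTES and after in OPENERS:
            continue  # the rule under discussion: no break here
        breaks.append(end)
    return breaks

sample = 'He said "foo" (bar) today'
print(space_break_opportunities(sample, False))
print(space_break_opportunities(sample, True))
```

With the rule switched on, the break opportunity before "(bar)" disappears, which is exactly the case of a parenthetic remark following an inline quotation.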
I don't think I've ever seen " (foo) ". I might imagine such usage
(using guillemets) in French, but French quotation usage is problematic
in a much wider sense, and this artificial rule would address just
a minuscule part thereof. In contrast, a parenthetic remark after an
inline quotation is rather normal, and I have actually struggled with
them on some of my web pages. I would rather expect that people who
might write " (foo) " would use no-break spaces there, for example.
> Perhaps Asmus will wade in here with a fuller justification,
> but the consensus in the UTC has been that it is better to
> write out an explicit *default* line breaking specification
> that implementers can (and should) then tailor for specific
> situations and languages, rather than simply letting a thousand
> flowers bloom with no recommendations whatsoever -- which could
> only lead to more unexplained interoperability problems.
I don't think line breaking is really an interoperability problem. It's
something that is performed, or should be performed, when preparing
digital text data for visual presentation, for a particular medium, in a
particular situation. The result is presented to a user, rather than fed
as data into some program.
Programs that do such things vary greatly, even in their line breaking
behavior, ranging from very trivial to highly complicated, often
involving a hyphenator and some routines that optimize the division into
lines at a paragraph level. Thousands of flowers do bloom, and need to
bloom.
> Perhaps. But I think people may not be reading the UAX scoping
> carefully enough:
>
> "... This annex provides more detailed information about
> default line breaking behavior reflecting best practices for
> ^^^^^^^
> the support of multilingual texts."
> ^^^^^^^^^^^^
I hear you, but it seems that software developers don't. Unicode
line-breaking rules have been thrown into their routines, without
considering the consequences well enough. It's a bad idea to _start_
from those rules, rather than using them as fallback. And even as a
fallback, they are questionable.
I think a good criterion for their usefulness is how they work in a
program that only works by them, with no language sensitivity (and
consequently no hyphenation). I would say that at least for scripts that
normally use spaces between words, the results are generally _worse_
than those of a very simplistic algorithm that only breaks at spaces.
Some technical texts that make heavy use of special characters might
look like an exception, but I doubt that: the rules break at wrong
points all too often (and forbid some reasonable breaks).
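
For comparison, the "very simplistic algorithm that only breaks at spaces" could look like the following Python sketch. It is a greedy first-fit breaker under stated assumptions: width is counted in code points, only U+0020 is a break opportunity, and a word longer than the width is left to overflow. The function name is made up for this illustration.

```python
def break_at_spaces(text, width):
    """Greedily fill lines, breaking only at U+0020 SPACE."""
    lines = []
    current = ""
    # split(" ") splits on U+0020 only, so U+00A0 NO-BREAK SPACE
    # stays inside a "word" and never becomes a break point.
    for word in text.split(" "):
        if not word:
            continue  # collapse runs of spaces
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word  # may exceed width; no hyphenation here
    if current:
        lines.append(current)
    return lines

print(break_at_spaces('a "foo" (bar) b', 9))
print(break_at_spaces('a "foo"\u00a0(bar) b', 9))
```

The second call shows the no-break space doing the job that the quotation rule tries to do by default: the author who wants the quotation and the parenthesis kept together says so explicitly.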
In any case, the line breaking rules - except for the control characters
and a few other issues - would best be treated as distinct from the
Unicode Standard, since they belong to a different protocol level. And
they might benefit from simplification. I'd say they are too complex
because they try to handle things that cannot be adequately handled at
the general, language-independent level and without higher-level
protocol tools, such as formatting commands, markup, or program options.
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/