From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Fri Nov 30 2007 - 15:29:37 CST
Kenneth Whistler wrote:
> Please see the clarification under the "QU" section in
> the proposed update to UAX #14:
>
> http://www.unicode.org/reports/tr14/tr14-21.html
The idea is somewhat implicit there, but it seems to say that a line 
break is not allowed between "foo" and (bar) because there might be " 
(foo) ". Since a quotation mark could be an opening one, a break between 
this and an opening parenthesis is not allowed even if a space 
intervenes. How many problems could this rule prevent, and how many 
problems does it cause?
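
To make the rule under discussion concrete, here is a rough Python sketch of a line breaker that offers breaks after spaces but applies the one prohibition described above (written "QU SP* × OP" in UAX #14). The character sets and the function names `break_after_space_allowed` and `break_opportunities` are assumptions made for illustration; a real implementation would use the full line breaking class data and the complete rule set.

```python
# Toy illustration only: two hand-picked character sets stand in for the
# UAX #14 classes QU (quotation marks) and OP (opening punctuation).
QU = set('"\'\u00ab\u00bb\u2018\u2019\u201c\u201d')
OP = set('([{')


def break_after_space_allowed(text, i):
    """Decide whether a break is allowed after the space at index i.

    Assumes text[i] is a space and text[i + 1] is a non-space character.
    """
    j = i
    # Skip back over the whole run of spaces to the preceding character.
    while j >= 0 and text[j] == ' ':
        j -= 1
    prev = text[j] if j >= 0 else ''
    nxt = text[i + 1]
    # QU SP* x OP: no break between a quotation mark and an opening
    # punctuation mark, even if spaces intervene.
    return not (prev in QU and nxt in OP)


def break_opportunities(text):
    """Return the indices at which a new line may start."""
    return [i + 1 for i in range(len(text) - 1)
            if text[i] == ' ' and text[i + 1] != ' '
            and break_after_space_allowed(text, i)]
```

With this sketch, `break_opportunities('see "foo" (bar) here')` offers a break before the quotation and before `here`, but none before `(bar)`, which is the behavior questioned above.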
I don't think I've ever seen " (foo) ". I might imagine such usage 
(using guillemets) in French, but French quotation usage is problematic 
in a much wider sense, and this artificial rule would address just 
a minuscule part thereof. In contrast, a parenthetic remark after an 
inline quotation is rather normal, and I have actually struggled with 
them on some of my web pages. I would rather expect that people who 
might write " (foo) " would use no-break spaces there, for example.
> Perhaps Asmus will wade in here with a fuller justification,
> but the consensus in the UTC has been that it is better to
> write out an explicit *default* line breaking specification
> that implementers can (and should) then tailor for specific
> situations and languages, rather than simply letting a thousand
> flowers bloom with no recommendations whatsoever -- which could
> only lead to more unexplained interoperability problems.
I don't think line breaking is really an interoperability problem. It's 
something that is performed, or should be performed, when preparing 
digital text data for visual presentation, for a particular medium, in a 
particular situation. The result is presented to a user, rather than fed 
as data into some program.
Programs that do such things vary greatly, even in their line breaking 
behavior, ranging from very trivial to highly complicated, often 
involving a hyphenator and some routines that optimize the division into 
lines at a paragraph level. Thousands of flowers do bloom, and need to 
bloom.
> Perhaps. But I think people may not be reading the UAX scoping
> carefully enough:
>
> "... This annex provides more detailed information about
> default line breaking behavior reflecting best practices for
> ^^^^^^^
> the support of multilingual texts."
>                ^^^^^^^^^^^^
I hear you, but it seems that software developers don't. Unicode 
line-breaking rules have been thrown into their routines, without 
considering the consequences well enough. It's a bad idea to _start_ 
from those rules, rather than using them as fallback. And even as a 
fallback, they are questionable.
I think a good criterion for their usefulness is how they work in a 
program that only works by them, with no language sensitivity (and 
consequently no hyphenation). I would say that at least for scripts that 
normally use spaces between words, the results are generally _worse_ 
than those of a very simplistic algorithm that only breaks at spaces. 
Some technical texts that make heavy use of special characters might 
look like an exception, but I doubt that: the rules break at wrong 
points all too often (and forbid some reasonable breaks).
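
For comparison, the "very simplistic algorithm that only breaks at spaces" could look roughly like the following Python sketch. The function name `wrap_at_spaces` and the greedy first-fit strategy are assumptions chosen for illustration; the message does not specify any particular algorithm, and width is counted in characters rather than in rendered width.

```python
def wrap_at_spaces(text, width):
    """Greedy wrap that breaks only at spaces.

    A word longer than `width` is kept whole on a line of its own,
    since no break is ever made inside a word.
    """
    lines = []
    current = ''
    for word in text.split():
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            # The word still fits on the current line after one space.
            current += ' ' + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
```

Unlike a breaker that applies the quotation mark rule, this one will happily split `"foo" (bar)` at the space when the line is too narrow for both parts.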
In any case, the line breaking rules - except for the control characters 
and a few other issues - would best be treated as distinct from the 
Unicode Standard, since they belong to a different protocol level. And 
they might benefit from simplification. I'd say they are too complex 
because they try to handle things that cannot be adequately handled at 
the general, language-independent level and without higher-level 
protocol tools, such as formatting commands, markup, or program options.
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 