From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat Nov 27 2004 - 15:48:21 CST
At 04:23 PM 11/26/2004, Peter Kirk wrote:
>As I understand it (and I asked for confirmation of this but have not
>received it), according to the current version of UAX #14 there is no
>break opportunity between SPACE and NBSP, because rule LB11b precedes rule
>LB12, although there is a note "Many existing implementations reverse the
>order of precedence between rules LB11b and LB12." There is a proposed
>update to UAX #14 which has the effect of reversing these rules (except
>for WJ). But until this change has been accepted and fully implemented,
>surely I need to use the ZWSP. Indeed, to be safe I will always need the
>ZWSP as I can never be sure that the update has been implemented.
This is a fine case of mis-applied conservatism.
The issue of relative *strength* of NBSP and SPACE predates Unicode, since
both characters are already available as part of 8859-1 and many other
character sets based on or equivalent to this standard.
The change that the UTC has approved for UAX#14 simply recognizes the fact
that this was not an open issue for Unicode to settle, but an issue long
settled by custom, with implementations found to favor what is now also the
officially recommended approach.
Getting the recommendation in line with existing practice is important to
allow users like you to rely on the behavior of certain specialized
characters, such as NBSP and SPACE, so that you don't need to try to add
ZWSP on suspicion.
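To make the precedence question concrete, here is a toy sketch in Python. It is not the UAX#14 algorithm: it models only simplified versions of the two rules under discussion (LB11b as "no break before or after NBSP", LB12 as "break after SPACE"), leaves out WJ and every other line break class, and the function names and the "no break" fallback are my own assumptions for illustration. The first rule in the list that has an opinion wins, so swapping the order changes the answer for SPACE followed by NBSP.

```python
# Toy model only: two simplified rules, not the full UAX#14 rule set.
SPACE = "\u0020"
NBSP = "\u00A0"


def rule_no_break_around_nbsp(before, after):
    """Simplified LB11b: forbid a break before or after NBSP."""
    if before == NBSP or after == NBSP:
        return False
    return None  # rule has no opinion on this pair


def rule_break_after_space(before, after):
    """Simplified LB12: allow a break after SPACE."""
    if before == SPACE:
        return True
    return None  # rule has no opinion on this pair


def can_break(before, after, rules):
    """Return the verdict of the first rule that applies to the pair."""
    for rule in rules:
        verdict = rule(before, after)
        if verdict is not None:
            return verdict
    return False  # assumed fallback for this toy; the real algorithm differs


# Order as described for the current text: LB11b before LB12.
OLD_ORDER = [rule_no_break_around_nbsp, rule_break_after_space]
# Order after the approved change: LB12 before LB11b.
NEW_ORDER = [rule_break_after_space, rule_no_break_around_nbsp]

print(can_break(SPACE, NBSP, OLD_ORDER))  # False: NBSP rule wins
print(can_break(SPACE, NBSP, NEW_ORDER))  # True: SPACE rule wins
```

In both orders NBSP followed by a letter, or a letter followed by NBSP, still gets no break in this toy; only the SPACE-then-NBSP pair is affected by the reordering.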
It's important to note that the specifications in UAX#14 are largely not
mandatory, nor can they be correct for all publishing styles, languages or
types of documents. They also completely punt on Southeast Asian scripts,
since those require a different type of algorithm.
They are intended as a pretty serviceable baseline, which, for many not so
demanding applications, could be implemented as-is, and which could serve
as a basis for further tailoring for more sophisticated implementations.
There simply is no portable way to guarantee exactly the same linebreak
behavior across implementations, across protocols and across markup
languages. Where such stability is required, say for legal documents, you
are limited to any of the protocols that express final form documents, such
as PDF.
If you just load up your text with ZWSP you run the risk of encountering an
implementation that does not support ZWSP at all, with potentially
interesting (and unintended) results. I believe the risks there are much
greater than the risk of relying on a break between SPACE and NBSP.
A./
PS: The revised text of UAX#14 will not be published until Unicode 4.1, but
the change to the rules has been endorsed by the UTC. While the UTC can
change its mind before publication, it could do so after publication as
well. This is different from assigning character codes, as you know.