>Indeed. But surely this would not be taken as a single word
even in the absence of ZWNBSPs: it contains a non-alphabetic
character.
Let's go back to Mark's examples:
>Suppose that there is a natural word break between XY, and no
natural word break between YZ. Then here are the word counts:
XY: 2
YZ: 1
X<ZWSP>Y: 2
Y<ZWSP>Z: 2
X<ZWNBSP>Y: 1
Y<ZWNBSP>Z: 1
In the example I gave (taken from the Unicode book),
"base+delta", the pair "e+" falls into the category Mark gave
for XY. His conclusion was that adding ZWNBSP yields one word.
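Mark's table can be restated as a small Python sketch. The
natural_break predicate is a hypothetical stand-in for whatever rule
decides where breaks naturally fall; only the handling of the two
zero-width characters is taken from his table, and the function name
is mine.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE


def count_words(text, natural_break):
    """Count words the way Mark's table does.

    natural_break(a, b) is an assumed predicate: True if a word break
    would naturally fall between adjacent visible characters a and b.
    """
    words = 0
    prev = None      # last visible character seen
    override = None  # zero-width character seen since prev, if any
    for ch in text:
        if ch in (ZWSP, ZWNBSP):
            override = ch
            continue
        if prev is None:
            words = 1                 # first visible character starts a word
        elif override == ZWSP:
            words += 1                # ZWSP forces a break
        elif override == ZWNBSP:
            pass                      # ZWNBSP suppresses any break
        elif natural_break(prev, ch):
            words += 1                # otherwise the natural rule decides
        prev = ch
        override = None
    return words


# Mark's premise: a natural break between X and Y, none between Y and Z.
def example_break(a, b):
    return (a, b) == ("X", "Y")
```

With example_break, count_words gives 2 for "XY" and 1 for
"X" + ZWNBSP + "Y", which is the reading I was relying on for "e+".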
>Thus this example does not settle the issue of
whether "ap<ZWNBSP>ple" is one word or two.
I was responding to the statement that this is (they are) two.
>In particular, why create such a thing (other than by
accident)?
Perhaps to suppress line-breaking with hyphenation? In that
case "apple" remains one word, and the ZWNBSP is serving as a
mandatory non-hyphen.
I wouldn't normally be inclined to do so in this particular
spot, unless, along the lines you suggest, I were writing about
hyphenation, used "ap-ple" as an illustration, but specifically
did not want this to break across lines to avoid confusion as
to whether the hyphen were part of the example or an artifact
of the layout.
Actually, what I have in mind is hypothetical - I don't know if
this would ever arise, and I can't think of any specific
examples from Thai or another language that would qualify:
In the English string "Mr. Smith", I might prefer not to have a
line break between the words "Mr." and "Smith". Of course, we
have NBSP for that purpose. Consider this scenario, however: I
have a corpus of data for a language that, like Thai, is
written without visible spaces between all words, and I am
using ZWSP to delimit any word boundaries not delimited by SP,
PS, etc. I have, however, certain word pairs that, like "Mr.
Smith", I don't want to break across a line. It seemed obvious
that ZWNBSP is exactly what is needed.
In other words, ZWNBSP is to ZWSP what NBSP is to SP, but
useful mostly for writing systems where not all word boundaries
are overtly indicated with visible space.
Now, perhaps I'm making an assumption here, that
X<SP>Y
Y<SP>Z
X<NBSP>Y
Y<NBSP>Z
would all count as two words for selection and arrow key
movement, though the line breaking behaviour is, of course,
different. This assumption seems reasonable to me: Given the
string
Mr.<NBSP>Smith
I wouldn't want a double click on "M" to select the entire
string, I wouldn't want a keying of CTRL-RT_ARROW to jump over
the entire string, and I wouldn't want my spell checker to
treat it as one word.
If this assumption is valid here, then I would expect ZWSP and
ZWNBSP to relate to one another in exactly the same way that SP
and NBSP do, and I'd expect the same behaviours from the
former pair as from the latter pair, except that the
former have no width.
That's what I would have expected. Maybe there's a
well-developed understanding in the industry of how these
characters should behave, in which case I'd like to learn more
about that understanding and the basis for it. If not, then I'd
like to suggest this as a possible set of behaviours for these
characters.
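To make the suggestion concrete, here is a minimal Python sketch of
the behaviours I have in mind. It is only an illustration of my
proposal, not a description of what any standard or product does; the
function names and the two delimiter sets are my own assumptions.

```python
SP = "\u0020"      # SPACE
NBSP = "\u00a0"    # NO-BREAK SPACE
ZWSP = "\u200b"    # ZERO WIDTH SPACE
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# Assumption: all four characters delimit words for selection,
# cursor movement and spell checking.
WORD_DELIMITERS = {SP, NBSP, ZWSP, ZWNBSP}

# Assumption: only the breaking pair offers a line-break opportunity.
LINE_BREAK_DELIMITERS = {SP, ZWSP}


def split_on(text, delimiters):
    """Split text at any character in delimiters, dropping empty pieces."""
    pieces, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


def words(text):
    """Units for double-click selection, CTRL-arrow and spell checking."""
    return split_on(text, WORD_DELIMITERS)


def unbreakable_runs(text):
    """Units that must stay together on one line."""
    return split_on(text, LINE_BREAK_DELIMITERS)
```

Under these assumptions, words("Mr." + NBSP + "Smith") yields two
words while unbreakable_runs keeps the string in one piece, and
replacing NBSP with ZWNBSP gives the same two results.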
Peter