Whitespace characters in Unicode
lists+unicode at seantek.com
Sun Aug 7 18:08:58 CDT 2016
On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard
> <lists+unicode at seantek.com> wrote:
> > What makes a character a "whitespace" in Unicode, e.g., why are
> > ZWSP and ZWNBSP not "whitespace" even though they clearly say
> > "SPACE" in them?
> I think "white space" basically wants to have an advance width (occupy
> space) but no ink (all white, no black) :-)
Yes, that is the thought that I had as well: whitespace characters
always generate blank space between graphemes, whether horizontal or
vertical.
> ZWSP and ZWNBSP affect word and line breaking but have no advance width.
I suppose that these are "SPACE" characters, but not "WHITE space"
characters, since there is no white in them. :)
> Note that character names can be misleading, plain wrong, or even just
> misspelled, but they cannot be changed. Best to read the
> documentation. The charts are a good start:
> In particular, don't build sets of Unicode characters just based on
> character name patterns. Use character properties as much as possible.
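
To illustrate the point about names versus properties, here is a minimal Python sketch. It is only an illustration: Python's unicodedata module does not expose the White_Space property directly, so General_Category Zs is used as a stand-in, and the helper names (name_says_space, is_space_separator) are made up for this example.

```python
import unicodedata

def name_says_space(ch: str) -> bool:
    """Naive test: does the character name contain the word SPACE?"""
    return "SPACE" in unicodedata.name(ch, "").split()

def is_space_separator(ch: str) -> bool:
    """Property-based test: General_Category is Zs (Space_Separator).
    Note: Zs is a stand-in here, not the same set as White_Space."""
    return unicodedata.category(ch) == "Zs"

# SPACE, NO-BREAK SPACE, ZWSP, ZWNBSP, SYMBOL FOR SPACE
samples = ["\u0020", "\u00A0", "\u200B", "\uFEFF", "\u2420"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):<28} "
          f"name_says_space={name_says_space(ch)} "
          f"category={unicodedata.category(ch)}")
```

All five names contain the word SPACE, but only the first two are space separators by category; ZWSP and ZWNBSP are format characters and U+2420 is a symbol.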
> > What are "Unicode-y" ways to compute word boundaries?
> > Related to the prior question: I suppose ZWSP is not "whitespace", but
> > like whitespace, it separates words. I suppose that since it is
> > not printable, it is "confusing", and therefore should be avoided
> > in contexts where the printed representation of Unicode code
> > points matters.
> Depends on what you do.
> Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping
> and line breaking in a browser or text field/editor.
> They are not allowed in identifiers, and removed from domain names
> (UTS #46).
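
As one concrete instance of the identifier restriction, Python's own identifier check rejects both characters. This is a sketch of a single language's rule, not a statement about every language or about UTS #46 processing; the helper name looks_like_identifier is invented for the example.

```python
ZWSP = "\u200B"    # ZERO WIDTH SPACE
ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE

def looks_like_identifier(s: str) -> bool:
    """Apply Python's identifier rule to the string as given."""
    return s.isidentifier()

for label, ch in (("ZWSP", ZWSP), ("ZWNBSP", ZWNBSP)):
    candidate = f"foo{ch}bar"
    # The candidate prints like "foobar" but carries an invisible character.
    print(label, len(candidate), looks_like_identifier(candidate))

print("plain", len("foobar"), looks_like_identifier("foobar"))
```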
> > Why is Pattern_White_Space significantly disjoint from
> > White_Space, namely, why does Pattern_White_Space include LRM and
> > RLM (and notably LS and PS) yet omit the spaces U+1680 and in the
> > U+2000 range?
> We wanted a simple, immutable definition for rule and pattern strings
> that programmers write and maintain. We included LRM and RLM so that
> they can be used (and will be ignored) in rules, for example collation
> rule strings, to keep them moderately readable when they contain RTL
> characters. Typographic spaces are unnecessary in this context, and
> could be confusing.
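
The difference between the two properties can be made concrete with a small Python sketch. The code point lists below are transcribed by hand from my reading of PropList.txt and should be checked against the current file; White_Space has changed between Unicode versions, so treat that list as an assumption. The helper names expand and fmt are made up for this example.

```python
def expand(ranges):
    """Turn inclusive (low, high) code point pairs into a set."""
    return {cp for lo, hi in ranges for cp in range(lo, hi + 1)}

def fmt(cps):
    return " ".join(f"U+{cp:04X}" for cp in sorted(cps))

# Assumed contents; verify against PropList.txt before relying on them.
WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
    (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
    (0x3000, 0x3000),
])
PATTERN_WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F),  # LRM, RLM
    (0x2028, 0x2029),  # LS, PS
])

only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE
shared = WHITE_SPACE & PATTERN_WHITE_SPACE

print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:        ", fmt(only_white))
print("In both:                 ", fmt(shared))
```

Under these assumed lists, the only pattern-only characters are LRM and RLM, and everything missing from Pattern_White_Space is a typographic or no-break space.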
> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)
I like the premise of LS and PS: one (well, two) unambiguous characters
to rule them all. But the execution was lacking, to put it mildly. And
there aren't two keys on a common keyboard to distinguish between line
and paragraph separation.
On 8/6/2016 11:30 AM, Doug Ewell wrote:
> Additionally, in UTF-8, either LS or PS actually takes more bytes than
> CR plus LF, so the "increased text size" argument also discouraged use
> of the new controls.
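
The size comparison is easy to check. This Python sketch just encodes the characters and counts bytes; the UTF-16 column is added for contrast and is not part of the argument quoted above. The helper name encoded_len is invented for the example.

```python
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CARRIAGE RETURN + LINE FEED

def encoded_len(s: str, encoding: str) -> int:
    """Number of bytes the string occupies in the given encoding."""
    return len(s.encode(encoding))

for label, s in (("LS", LS), ("PS", PS), ("CR LF", CRLF)):
    print(f"{label:6} utf-8: {encoded_len(s, 'utf-8')} bytes, "
          f"utf-16-le: {encoded_len(s, 'utf-16-le')} bytes")
```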
That is true: each takes 3 bytes. However, the original UTF-8 proposal
encoded U+0080 - U+207F in two octets (see https://en.wikipedia.org/wiki/UTF-8).
So, the space block /just barely makes it/. Was this intentional during
the original design of UTF-8, or just a coincidence? I think it was more
than a coincidence. It is regrettable that the space block was too high
to work in the final version of UTF-8...maybe it should have gone below
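
The range comparison above can be sketched in Python. The original two-octet range is taken from the figure quoted in this message (U+0080 - U+207F); the final range (U+0080 - U+07FF) is UTF-8 as standardized. The list of sample code points is my own selection from the U+2000 block, not an exhaustive one.

```python
import unicodedata

ORIGINAL_TWO_OCTET = range(0x0080, 0x2080)  # U+0080..U+207F, as quoted above
FINAL_TWO_OCTET = range(0x0080, 0x0800)     # U+0080..U+07FF in final UTF-8

# A selection of spaces and separators from the U+2000 block.
samples = [0x2000, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F]

for cp in samples:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp)):<26} "
          f"original 2-octet: {cp in ORIGINAL_TWO_OCTET}, "
          f"final 2-octet: {cp in FINAL_TWO_OCTET}, "
          f"actual UTF-8 bytes: {len(chr(cp).encode('utf-8'))}")
```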
(More motivation for my whitespace question in following message...)