Re: NNBSP

From: Marcel Schneider via Unicode <unicode_at_unicode.org>
Date: Fri, 18 Jan 2019 21:41:38 +0100

On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:
> On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:
>>>
>>> Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).
>>>
>> I’d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: “U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.”
>> Is that correct?
>
> May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.
>
> When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this).
>
> These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646 some characters from that effort, but not in the early Unicode drafts, were also added.
>
> The code points from U+2000..U+2008 are part of that early collection.
>
> Note, that prior to Unicode, no character set standard described in detail how characters were to be used (with exception, perhaps of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.
>
> Unicode's character property model changed all that - but that meant that properties for all of the characters had to be determined long after they were first encoded in the original sources, and with only scant hints of the identity of what these were intended to be. (Often, the only hint was a character name and a rather poor bitmapped image).
>
> If you want to know the "legacy" behavior for these characters, it is more useful, therefore, to see how they have been supported in existing software, and how they have been used in documents since then. That gives you a baseline for understanding whether any change or clarification of the properties of one of these code points will break "existing practice".
>
> Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de-facto useless, because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.
>
> However, if there's a consensus interpretation of a given character then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.
>
That, however, is what PRI #308 proposed: changing the Gc of NNBSP from Zs to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the *MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break, for example, those implementations relying on Gc=Zs to apply a background color to all (otherwise invisible) space characters.
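To illustrate the kind of implementation that depends on Gc=Zs, here is a minimal Python sketch. The function name mark_spaces and the bracketed marker format are made up for this example; only unicodedata.category comes from the standard library. It makes every space separator visible, much as an editor would when applying a background color:

```python
import unicodedata


def mark_spaces(text: str) -> str:
    """Replace every Gc=Zs character with a visible [U+XXXX] marker.

    Hypothetical helper: a real editor would apply a background color
    instead of substituting text, but the selection test is the same.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Zs":
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)


# NNBSP (U+202F) used as a group separator is caught because Gc=Zs.
print(mark_spaces("1\u202f000"))
```

If NNBSP were reclassified to another General_Category, a test like this one would silently stop matching it.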

On the occasion of that Public Review Issue, J. S. Choi reported another use case of NNBSP, between an integer and a vulgar fraction, and pointed out an error in TUS version 8.0 along the way: “the THIN SPACE does not prevent line breaking from occurring, which is required in style guides such as the Chicago Manual of Style”. In version 11.0 the erroneous part is still uncorrected: “If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.” Note that TUS has typeset this with the precomposed U+00BE, not with plain digits and fraction slash.
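As a small sketch of the two spellings at issue (the helper name mixed_number is invented for this example; the code points are the ones named above), the sequence with digits and FRACTION SLASH can be built and compared with the precomposed character like this:

```python
import unicodedata

NNBSP = "\u202f"           # NARROW NO-BREAK SPACE
THIN_SPACE = "\u2009"      # THIN SPACE, the separator named in the TUS example
FRACTION_SLASH = "\u2044"  # FRACTION SLASH


def mixed_number(whole: int, num: int, den: int, sep: str = NNBSP) -> str:
    """Build whole + separator + numerator + FRACTION SLASH + denominator."""
    return f"{whole}{sep}{num}{FRACTION_SLASH}{den}"


with_nnbsp = mixed_number(1, 3, 4)                 # spelling suggested by J. S. Choi
with_thin = mixed_number(1, 3, 4, sep=THIN_SPACE)  # spelling described in TUS

# The precomposed U+00BE has a compatibility decomposition to the
# digit + FRACTION SLASH + digit sequence, so NFKD recovers that spelling.
precomposed = "\u00be"
print(unicodedata.normalize("NFKD", precomposed) == "3\u20444")
```

Whether the digit sequence actually renders as a built fraction depends on the font and the rendering engine; the code only shows which characters are involved.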

If U+2008 PUNCTUATION SPACE is used as intended, changing its line break property from BA to GL does not break any implementation or document. As for possible misuse of the character in ways other than intended, there is generally no point in using as a breakable space a space that is actually just a thin variant of U+2007 FIGURE SPACE.
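The difference between the two spaces can be inspected with a short Python sketch. Python's standard library does not expose the Line_Break property, so this uses the <noBreak> compatibility decomposition tag from UnicodeData as a rough proxy; the helper name is_no_break is made up for the example, and the proxy is an assumption, not the property itself:

```python
import unicodedata


def is_no_break(ch: str) -> bool:
    """True if the character's decomposition mapping carries the <noBreak> tag.

    This is only a proxy for Line_Break=GL, which the stdlib cannot query.
    """
    return unicodedata.decomposition(ch).startswith("<noBreak>")


# U+2000..U+200A plus NNBSP (U+202F)
for cp in [*range(0x2000, 0x200B), 0x202F]:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch):<25} noBreak={is_no_break(ch)}")
```

In this range only FIGURE SPACE and NARROW NO-BREAK SPACE carry the tag; PUNCTUATION SPACE does not.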

Hence the question, again: Why was PUNCTUATION SPACE not declared as non-breakable?

Marcel

That sample also raises concern, as it showcases how much is done or not done, as appropriate, to keep NNBSP out of use in Latin script. To what avail?
Received on Fri Jan 18 2019 - 14:42:01 CST
