Re: NNBSP

From: Marcel Schneider via Unicode <unicode_at_unicode.org>
Date: Sat, 19 Jan 2019 08:34:13 +0100

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:
> On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:
>> On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:
>>>
>>> Marcel,
>>>
>>> about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.
>>>
>> Other List Members are welcome to join in, many of whom are aware of how things happened. My questions are meant to be rather simple. Summing up the premium ones:
>>
>> 1. Why does UTC ignore the need of a non-breakable thin space?
>> 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?
>>
>> A less important piece of information would be how extensively typewriters with proportional advance width were used to write books ready for print.
>>
>> Another question you do answer below:
>>
>>> French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript.
>>>
>>> When it comes to figures, there are two different types of spaces.
>>>
>>> One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.
>>>
>> That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout works the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically]).
>
> ==> At the time Unicode was first created (and definitely before that, during the time of non-universal character sets) many applications existed that used a "typewriter model" and worked by space fill rather than decimal-point tabulation.
>
If you are talking about applications, as opposed to typesetting tables for book printing, then I’d suggest that the fixed-width display of tables could be done much like source code layout is still done today, where normal space is used for that purpose. In this use case, line wrap is typically turned off. That could make non-breakable spaces sort of pointless (but I’m aware of your point below), except if people are expected to re-use the data in other environments. In that case, best practice is to use NNBSP as thousands separator while displaying it like other monospace characters. That’s at least how today’s monospace fonts work (provided they’re used in environments actually supporting Unicode, which may not be the case with applications running in a terminal).
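
To illustrate what using NNBSP as thousands separator means in stored data, here is a minimal Python sketch. It is not taken from CLDR or ICU; the helper name `group_digits` is hypothetical and only serves the example:

```python
NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE


def group_digits(value: int, separator: str = NNBSP) -> str:
    """Group the digits of an integer in threes, using `separator`."""
    # The "," format option inserts a comma every three digits regardless
    # of the process locale; the commas are then swapped for the separator.
    return f"{value:,}".replace(",", separator)


# repr() makes the otherwise invisible separator code point visible.
print(repr(group_digits(1234567)))
```

A monospace font would display the separator at full cell width, while the data itself stays reusable in proportional layouts.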
>
> From today's perspective that older model is inflexible and not the best approach, but it is impossible to say how long this legacy approach hung on in some places and how much data might exist that relied on certain long-standing behaviors of these space characters.
>
My position for some time has been that legacy apps should use legacy libraries. But I’ll come back to this when responding to Shawn Steele.
>
> For a good solution, you always need to understand
>
> (1) the requirement of your "index" case (French, in this case)
>
That’s okay.
>
> (2) how it relates to similar requirements in (all!) other languages / scripts
>
That’s rather up to CLDR as I suggested, given it has the means to submit a point to all vetters. See again below (in the part that you’ve cut off without consideration).
>
> (3) how it relates to actual legacy practice
>
That’s Shawn Steele’s point (see next reply).
>
> (3a) what will suddenly no longer work if you change the properties on some character
>
> (3b) what older data will no longer work if the effective behavior of newer applications changes
>
I’ll note already that this requires awareness of actual use cases and/or delving into the OSes, which is far beyond what I can currently do, in terms of both time and resources. The vetter’s role is to inform CLDR with correct data from their locale. CLDR is then welcome to sort things out and to get in touch with the industry, which CLDR TC is actually doing. But that has no impact on the data submitted at survey time. Changing votes to tell “OK let the group separator be NBSP as long as…” would be a lie.
>
>>> In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.
>>>
>> It does not, provided that all numbers have thousands separators, even if filling with spaces. It looks nicer because it’s more legible.
>>>
>>> Now for numbers in running text using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines.
>>>
>> Right.
>>>
>>> The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention).
>>>
>> You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistently with many other choices, the solution would have been to encode them both as non-breakable, all the more so as both were at hand, leaving the choice to the end-user.
>
> ==> Right, but remember, we started off encoding a set of spaces that existed before Unicode (in some other character sets) and implicitly made the assumption that those were the correct set (just like we took punctuation from ASCII and similar sources and only added to it later, when we understood that they were missing things --- generally always added, generally did not redefine behavior or shape of existing code points).
>
Now I understand that what UAX #14 calls “the preferred space for use in numbers” is actually preferred in the table layout you are referring to, because it is easier to code when only the empty decimal separator position uses PUNCTUATION SPACE, while grouping is performed with FIGURE SPACE.

That raises two questions, one of which has often been asked in this thread:

 1. How is FIGURE SPACE supposed to be supported in legacy environments? (UAX #14 mentions both its line breaking behavior and its width, but makes no concessions for legacy apps…)
 2. Why was PUNCTUATION SPACE not declared non-breakable? (If it had been, it could have been repurposed to space off French punctuation since the beginning of Unicode, and French users would never have had a reason to be upset by the lack of a narrow non-breaking space.)
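
For readers who want to see the asymmetry between these spaces, here is a small Python sketch. Python's standard library does not expose the UAX #14 Line_Break property, so the sketch uses the `<noBreak>` compatibility decomposition tag from UnicodeData.txt as a proxy; treating that tag as equivalent to non-breaking behavior is an assumption of this example, and the helper name `is_no_break` is hypothetical:

```python
import unicodedata


def is_no_break(ch: str) -> bool:
    """True if the character's decomposition mapping is tagged <noBreak>."""
    # unicodedata.decomposition() returns e.g. "<noBreak> 0020" or
    # "<compat> 0020"; it does not return the Line_Break class itself.
    return unicodedata.decomposition(ch).startswith("<noBreak>")


for ch in ("\u00a0", "\u2007", "\u2008", "\u202f"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: noBreak={is_no_break(ch)}")
```

In this data, FIGURE SPACE and NARROW NO-BREAK SPACE carry the `<noBreak>` tag, while PUNCTUATION SPACE carries `<compat>`.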

>>
>> Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?
>
> ==> probably not in the early days. Y
>
Perhaps it was ignored from the beginning, just as, according to Philippe Verdy, UTC ignored later demands, upsetting users.
That leaves us with the question of why it did so, following your statement that the reason was not what I ended up suspecting.

Does "Y" stand for the peace symbol?
>
>>
>> ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure, not in book print.
>
> ==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.
>
> For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.
>
Did you also see typewriters with proportional advance width (and interchangeable type wheels)? That was the high end on the typewriter market. (Already mentioned these typewriters in a previous e‑mail.) Books typeset this way could use bold and (less easy) italic spans.
>
>>> If you want to do the right thing you need:
>>>
>>> (1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time.
>>>
>> That is how CLDR works.
>
> CLDR data is by definition per-language. Except for inheritance, languages are independent.
>
> There are no "French" characters. When you encode characters, at best, some code points may be script-specific. For punctuation and spaces not even that may be the case. Therefore, as long as you try to solve this as if it *only* was a French problem, you are not doing proper character encoding.
>
Again, I did not do that (and BTW CLDR is not doing “character encoding”). Actually, to be able to post that blame you needed to cut off all the URLs I provided you with. These links document that I did not “try to solve this as if it only was a French problem[.]”

Here they are again, this time with copy-pasted snippets below.
I wrote: “But as soon as that was set up, I started lobbying for support of all relevant locales at once:”

https://unicode.org/cldr/trac/ticket/11423
https://unicode.org/pipermail/cldr-users/2018-September/000842.html

  * “To be cost-effective, locales using space as numbers group separator should migrate at once from the wrong U+00A0 to the correct U+202F. I didn’t aim at making French stand out, but at correcting an error in CLDR. Having even the Canadian French sublocale stick with the wrong value makes no sense and is mainly due to opaque inheritance relationships and to severe constraints on vetters applying for fr-FR and subsequently reduced to look on helpless from the sidelines when sublocales are not getting fixed.”

  * “After having painstakingly caught up on support of some narrow fixed-width no-break space (U+202F), the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is way more cost-effective than migrating one locale this time, another locale next time, a handful of locales the time after, possibly splitting them up in sublocales with different migration schedules. I really believed that now Unicode proves ready to adopt the real group separator in French, all relevant locales would be consistently pushed for correcting that value in release 34. The v34 alpha overview makes clear they are not.
    http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

    I aimed at correcting an error in CLDR, not at making French stand out. Having many locales and sublocales stick with the wrong value makes no sense any more.
    https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

    The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up, then doing it for all at once.

    There seems to be a misunderstanding: The *locale setting* is whether to use period, comma, space, apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is *not a locale setting,* but it’s all about Unicode *design* and Unicode *implementation.* I really thought that that was clear and that there’s no need to heavily insist on the ST "French" forum. When referring to the "French thousands separator" I only meant that unlike comma- or period-using locales, the French locale uses space and that the group separator space should be the correct one. That did *not* mean that French should use *another* space than the other locales using space.”

https://unicode.org/pipermail/cldr-users/2018-September/000843.html
and
https://unicode.org/cldr/trac/ticket/11423#comment:2

  * “I've to confess that I did focus on French and only applied for fr-FR, but there was a lot of work, see
    http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth
    waiting for very few vetters. Nevertheless I also cared for English (see various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA hadn’t caught up the group separator correction yet:
    https://unicode.org/pipermail/cldr-users/2018-August/000825.html

    Also I’m sorry for failing to provide appropriate feedback after beta release and to post upstream messages urging to make sure all locales using space for group separator be kept in synchrony.

    I think the point about not splitting up all the data into locales is a very good one.

    There should be a common pool so that all locales using Arabic script have automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually fits all), and those locales using space should only need to specify "space" to automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready to give it currency in that role.”

Do these recommendations meet your requirements and sound okay to you?
>>>
>>> Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including different acceptability of fallback renderings...).
>>>
>> No I don’t but people may wish to read German Wikipedia:
>>
>> https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen
>>
>> Shared in ticket #11423:
>> https://unicode.org/cldr/trac/ticket/11423#comment:15
>
>
> ==> for your proposal to be effective, you need to reach out.
>
Basically we vetters are just reporting the locale data. Beyond that, I’ve already invested a huge effort in reporting bugs in English data and in communicating on lists and fora, including German (since the current survey, which has a very limited scope). I have limited time and resources.

Normally reaching out to all relevant locales is what CLDR can do best, by posting guidelines, by e-mailing (on behalf of the CLDR administrator and/or on the public CLDR-users Mail List), and by prioritizing the items on the vetters’ dashboards.

If I can do something else, I’m ready, but people should not abuse that, since I have many other tasks that I am no longer going to deprioritize. At some point I’ll just start reporting to end-users that we’ve strived to get locale data in synch, but that CLDR ended up rolling back our efforts, alleging other priorities. If that is what you wish, I’d say that there’s no problem for me except that I strongly dislike documenting an ugly mess.
>
>>
>>> (2) have a solution that works for lining figures as well as separators.
>>>
>>> (3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts, you can hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.
>>>
>> There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar
>> of what PUNCTUATION SPACE should have been since the beginning.
>
> ==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, it will not change something that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it).
>
You may wish to look up Unicode’s own PRI#308 background page, where they already hinted they’ve made sure it isn’t.
http://www.unicode.org/review/pri308/pri308-background.html
https://www.unicode.org/review/pri308/
https://www.unicode.org/review/pri308/feedback.html

> If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible.
>
Actually, again, there is a problem with NNBSP in Mongolian.

Richard Wordingham reported at thread launch that Unicode have started tweaking that space in a way that makes it unfit for French.

Now since you are aware that this operating mode is wrong, I’d suggest that you get back to them with feedback about the inappropriateness of the latest changes. Other people (including me) may do that as well, but I see better chances for your recommendations to get implemented. I say that because recently I strongly recommended in several pieces of feedback that the math symbols should not be bidi-mirrored on a tilde–reversed-tilde basis, because mirroring these compromises legibility of the tilde symbol in low-end environments relying on glyph-exchange-bidi-mirroring for best-fit display, but UTC took no action, and off-list I was told that UTC is not interested. Nothing else than that, in private mail. UTC are just not interested, without providing any technical reasons. Perhaps you understand better now why I posted what I suspected to be the reason why UTC is not interested, or was not interested, in supporting a narrow non-breaking space until Mongolian was encoded and needed the same for the purpose of appending suffixes (as opposed to separating vowels, which is performed by a similar space with another shaping behavior, and proper to Mongolian). A hypothesis that you firmly dispelled afterwards, but without answering my question about */why UTC was ignoring the demand for a narrow non-breaking space, delaying support for French and heavily impacting French implementations still today/* due to less font support than if that space had been in Unicode from version 1.1 on.
>
>>> Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.
>>>
>> Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we ended up with a mutated Mongolian Space that does not work properly for Mongolian, but does for French and other languages using the Latin script. It would do so even more if TUS were blunter, urging all foundries to update their whole catalogue soon.
>
> ==> You realize that I'm giving you general advice here, not something utterly specific to NNBSP - I don't have the inputs and background to know whether your approach is feasible or perhaps the best possible?
>
It is not “my approach”.

Other List Members may wish to help you answer my questions.
>
> As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as part of the added math support in Unicode 3.2). We need to be sure that the assumptions about these that may have been made in math typesetting  are not invalidated.
>
That adds to the reasons why I’m asking why PUNCTUATION SPACE was not made non-breakable when FIGURE SPACE was. The math usage has probably originated in repurposing that space on the basis of its line breaking behavior. I don’t suggest making it non-breakable now. That deal was broken and will remain broken. Now we must live with NNBSP and get more font support, while trying to stop Unicode from making a mess of it that helps neither Mongolian nor French nor all (other) locales grouping digits with a narrow space.
>
> Not sure offhand whether UTR#25 captures all of that, but if you ever feel like proposing a property change you MUST research that first (with the current maintainers of that UTR or other experts).
>
I have NOT proposed any property change, and PUNCTUATION SPACE or "2008" are NOT found in UTR #25 (Unicode Support for Mathematics).
>
> This is the way Unicode is different from CLDR.
>
Marcel
Received on Sat Jan 19 2019 - 01:34:44 CST
