I’d take this as a touchstone to infer that there were actual data files containing the standard typographic spaces encoded in U+2000..U+2006, and electronic table layout using the others: "U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period."

Covering existing character sets (National, International and Industry) was an (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition for enabling data migration to Unicode, as well as for enabling Unicode-based systems to process and display non-Unicode data (by conversion).
Is that correct?
May I remind you that the beginnings of Unicode predate the development of the World Wide Web. By 1993 the web had developed to the point where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.
When Unicode was first developed, it was best to work from the definitions of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of the character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)
These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646, some characters from that effort that were not in the early Unicode drafts were also added.
The code points from U+2000..U+2008 are part of that early collection.
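As an aside, if you want to see how these code points stand today, here is a minimal Python sketch (using only the standard unicodedata module; the output reflects current property data, of course, not the original drafts):

    import unicodedata

    # List U+2000..U+2008 with their current Unicode names and
    # general categories (all are Zs, "space separator").
    for cp in range(0x2000, 0x2009):
        ch = chr(cp)
        name = unicodedata.name(ch, "<unnamed>")
        cat = unicodedata.category(ch)
        print(f"U+{cp:04X}  {cat}  {name}")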
Note that, prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were, and the function of the character set was just to give a passive enumeration.
Unicode's character property model changed all that - but it meant that properties for all of the characters had to be determined long after they were first encoded in the original sources, and with only scant hints as to what these characters were intended to be. (Often, the only hint was a character name and a rather poor bitmapped image.)
If you want to know the "legacy" behavior for these characters, it is therefore more useful to see how they have been supported in existing software, and how they have been used in documents since then. That gives you a baseline for understanding whether any change or clarification of the properties of one of these code points will break "existing practice".
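For instance, a minimal sketch of such a survey, assuming a corpus of plain-text UTF-8 files (the command-line interface and the encoding are my assumptions, not anything prescribed here):

    import sys
    from collections import Counter

    # Tally occurrences of U+2000..U+2008 in the plain-text files
    # named on the command line - a crude baseline for how these
    # characters actually appear in documents.
    TARGETS = {chr(cp) for cp in range(0x2000, 0x2009)}

    counts = Counter()
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            for ch in f.read():
                if ch in TARGETS:
                    counts[ch] += 1

    for ch, n in sorted(counts.items()):
        print(f"U+{ord(ch):04X}: {n}")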
Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de facto useless because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.
However, if there's a consensus interpretation of a given character, then you can't just go in and change it, even if that would make the character work "better" in a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.
A./