Re: Private Use proposals (long)

From: Doug Ewell (dewell@adelphia.net)
Date: Fri May 24 2002 - 03:16:34 EDT


Michael Everson <everson at evertype dot com> wrote:

> At 08:51 -0700 2002-05-21, Doug Ewell wrote:
>> (Deseret and Shavian were encoded in ConScript; whether that helped
>> get them into Unicode or not, I don't know.)
>
> Certainly not. They were examined on their merits just like anything
> else.

Of course they were. By "helped" I didn't mean that the characters
wouldn't otherwise have been worthy of encoding, but that the CSUR
assignments might have resulted in additional usage, which in turn
might have gotten the attention of the UTC and/or WG2.

I'm trying to examine the passage in TUS 3.0, Section 13.5 (p. 323),
which seems to have caught Mr. Overington's fancy:

<quote>
Promotion of Private-Use Characters.

In future versions of the Unicode Standard, some characters that have
been defined by one vendor or another in the Corporate Use subarea may
be encoded elsewhere as regular Unicode characters if their usage is
widespread enough that they become candidates for general use. The code
positions in the Private Use Area are permanently reserved for private
use -- no assignment to a particular set of characters will ever be
endorsed by the Unicode Consortium.
</quote>

Ignoring the last sentence, because we all seem to be on board with
that, I think the image of the PUA that may have emerged from this is
that of a test bed for proposed characters. In this scenario,
characters are encoded in the PUA *so that* they will gain increased
usage, *so that* the UTC will take note of the increased usage and
respond by "promoting" the character to Unicode. (I think the use of
the word "promotion" in the 13.5 subhead is turning out to be a bad
idea, as it implies a simple and straightforward progression.)

As I mentioned earlier, as far as I know no script or character has
followed this path deliberately -- that is, been encoded in the PUA for
the express purpose of satisfying Unicode's "widespread usage"
requirement. Of course, we all know (don't we?) that a script or
character must satisfy many other criteria as well. Deseret and Shavian
obviously did satisfy those criteria, as well as being judged to have
sufficiently "widespread usage."

Those additional criteria -- not frequency of usage -- are what will
prevent additional Latin ligatures from being "promoted" to Unicode.

To answer (I hope) some of William's other points:

> Well, the ideas are not intended to be quasi-official. Just one end
> user of the Unicode system seeking to use the Private Use Area to
> good effect and putting forward ideas to other end users who might
> like to consider using some of the facilities suggested.

Hooray for that. The PUA is there for just that purpose. However, in
the spirit of using Unicode, please also respect the character-glyph
model, which says (among other things) that a ligature is a glyph, to
be supplied by the font, not a character requiring its own code point.

> Now, the fact is that Michael suggested a feature named ZERO WIDTH
> LIGATOR specifically for the purpose of ligation and it appears that
> that suggestion has not been accepted, but that a shared solution
> with a code point that can also mean something else has been decided
> upon. Now, I do not know the details of all of this and I certainly
> hope to study the matter more, yet, as someone who is not a linguist
> as such but an inventor and programmer, I have a concern that using
> one code point for two types of meaning rather than one code point
> for each type of meaning is what I call a software unicorn. The
> concept of a software unicorn can be read about on
> http://www.users.globalnet.co.uk/~ngo/euto0008.htm if anyone is
> interested.

I gather from the article that a software unicorn is an unlikely,
perhaps impossible, situation that nevertheless must be handled because
it cannot be completely ruled out. Lots of "defensive" code gets
written to handle such situations, often with a comment like:

    default: // this can't happen, but...
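
To make that concrete, here is a toy C sketch (my own illustration,
not anything from William's article) of such a defensive default arm;
the enum and function names are invented:

    #include <stdio.h>

    enum lock_state { LOCKED, UNLOCKED, UNKNOWN };

    static const char *describe(enum lock_state s)
    {
        switch (s) {
        case LOCKED:   return "locked";
        case UNLOCKED: return "unlocked";
        case UNKNOWN:  return "unknown";
        default:       // this can't happen, but...
            return "corrupt state -- please report this bug";
        }
    }

    int main(void)
    {
        printf("%s\n", describe(LOCKED));
        return 0;
    }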

In this context, I think William is saying that it's risky to overload
ZWJ to handle Latin ligation because we can't completely rule out the
possibility that we might need ZWJ to "join" Latin characters the way it
currently joins Arabic characters. This concern can probably be put to
rest by reading the description in Section 13.2 of UAX #27, "Unicode
3.1". The description carefully spells out the
relationship between "cursively connected" and "ligated" renditions and
the roles ZWJ and ZWNJ play in determining the rendition to be used.
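
In case it helps, here is a small C sketch (my construction; only the
code point values U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
NON-JOINER come from the standard) that emits the two byte sequences
a renderer would see:

    #include <stdio.h>

    // Per UAX #27, <f, ZWJ, t> requests a more connected rendering
    // (an "ft" ligature if the font has one), while <f, ZWNJ, t>
    // requests the unligated forms.
    static void put_cp(FILE *out, unsigned int cp)
    {
        // UTF-8 for code points in U+0800..U+FFFF: three bytes.
        fputc(0xE0 | (cp >> 12), out);
        fputc(0x80 | ((cp >> 6) & 0x3F), out);
        fputc(0x80 | (cp & 0x3F), out);
    }

    int main(void)
    {
        fputc('f', stdout);
        put_cp(stdout, 0x200D);  // ZWJ: request the ligature
        fputc('t', stdout);
        fputc('\n', stdout);

        fputc('f', stdout);
        put_cp(stdout, 0x200C);  // ZWNJ: suppress any ligature
        fputc('t', stdout);
        fputc('\n', stdout);
        return 0;
    }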

> As to strong opposition to encoding additional presentation forms
> for alphabetic characters, well, we live in a democratic society and
> if some people who would like to produce quality printing feel that
> using a TrueType fount with some ligature characters does what they
> want and harms no one else, what exactly is the objection?

Ah, but it *isn't* harmless. It causes problems for normalization. For
homework tonight, read UAX #15, "Unicode Normalization Forms." The key
point for our discussion is that creation of additional canonical or
compatibility equivalents -- such as a new ligature for two existing
characters -- would destabilize the normalization process, because
normalization engines based on different versions of Unicode might
produce different results. Beyond a certain point in time (defined as
Unicode 3.1), no new canonical or compatibility equivalences can be
defined. Because of this, a new "ft" ligature could not carry the
obvious compatibility mapping to 0066 0074, and the lack of that
mapping would destroy most of the benefit of encoding it in the first
place.
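
To illustrate the danger, here is a toy sketch (mine, with an invented
code point; real normalizers are of course table-driven) of two NFKD
engines, one built before and one after such a ligature was added:

    #include <stdio.h>

    #define HYPOTHETICAL_FT 0xE123  // invented for illustration only

    // "Normalize" a code point sequence: the newer engine decomposes
    // the ligature to <0066 0074>; the older engine, which predates
    // the character, passes it through untouched.
    static size_t nfkd(const unsigned int *in, size_t n,
                       unsigned int *out, int knows_new_mapping)
    {
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] == HYPOTHETICAL_FT && knows_new_mapping) {
                out[j++] = 0x0066;  // f
                out[j++] = 0x0074;  // t
            } else {
                out[j++] = in[i];
            }
        }
        return j;
    }

    int main(void)
    {
        unsigned int text[] = { 0x0073, HYPOTHETICAL_FT };  // s + ligature
        unsigned int a[4], b[4];
        size_t n_old = nfkd(text, 2, a, 0);
        size_t n_new = nfkd(text, 2, b, 1);
        // 2 vs. 3 code points: the two engines disagree on the
        // "normalized" form, so normalization is no longer stable.
        printf("old: %zu code points, new: %zu\n", n_old, n_new);
        return 0;
    }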

Fortunately, it is not necessary to assign new Unicode characters in
order to put your favorite Latin ligatures in a font. Just create the
"ft" ligature glyph and teach your font to substitute it in place of the
unconnected 0066 and 0074 glyphs. You can assign the ligature to a PUA
code point if you like, but if the internal mapping is done right it
isn't necessary for you to publicize the PUA code point, or for users to
use it directly. (I'm not a font designer, but the font designers on
this list say this is easy.)
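
As a programmer's sketch of what such a substitution amounts to (a toy
model of an OpenType-style 'liga' lookup, not how a real font engine
is implemented; the glyph IDs are invented):

    #include <stdio.h>

    enum { GLYPH_F = 71, GLYPH_T = 85, GLYPH_F_T = 300 };

    // Replace each adjacent (f, t) glyph pair with the ligature
    // glyph, in place; returns the new glyph count.
    static size_t substitute_ft(int *glyphs, size_t n)
    {
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n && glyphs[i] == GLYPH_F
                          && glyphs[i + 1] == GLYPH_T) {
                glyphs[j++] = GLYPH_F_T;
                i++;  // consume both input glyphs
            } else {
                glyphs[j++] = glyphs[i];
            }
        }
        return j;
    }

    int main(void)
    {
        int run[] = { GLYPH_F, GLYPH_T, GLYPH_T };  // "ftt"
        printf("%zu glyphs after substitution\n",
               substitute_ft(run, 3));  // prints 2
        return 0;
    }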

> As to whether a Private Use Area implementation has nothing to do
> with formal proposals is not, I feel, so clear cut. Certainly, I do
> not expect the fact that I have suggested four particular code points
> for various padlocks in the Private Use Area to influence a formal
> decision. Yet, by suggesting those four code points, if, at various
> organizations various people are, without making any public
> announcement, trying out a fount with two or four padlock symbols in
> them, then maybe, just maybe, they will use the code points that I
> suggested in my posting. If they do, this would then mean that if
> they try making test applications that make use of the padlock symbols
> expressed as Unicode code points then those test applications may be
> interoperable with test applications made by other researchers, which
> might be of benefit at some stage in the future, if perhaps various
> people make test founts with padlock symbols in them available for
> trials.

In other words, a grass-roots de facto standard for encoding padlock
symbols in the PUA would sort of emerge from the PUA code point
allocations you have suggested. Personally, I am skeptical it would
work out that way.
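
For what it's worth, such a convention is trivially easy to state; the
hard part is getting everyone to adopt it. A hypothetical header (the
values below are placeholders of my own, *not* the code points you
actually suggested):

    #include <stdio.h>

    // Hypothetical private agreement on PUA code points for padlock
    // symbols. These values are invented placeholders, and of course
    // no PUA assignment is ever endorsed by the Unicode Consortium.
    #define PADLOCK_LOCKED    0xE800  // closed padlock
    #define PADLOCK_UNLOCKED  0xE801  // open padlock

    int main(void)
    {
        printf("locked = U+%04X, unlocked = U+%04X\n",
               PADLOCK_LOCKED, PADLOCK_UNLOCKED);
        return 0;
    }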

BTW, regarding the question of two vs. four padlock symbols: You have
described a common and vexing problem involving the use of symbols as
simultaneous status indicators and prompts. Does that "lock" symbol
mean the object is currently locked, or does it mean I should press here
to lock it (implying that it is currently unlocked)? However, I don't
feel the encoding of padlocks with arrows indicating locking or
unlocking action would reduce the confusion, so if and when I write up a
proposal, it will be for only two characters.

-Doug Ewell
 Fullerton, California


