Re: Private Use Agreements and Unapproved Characters

From: Patrick T. Rourke (ptrourke@methymna.com)
Date: Wed Mar 13 2002 - 08:58:19 EST


Thanks to everyone who has commented, especially John Cowan, Doug Ewell,
and David Starner (I'm on the digest, and so apologize if I haven't
thanked someone who has provided substantial comments). Thanks too to
Mr. Overington, though with Mr. Kaplan I agree that this is a bit too
much work to avoid the minor issue of overlapping PUA uses for my
purposes; I was hoping merely to find an existing registry which might
have some overlap with the user community I'm concerned with. I'm
replying mainly to Mr. Ewell's comments, which are the kind of
counter-arguments I was hoping to be able to consider.

Sorry to be coy, but since I'm writing a proposal (not a Unicode
proposal) to the authors of a couple of Unicode proposals for such a
registry, and since the proposals which would be included in the
registry are ones I did not have any hand in writing, I think it would
be better for me to avoid too much precision until I've got the approval
of the proposal writers (who would also be among the most important of
my targeted users).

> There's no reason it has to be that way. Proposed glyphs are posted on
> the Unicode Web site months in advance of their "go live" date, even
> before the beta period, largely for this reason. I'm sure Unicode-aware
> type designers like John Hudson don't wait until a version of Unicode is
> formally released before they start designing glyphs.

True, but many scholarly communities are small enough that their needs
might not be of interest to type designers with a wider targeted
audience (like Mr. Hudson), and so depend largely upon small
typographers, even amateurs, to provide their type. In such cases, it
would seem to me that a registry such as the one I'm suggesting would
help to drive the transition. At any rate, I've already had two type
designers who've done type for the community show interest in such a
registry.

> One important point to remember is that any use or proposed use of the
> PUA, such as ConScript, is strictly up to private organizations, not the
> Unicode Consortium. To be sure, ConScript is the domain of two guys who
> are quite influential in Unicode, but they do not maintain ConScript in
> any official capacity as representatives of Unicode.

Fully aware of this. I'm thinking that this would be an improvement
over the status quo, which is, as David Starner suggested, the use of
informal private encodings or escaped entities.

> I would think you could simply use the version number of the Unicode
> Standard. For example, the use of Tagalog would have been conformant to
> this proposed PUA registry until Unicode version 3.2, at which time it
> would have to be removed from the registry because of its introduction
> into Unicode.

This had not occurred to me! The only thing that would militate against
this would be if additional characters were identified which had not yet
been proposed and were proposed at a later date; that would require a
new registry version number which would not be a Unicode version number,
and so might be distinguished using a letter, etc. (I don't foresee this
happening, but it's better to be safe than sorry, no?).

> Conformance to this registry, especially over a period of time, is up to
> the user community. The presence of a standard is no guarantee that it
> will be followed, or even noticed.

Excellent, this is the problem I was most concerned with. The target
users for the registry would be a small number of electronic scholarly
publishers in the community. The license for the fonts would strongly
recommend that content providers using registry-based fonts convert
their character data to the Unicode-approved code points within, say,
six months of release, and for the target publishers this wouldn't be a
problem. If the distribution sites for the released fonts all included
prominent links to the registry site, and the registry site provided
information on the progress of the characters in the encoding process,
this would, I hope, drive the adoption of later versions.

So those outside the target user group would at least be made aware of
the process by the license, and a mechanism would be in place to prevent
the dead hand of the older versions of the registry from being quite so
strong.
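
The conversion step described above (moving content from registry PUA
code points to the code points eventually approved by Unicode) could be
sketched roughly as follows. This is only an illustration: the table
PUA_TO_UNICODE, its contents, and the function names are hypothetical
placeholders, not data from any actual registry.

```python
# Hypothetical registry table: PUA code point -> code point approved by
# Unicode. Both columns are illustrative placeholders, not registry data.
PUA_TO_UNICODE = {
    0xE000: 0x103A0,
    0xE001: 0x103A1,
}

# The BMP Private Use Area runs from U+E000 to U+F8FF inclusive.
BMP_PUA = range(0xE000, 0xF900)


def convert(text, table=PUA_TO_UNICODE):
    """Replace registry PUA characters with their approved code points."""
    return text.translate(table)


def unconverted(text, table=PUA_TO_UNICODE):
    """Return the BMP PUA characters in text that the table does not cover."""
    return sorted(
        {ch for ch in text if ord(ch) in BMP_PUA and ord(ch) not in table}
    )
```

A content provider might run convert() over its documents and then use
unconverted() to list any PUA characters still awaiting an approved
code point.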

> Suppose Old Persian Cuneiform is encoded in Patrick's PUA registry next
> week, and that encoding achieves some popularity. Then suppose at some
> later date it is encoded in Unicode, say version 4.1. This will
> necessarily cause the encoding in Patrick's registry to be withdrawn, or
> at least deprecated.

I was thinking deprecated for two versions or two years, whichever was
longer, and then ultimately withdrawn.
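
One way to read "two versions or two years, whichever was longer" is
sketched below. It assumes that "versions" means releases of the
registry published after the deprecation, which is an interpretation
rather than something settled above; the function name and parameters
are made up for the example.

```python
from datetime import date


def withdrawal_date(deprecated_on, later_releases):
    """Return the withdrawal date, or None if it cannot be fixed yet.

    deprecated_on  -- date on which the registry entry was deprecated
    later_releases -- release dates of registry versions published after
                      the deprecation (assumed meaning of "versions")
    """
    releases = sorted(later_releases)
    if len(releases) < 2:
        # The second following version has not been released yet, so the
        # later of the two limits is still unknown.
        return None
    try:
        two_years = deprecated_on.replace(year=deprecated_on.year + 2)
    except ValueError:
        # Deprecated on 29 February; the target year has no such day.
        two_years = date(deprecated_on.year + 2, 3, 1)
    # "Whichever was longer": take the later of the two limits.
    return max(two_years, releases[1])
```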

> How many people will switch immediately to the
> sanctioned Unicode encoding? How quickly will existing software and
> data be converted? Probably not right away, and the chances for a
> timely conversion are less if the private-use encoding is particularly
> successful, whether or not there are scripts available to help people
> make the conversion.

There would in fact be a published time-table. Of course, if the
private-use encoding became popular enough that it was used OUTSIDE the
targeted group of content providers, this would become an issue. But
since the targeted group of content providers are pretty influential in
the community (e.g., most users in the community would need to get a
font that could be used to read the targeted groups' content), I'm
hoping that their transition would drive the transition of other content
providers.

So obviously this idea is strongly dependent upon the approval and
cooperation of the targeted group of content providers, and so would
have to be abandoned if I did not convince them.

> This is exactly the reason for the "rigorous proposal/review policy"
> mentioned earlier, and perhaps the biggest drawback to the concept of a
> widespread PUA encoding for future Unicode scripts. It usually does
> take a while to get characters encoded in Unicode, not just because
> committees are big and slow and bureaucratic, but because there are real
> decisions to be made that can take a lot of time and research. Rushing
> these characters into use before Unicode and WG2 have finished making
> these decisions could subvert the process and create the dilemmas
> Patrick mentioned.

The point is that the registry would not be "rushing characters into
use," but that they would be characters which were already in use with a
variety of non-standardized methods and which are widely used in print
in the community.

I'm all too aware of why it takes time and research - for example, there
are times when it is very difficult to distinguish a unique character
from a variant letterform. However, there are characters which are
unambiguously represented as entities in an existing private encoding,
and are present as glyphs in existing privately "encoded" fonts (which
are not compatible with one another), and which are clearly not merely
alternate glyphs, but unique characters. These characters are ones
which I would think could be included in such a registry, and would have
a very high probability (I'd guess 90% or more) of being encoded. But
my ability (with the help of others who are familiar with both the
principles of Unicode and with the needs of the community) to "predict"
whether a character would be approved by Unicode and WG2 isn't going to
be 100% accurate. So it would seem to me that the best route would be
to include the proposals in toto and work out what will be done if
certain characters are not encoded.

It seems obvious to me that if all the proposals were rejected for some
reason, the PUA registry would just continue on as-is. But if there were
hard-to-dispute reasons why a particular character in a proposal was
rejected, that character would have to be discontinued in some way.
Would deprecation without deletion make sense for this circumstance?

Does this answer your objections, do you think? (I'm not asking if
you're convinced, only if you think it's something that you'd consider
reasonable, even if you disagree with it).

Another serious issue. The characters are such that I doubt they would
be approved for the BMP. Most of the tools being used by the users in
the community in question (mostly Windows 98 and Mac OS 9 word
processors and web browsers - yes, Mac OS 9 will be a problem anyway)
are not yet able to handle secondary plane characters, at least not
without serious intervention. The PUA code points which would be used
would be in the BMP because use of the secondary plane PUA (I don't
remember the code points, so forgive me for not knowing what plane(s)
they're in) would be an obstacle to adoption. The problem will be
getting the targeted content providers to agree beforehand to convert
their content to the approved code points when they become available,
as the BMP code points are easier to support. Does anyone have any advice /
prior experience for dealing with this issue?
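
For reference, the private-use ranges are U+E000..U+F8FF in the BMP,
U+F0000..U+FFFFD in plane 15, and U+100000..U+10FFFD in plane 16. The
short sketch below (function names are just for illustration) reports
which area a character falls in and how many UTF-16 code units it
takes, which is where tools limited to the BMP tend to run into
trouble.

```python
# Private-use ranges defined by the Unicode Standard (inclusive bounds).
PUA_RANGES = [
    ("BMP", 0xE000, 0xF8FF),
    ("Plane 15", 0xF0000, 0xFFFFD),
    ("Plane 16", 0x100000, 0x10FFFD),
]


def pua_area(ch):
    """Return the name of the private-use area holding ch, or None."""
    cp = ord(ch)
    for name, lo, hi in PUA_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2
```

A BMP private-use character occupies one code unit, while a character
in plane 15 or 16 needs a surrogate pair, which software that assumes
one 16-bit unit per character will not handle correctly.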

Finally, are there any existing resources describing / testing support
for PUA characters in existing applications, besides Alan Wood's test
page? Perhaps at ConScript?

Thanks again for taking the time to answer these questions.

Patrick Rourke
ptrourke@methymna.com


