Re: On the possibility of guidance code points for the Private Use Area

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 23 2001 - 15:15:01 EDT


Michael Everson and Doug Ewell already pointed out that neither the
UTC nor SC2/WG2 is going to endorse any standard interpretation of
any code points in the private use areas -- including any proposal
to specify certain code points as "guidance code points" for alternate
registries of private usage.

I'd like to emphasize that even further. For a character encoding
organization to stand behind its designation of a code point as
"private use", it must *absolutely* guarantee that it will never
assign any standard interpretation to that code point. Assigning a
character, or even some kind of generic, non-character-specific
semantics, to a private use code point would necessarily run afoul
of someone's private usage of that code point to indicate something
else entirely.

> Whilst recognizing that anyone may define private meanings to the code
> points in the private use area I feel that it is not really as simple as
> that in practice as the word private used in this context does not really
> mean private in the usual context of the word private, it means something
> such as "unrecognized as exclusively standardized" for the
> unicode documentation; for, although the documentation states ".... and do
> not have defined, interpretable semantics except by private agreement."
> later in the unicode documentation it is stated ".... or they could be
> published as vendor-specific character assignments available to applications
> and end-users." The use of the word published to some extent contradicts
> the notion of private stated in ".... and do not have defined, interpretable
> semantics except by private agreement." for publication does not imply
> agreement and, at least in England, and perhaps elsewhere, publication of a
> matter makes it a matter of public interest.

This is a quibble about the usage of the terms "private" and "public".
Any shared private use is by definition public. And of course, if you
publish private usage of code points and encourage others to make use
of your interpretation, that is even more public.

The point for the standardizing organizations is that the interpretation
of private use code points is not *standardized* -- that is what makes
them "private use" -- not whether there is any public, shared usage of
particular "private" interpretations of PUA code points.

>
> Now one problem that I envisage could arise is that some large company with
> a large marketplace may take it upon itself to define and publish a
> character set of its own for the private use area and that character set may
> well, at a practical level, reduce the elegant freedom of the private use
> area to effectively the say-so of that one particular company. An even
> worse scenario could be that several large competing companies would each
> define and publish a character set of their own with the result of creating
> ambiguity and perhaps effectively squeezing out the use of the private use
> area for other than those character sets.

As a practical matter this is not a serious issue. All of the large
software companies that have a history of developing corporate character
encodings want that history firmly behind them. That is why they all
belong to the Unicode Consortium and continue to work to ensure that
all characters needed for public interchange are included in the Unicode
Standard.

Of course, companies like Microsoft and IBM do have their own private
use interpretations of PUA code points -- often for internal cross-mapping
tables, for example. But they have no intention of making those internal
designations into new corporate "character sets" and then trying to
foist them on the marketplace.

> There is at least one registry at present. If this idea of guidance codes
> finds favour in the unicode user community, then that registry, if it so
> chooses, could have one code in the range U+E801 to U+E87F that, in so far
> as uniquely can mean in this context of private agreement, uniquely
> indicates use of its character set. The practice would be that any plain
> text document that used the characters from that registry could have that
> guidance code near its start and thus increase the chances of the document
> being interpreted as intended. In addition, use of guidance codes would
> enable characters from two mutually exclusive uses of the private use area
> to be used in the same plain unicode text file.

I think you are overestimating the need for this kind of mechanism.
I see no indication that there is likely to be a proliferation of such
registries any time soon. Other than the ConScript registry, most usage
I know of for private use code points falls into one of two categories:

   A. Intentional private use for non-exchanged data, e.g., the
      cross-mapping table usage I mentioned above, font-internal
      tables, or for various other kinds of internal markers. (For
      example, I make use of three PUA code points in a collation
      weighting algorithm to indicate virtual combining marks for
      secondary weight variation -- but that is explicitly not
      intended for interchange as characters outside of the private
      context.)

   B. Groups working on more-or-less experimental encodings of
      difficult or historic scripts not yet standardized. The PUA
      code points give them a mechanism to try things out and to
      interchange data. But in most of these instances, the intent
      is not to codify and register some private usage forever; these
      efforts are mostly just stepping stones to the eventual
      development of *standardized* encodings of the scripts in
      question in the Unicode Standard and in 10646.

--Ken


