A software program that encounters, in a file of plain unicode text, a
character in the private use area from U+E000 to U+F8FF needs to decide
what to do about that character.
This may well present a problem.
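As a minimal sketch of the first step, the Python fragment below scans a
string for code points in the range U+E000 to U+F8FF. The function name
find_private_use and the sample string are assumptions made purely for
illustration; what a program should do once it has found such a character
is exactly the open question.

```python
PUA_FIRST = 0xE000  # first code point of the private use area discussed here
PUA_LAST = 0xF8FF   # last code point of that area


def find_private_use(text):
    """Return (index, code point) pairs for each private use character."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if PUA_FIRST <= ord(ch) <= PUA_LAST
    ]


sample = "abc\uEB00def\uF8FF"
for index, cp in find_private_use(sample):
    print(f"index {index}: U+{cp:04X}")
```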
A software program such as a wordprocessing package that uses a higher level
protocol for formatting control may well not have the problem most of the
time, as the meaning of any private use area codes that it meets may be
determined by the choice of font made before the characters are
encountered. However, such a package may well incorporate facilities to
import files of plain unicode text and to export files from its own format
to plain unicode text. It may therefore meet the same input problem at
times, and may also need to output plain unicode text that uses the private
use area without losing the meaning assigned to the private use codes in
that particular application.
I recently found a need to use the private use area and, whilst it was not
necessary for me to do so, I checked with the ConScript registry documents
and I checked about Junicode so that my research would not clash with them.
I felt that this was a combination of both courtesy and self interest.
In the event, I am using U+EB00 through to U+EFFF for the research, which I
mention both in the hope that those people who note uses of the private use
area might like to make a note of it and also so that I may declare an
interest in the matter of the guidance code points concept that I am putting
forward in this document.
It occurs to me that, as time goes on, there will quite possibly come a
time when a software package trying to read a plain unicode text file
meets a code point in the private use area and has difficulty deciding
what to do.
Whilst recognizing that anyone may assign private meanings to the code
points in the private use area, I feel that it is not really as simple as
that in practice. The word private in this context does not really mean
private in the usual sense of the word; in the unicode documentation it
means something such as "unrecognized as exclusively standardized".
Although the documentation states ".... and do not have defined,
interpretable semantics except by private agreement." it later states
".... or they could be published as vendor-specific character assignments
available to applications and end-users." The use of the word published
to some extent contradicts the notion of private in the first quotation,
for publication does not imply agreement and, at least in England, and
perhaps elsewhere, publication of a matter makes it a matter of public
interest.
Now one problem that I envisage could arise is that some large company with
a large marketplace may take it upon itself to define and publish a
character set of its own for the private use area and that character set may
well, at a practical level, reduce the elegant freedom of the private use
area to effectively the say-so of that one particular company. An even
worse scenario could be that several large competing companies would each
define and publish a character set of their own with the result of creating
ambiguity and perhaps effectively squeezing out the use of the private use
area for other than those character sets.
On the other hand, it might be quite helpful for additional character sets
defined by large companies with a large marketplace to be available,
provided that their availability did not cause confusion.
I therefore put forward for discussion a suggestion of the possibility of
guidance code points for the private use area. I do this knowing that it
cannot ever be endorsed by the Unicode Consortium. However, I put the idea
forward in the hope that, maybe after such modification as people on this
list may suggest, the idea might be generally recognized by private
agreement by most people for most applications.
The idea is firstly that each character of U+E801 through to U+E87F would
have a meaning along the following lines to a software package processing a
file of plain unicode text.
"Please recognize until further notice that the codes in the range U+uv00 to
U+xyFF of the private use area have the meanings set out in the
NAME_OF_A_PARTICULAR_REGISTRY registry."
The definition could indeed be somewhat more general so that several
non-contiguous sections of the private use area could be specified for a
particular registry if that facility were desired.
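One way to picture such a definition is as a small record holding the
guidance code, the registry name and one or more ranges of the form U+uv00
to U+xyFF. The sketch below is only an illustration under assumptions: the
class name GuidanceCode, the registry name EXAMPLE_REGISTRY and the
assignment of U+E801 to it are all hypothetical and are not taken from any
existing registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidanceCode:
    code_point: int  # the guidance code itself, U+E801..U+E87F
    registry: str    # name of the registry whose meanings apply
    ranges: tuple    # ((first, last), ...) inclusive, possibly non-contiguous

    def __post_init__(self):
        if not 0xE801 <= self.code_point <= 0xE87F:
            raise ValueError("guidance codes lie in U+E801..U+E87F")
        for first, last in self.ranges:
            # Each range must start at U+uv00 and end at U+xyFF.
            aligned = (first & 0xFF) == 0x00 and (last & 0xFF) == 0xFF
            if not (aligned and 0xE000 <= first <= last <= 0xF8FF):
                raise ValueError("each range must run from U+uv00 to U+xyFF")

    def covers(self, cp):
        """True if this guidance code gives a meaning to code point cp."""
        return any(first <= cp <= last for first, last in self.ranges)


# Hypothetical example: one registry claiming two separate sections.
EXAMPLE = GuidanceCode(
    0xE801,
    "EXAMPLE_REGISTRY",
    ((0xE000, 0xE7FF), (0xF000, 0xF8FF)),
)

print(EXAMPLE.covers(0xE123), EXAMPLE.covers(0xEB00))
```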
A software author wishing to make use of guidance codes could build into his
or her software the recognition of such guidance codes as he or she chooses.
Such recognition could perhaps be used so as to automatically switch the
choice of display font being used to produce the display of private use area
codes for the relevant section of the document.
There is at least one registry at present. If this idea of guidance codes
finds favour in the unicode user community, then that registry, if it so
chooses, could have one code in the range U+E801 to U+E87F that uniquely
indicates use of its character set, in so far as anything can be unique in
this context of private agreement. The practice would be that any plain
text document that used the characters from that registry could have that
guidance code near its start and thus increase the chances of the document
being interpreted as intended. In addition, use of guidance codes would
enable characters from two mutually exclusive uses of the private use area
to be used in the same plain unicode text file.
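To illustrate the last point, the sketch below walks through a string in
which two registries both claim U+E000 to U+E0FF, and reports which
registry governs each private use character. The registry names
REGISTRY_A and REGISTRY_B, their guidance codes U+E801 and U+E802, and the
function name annotate are all hypothetical. The rule that the most recent
guidance code wins is one possible reading of "until further notice", not
a settled part of the proposal.

```python
# Hypothetical guidance codes: both registries claim U+E000..U+E0FF.
GUIDANCE = {
    0xE801: ("REGISTRY_A", 0xE000, 0xE0FF),
    0xE802: ("REGISTRY_B", 0xE000, 0xE0FF),
}


def annotate(text):
    """Yield (character, registry or None) for each private use character.

    Recognized guidance codes are consumed and not yielded. A private use
    character not covered by any guidance seen so far is paired with None.
    """
    active = []  # guidance seen so far, most recent first
    for ch in text:
        cp = ord(ch)
        if cp in GUIDANCE:
            active.insert(0, GUIDANCE[cp])
            continue
        if 0xE000 <= cp <= 0xF8FF:
            registry = next(
                (name for name, first, last in active if first <= cp <= last),
                None,
            )
            yield ch, registry


text = "\uE801\uE010 and \uE802\uE010"
for ch, registry in annotate(text):
    print(f"U+{ord(ch):04X} -> {registry}")
```

The same code point U+E010 is reported once under each registry, which is
the situation that would otherwise be ambiguous in plain text.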
The values of u, v, x, y are intended to be hexadecimal digits, chosen
for each guidance code so that the range lies within the private use area.
The idea is that a software package could keep track of meanings using a
table with 25 entries, one for U+E000 to U+E0FF, one for U+E100 to U+E1FF,
and so on up to U+F800 to U+F8FF. A guidance code could provide guidance for
any one or more 256 code point blocks of the private use area. Usually such
a guidance code would not try to override the code points from U+E800 to
U+E8FF, though perhaps the possibility should not be totally ruled out.
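The table described above can be sketched as a simple list with one entry
per 256 code point block. The names block_index, new_table and
apply_guidance, and the registry name EXAMPLE_REGISTRY, are assumptions for
illustration only. The number of entries is computed from the limits of
the private use area rather than written in by hand.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF

# One table entry per 256 code point block of the private use area.
BLOCK_COUNT = (PUA_LAST - PUA_FIRST + 1) // 256


def block_index(cp):
    """Index 0 for U+E000..U+E0FF, 1 for U+E100..U+E1FF, and so on."""
    if not PUA_FIRST <= cp <= PUA_LAST:
        raise ValueError(f"U+{cp:04X} is not in the private use area")
    return (cp - PUA_FIRST) >> 8


def new_table():
    """A fresh table in which no block has any registry assigned."""
    return [None] * BLOCK_COUNT


def apply_guidance(table, registry, first, last):
    """Record that blocks from U+uv00 (first) to U+xyFF (last) use registry."""
    for i in range(block_index(first), block_index(last) + 1):
        table[i] = registry


table = new_table()
apply_guidance(table, "EXAMPLE_REGISTRY", 0xEB00, 0xEFFF)
print(BLOCK_COUNT)
print(table[block_index(0xEB42)], table[block_index(0xE000)])
```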
The code U+E800 would mean the following.
"In regard to the interpretation of codes in the private use area, please
revert to using the default settings of this software package."
This code is phrased this way so that a designer of a software package that
uses private use area codes entirely of its own or where the designer
chooses not to bother about the guidance code system is in no way whatsoever
even tacitly obligated to include any indication of recognizing guidance
codes in his or her work. Many uses of guidance codes would never use the
U+E800 code.
The codes U+E880 to U+E8FF would be kept in reserve so that, should the
number of registries ever approach 127, some decision could be
made as to how to proceed at that time.
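The division of U+E800 to U+E8FF into a reset code, the guidance codes and
the reserved codes can be summarized in a few lines. This is a sketch of
the layout as proposed in this message; the function name classify and the
labels it returns are illustrative assumptions.

```python
RESET = 0xE800                                # revert to the package defaults
GUIDANCE_FIRST, GUIDANCE_LAST = 0xE801, 0xE87F  # one code per registry
RESERVED_FIRST, RESERVED_LAST = 0xE880, 0xE8FF  # held back for later decision


def classify(cp):
    """Return the role a code point plays in the proposed scheme."""
    if cp == RESET:
        return "reset"
    if GUIDANCE_FIRST <= cp <= GUIDANCE_LAST:
        return "guidance"
    if RESERVED_FIRST <= cp <= RESERVED_LAST:
        return "reserved"
    return "other"


# Number of guidance codes available before the reserve is needed.
print(GUIDANCE_LAST - GUIDANCE_FIRST + 1)
for cp in (0xE800, 0xE801, 0xE87F, 0xE880, 0xEB00):
    print(f"U+{cp:04X}: {classify(cp)}")
```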
The question arises of how any person might find out which guidance codes
are in use and what they mean. My suggestion is that each registry keeps a
list of all of the guidance codes and makes it available on the web site of
the registry.
A mechanism to obtain a unique guidance code for a new registry would also
be needed. Would it be permissible for anyone who so wishes to enquire on
this list and be advised of the system and after discussion be allocated a
unique code for the purpose?
I recognize that the introduction of guidance codes could perhaps be viewed
as constraining the freedom of usage of the private use area in some way.
Yet standards constrain absolute freedom in return for benefits. It is a
matter for discussion as to whether the introduction of guidance codes for
the private use area and most people recognizing them most of the time would
be of benefit to the unicode user community, whilst bearing in mind that
everyone is entirely free at any time to totally disregard them for a
particular application.
William Overington
23 April 2001