From: Ernest Cline (ernestcline@mindspring.com)
Date: Wed Apr 28 2004 - 15:00:51 EDT
> > This is most easily and most naturally controlled by the end users
> > of such introspective setups - simply do not allow conflicting PUA
> > code points on their systems. In such a scenario, the operating
> > system is not forced to make decisions.
>
> That seems unduly limiting. If I want to write a document in two
> scripts, each of which is supported by only one font, both of which use
> the same code point range for their characters, I'm stuck.
True, but not all that common a concern. Formatted documents
let one specify which font is intended, and the Plane 14 (SSP)
tag characters offer a possible solution for plain text. The case
of multiple Private Use scripts coming into conflict in the same
document is rare enough that even I, a proponent of a better set
of Private Use characters, am comfortable with depending upon
formatting markup to make the distinction. Something that could
do so in plain text would be nice, but it is not necessary. (A
sketch of how the tag characters work follows.)
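Each Plane 14 tag character is simply its ASCII counterpart
shifted up by 0xE0000. Here is a minimal Python sketch; note that
only U+E0001 LANGUAGE TAG is actually defined as an introducer, so
using a tag string to name a private use agreement would itself be
a hypothetical extension, not anything Unicode defines:

# Minimal sketch of Plane 14 tag characters (U+E0000 block).
# Each tag character is the corresponding ASCII character plus
# 0xE0000. Only U+E0001 LANGUAGE TAG is a defined introducer;
# everything else here is illustration only.

LANGUAGE_TAG = "\U000E0001"   # introduces a language tag string
CANCEL_TAG = "\U000E007F"     # cancels the tag in effect

def to_tag_string(ascii_text):
    """Shift printable ASCII (0x20-0x7E) into the tag range."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)

# Mark a run of plain text, invisibly to tag-ignorant processes:
tagged = LANGUAGE_TAG + to_tag_string("en-US") + "plain text" + CANCEL_TAG
assert all(0xE0000 <= ord(c) <= 0xE007F for c in tagged[:6])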
> There's been a lot of discussion of the PUA in this forum over the time
> I've been on it, but I don't think I've heard anyone make the following
> point:
>
> If you're using the PUA outside a closed system, you're not using
> Unicode.
>
> The PUA is intended for the internal use of applications (or groups of
> applications), or for interchange between applications by private
> agreement of all parties involved. Writing a document in Microsoft Word
> using some exotic script that doesn't have plain-vanilla behavior
> violates this because Microsoft Word isn't a party to the private
> agreement. You either have to write software yourself that does the
> right thing with your characters (you don't have to rewrite Windows, but
> you might have to rewrite Word, which I agree isn't really any more
> realistic).
>
> Therefore, if you're using the PUA out in the "wild" and expecting free
> interchange, you're not using Unicode anymore; you're using a separate
> encoding _based_ on Unicode. In many respects, it's identical to
> Unicode, but it's a separate encoding because it applies additional
> semantics to code points whose definition Unicode leaves open. It seems
> to me that if you want to ensure that documents that make use of the PUA
> are interpreted properly by, say, someone who downloads them from the
> Web, you have to tag their encoding as something other than Unicode, and
> if you want OS vendors to support particular semantics for PUA code
> points, you have to ask them to support this other encoding that gives
> those code points those semantics.
>
> Of course, if you're going to try to standardize a use of the PUA, it
> seems to make just as much sense to standardize the actual characters in
> Unicode in the normal way. If we have a bunch of different
> Unicode-derived encodings out there, that basically resurrects the
> problem Unicode was designed to solve. But I'm beginning to think this
> is already happening in some places.
>
> Using Plane 14 tag characters to identify particular uses of the PUA
> seems very akin to the old ISO 2022 code-switching scheme, and I
> _really_ don't think we want to go there again.
>
> In any event, imposing semantics on PUA code points in documents out in
> the "wild" isn't a "private use," and therefore documents and
> applications doing this are using an ad-hoc Unicode-derived encoding,
> not Unicode. It should be dealt with as such, rather than trying to
> turn Unicode into ISO 2022.
There will always be scripts that Unicode will not support, either
because they are constructed scripts with no real use, because they
are rare or ancient scripts that lack sufficient attested examples
to determine how the script should be encoded, or because they are
picture fonts. This last we can discount, because the existing
Private Use Area supports them adequately. (The distinction between
the behavior of categories Co and So is so minimal as to be not
worth worrying about.) However, for real scripts that Unicode has
not yet encoded, or never will, there are currently two options for
those who seek to implement them.
1) Live with the limitations of the PUA and accept that your Private
Use script will never be able to do the things that other scripts
take for granted.
2) Mimic your script on the basis of characters already encoded in
an existing character encoding. Traditionally that has been done by
creating a font that claimed to be in a particular legacy encoding,
since fonts were the part of the operating system that offered the
greatest ease of user customization in a form that was relatively
platform independent. (A rough sketch of this hack follows.)
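The mapping and glyph names below are invented for illustration;
the point is that everything underneath the font still sees
ordinary Latin letters:

# Hypothetical sketch of the traditional font hack: the file stores
# ordinary Latin code points, and a custom font (claiming, say, the
# Latin-1 encoding) draws private-script glyphs at those positions.
# The glyph names below are invented.

hacked_cmap = {
    0x55: "private-capital-u",  # drawn at LATIN CAPITAL LETTER U
    0x75: "private-small-u",    # drawn at LATIN SMALL LETTER U
}

text = "Uu"  # what the document actually contains

# Processes unaware of the hack (searching, casing, spell checking)
# see Latin letters; only rendering with the hacked font differs.
print([hex(ord(c)) for c in text])           # ['0x55', '0x75']
print([hacked_cmap[ord(c)] for c in text])   # what the reader sees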
One reason for using the font method was to also benefit from the
keyboard layout already in use, but of late it has become easier to
distribute input methods, and there is a strong trend toward being
able to do so interoperably across platforms. Once that barrier
crumbles, we will see more private use scripts that, instead of
hijacking legacy encodings, will hijack Unicode, unless a mechanism
exists for them to establish their own properties. This mechanism
could be established in one of two ways:
A) The various OSes could provide an easy way to override the
default Private Use characteristics. While the UTC would no doubt
prefer that this solution be adopted, it is extremely unlikely to
work. First of all, some of the properties that Unicode defines,
such as line breaking, are implemented by applications, with
varying levels of OS support. So not only would OSes have to adopt
a mechanism for users to describe how they want the PUA used,
applications would also have to start consulting it, and there
would need to be a way to negotiate which characteristics to use
when the system knows of multiple private agreements that each
define a particular PUA code point differently.
B) Unicode could provide a set of private use characters such as:
E1000;PRIVATE LTR CAPITAL LETTER-1;Lu;0;L;;;;;N;;;;E1001;
E1001;PRIVATE LTR SMALL LETTER-1;Ll;0;L;;;;;N;;;E1000;;E1000
# LineBreak.txt example
E1000;AL # PRIVATE LTR CAPITAL LETTER-1
E1001;AL # PRIVATE LTR SMALL LETTER-1
Now whether U+E1000 gets used for
VERDURIAN CAPITAL LETTER U
or for some other character in private use would remain something
that Unicode would not, and should not, care about in the least.
End users would still be hijacking code points for their
non-standard uses, but at least they would no longer need to hijack
code points that have standard interpretations, such as LATIN
CAPITAL LETTER U. Which would you rather have VERDURIAN CAPITAL
LETTER U mapped to in order to get support for its casing
properties: LATIN CAPITAL LETTER U or PRIVATE LTR CAPITAL LETTER-1?
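To make option B) concrete, here is a minimal Python sketch that
parses the two proposed UnicodeData.txt-style entries above into a
case-mapping overlay and applies it where the system's own tables
know nothing. The overlay mechanism is my illustration, not
anything Unicode or any OS defines:

# Minimal sketch: build a casing overlay from the proposed
# UnicodeData.txt-style entries above. U+E1000 and U+E1001 are
# unassigned; these properties come from the draft proposal, not
# from Unicode itself.

PROPOSED = """\
E1000;PRIVATE LTR CAPITAL LETTER-1;Lu;0;L;;;;;N;;;;E1001;
E1001;PRIVATE LTR SMALL LETTER-1;Ll;0;L;;;;;N;;;E1000;;E1000
"""

to_upper = {}  # code point -> simple uppercase mapping
to_lower = {}  # code point -> simple lowercase mapping
for line in PROPOSED.splitlines():
    fields = line.split(";")
    cp = int(fields[0], 16)
    if fields[12]:                # field 13: Simple_Uppercase_Mapping
        to_upper[cp] = int(fields[12], 16)
    if fields[13]:                # field 14: Simple_Lowercase_Mapping
        to_lower[cp] = int(fields[13], 16)

def pua_upper(s):
    """Uppercase s, consulting the overlay before the built-ins."""
    out = []
    for c in s:
        cp = to_upper.get(ord(c))
        out.append(chr(cp) if cp is not None else c.upper())
    return "".join(out)

print(pua_upper("\U000E1001") == "\U000E1000")  # True

Note that this is exactly the burden option A) imposes as well:
every application that cases, sorts, or breaks lines would have to
consult such an overlay itself.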
Note: Verdurian is one of the scripts in the ConScript Unicode
Registry. That registry does not hijack characters to get the
desired properties supported; it is used here only as an example.
Note: U+E1000 and U+E1001 are currently unassigned code points.
They are, however, assigned the properties given above in the
Private Use proposal I have been working on. That proposal has not
reached even rough draft status, but it looks like it will be
contained in the region U+E0F00 to U+E3FFF, excluding support for
ideographic characters. Ideographic support would add considerably
to the size of the proposal, but such characters can be reasonably
well supported by the large existing PUA blocks, so it is not a
priority.