L2/12-041R
Title: Overriding Default Properties for PUA -- Core Specification Text Issues
Source: Ken Whistler and Editorial Committee
Date: February 6, 2012
Action: For consideration by the UTC
During the process of text editing and cleanup of the Core Specification for
Unicode 6.1, the editorial committee ran across a potential inconsistency
regarding the details of what the standard claims about overriding default
properties for PUA characters. I am bringing the details of the text issues
to the attention of the UTC, for discussion and decision.
Current Text for Unicode 6.1
In the current draft of the Core Specification for Unicode 6.1, the relevant
text in Chapter 3, Section 3.5 states (on p. 72):
Default property values are also provided for private-use characters.
Because the interpretation of private-use characters is subject to private
agreement between the parties which exchange them, the default property
values for those characters are overridable by higher-level protocols, to
match the agreed-upon semantics for the characters. See Section 16.5,
Private-Use Characters.
[Note that this is new text added to the draft of Unicode 6.1, in an attempt to
clarify the issue of default property values and PUA. So this isn't already
published text from Unicode 6.0.]
The relevant text in the current draft of the Core Specification for
Unicode 6.1, Section 16.5 states (on p. 555):
The General_Category value of private-use characters in the Unicode Standard
is Private_Use (gc=Co). This value is normatively defined and cannot be
changed by private agreement. This means that no private agreement can change
which character codes are reserved for private use. However, many Unicode
algorithms use character properties which are derived by reference to the
General_Category property. Private agreements may override such derivations
for private-use characters, except where overriding is expressly disallowed
in the conformance statement for a specific algorithm. In other words,
private agreements may define which private-use characters should be treated
like spaces, digits, letters, punctuation, and so on, by all parties to those
private agreements.
For all properties other than General_Category and the normalization-related
properties, the Unicode Character Database provides default values for
private-use characters. These default property values should be considered
informative...
[That text is also new text proposed for Unicode 6.1. It is not already
published text from Unicode 6.0, which has only a very short statement about
default property values for PUA characters.]
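As an illustration of the default values discussed in the two quoted passages,
the following is a minimal sketch using Python's standard unicodedata module.
The helper name default_properties is hypothetical, and the values reported
come from whatever version of the Unicode Character Database is bundled with
the interpreter, which is not necessarily Unicode 6.1.

```python
import unicodedata


def default_properties(cp):
    """Return a few UCD default property values for one code point.

    Hypothetical helper for illustration; it only reads the defaults
    and knows nothing about any private agreement.
    """
    ch = chr(cp)
    return {
        "gc": unicodedata.category(ch),        # General_Category
        "bc": unicodedata.bidirectional(ch),   # Bidi_Class
        "ccc": unicodedata.combining(ch),      # Canonical_Combining_Class
    }


# First and last code points of the BMP private-use area, plus one code
# point from each supplementary private-use area.
for cp in (0xE000, 0xF8FF, 0xF0000, 0x10FFFD):
    print(f"U+{cp:04X}", default_properties(cp))
```

Each of these code points should report gc=Co, the default value whose
overridability is at issue in this document.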
Text Issues
The basic text issue raised in the editorial committee is that the draft text in
Section 3.5 is not entirely consistent with the more extensive statement in
Section 16.5. The inconsistency concerns General_Category in particular.
A secondary issue was also raised about the text in Section 16.5: The claim
is that the sentence, "This value is normatively defined and cannot be changed
by private agreement." confuses normativity with overridability. Personally, I
disagree with that assessment, but see how the sentence might be read that
way, so concur that it could use an editorial improvement.
Because the basic text issue concerns the Conformance text of Chapter 3,
because the UTC has in principle already reviewed and agreed upon the text of
Chapter 3 for Unicode 6.1, and because the issue is inherently a little
tricky, the editorial committee deemed it advisable to bring the issue to the
attention of the UTC for discussion and resolution.
Suggested Text Changes
To make the discussion a little easier, I'll provide textual emendation
suggestions here, which I think may address the problems noted.
First, for Section 3.5, the main issue is that General_Category default values
are not all overridable by private agreement, and there are important
caveats spelled out in more detail in Section 16.5. My suggested emendation
of the text, then, would be:
Default property values are also provided for private-use characters.
Because the interpretation of private-use characters is subject to private
agreement between the parties which exchange them, most default property
values for those characters are overridable by higher-level protocols, to
match the agreed-upon semantics for the characters. There are important
exceptions for a few properties. See Section 16.5, Private-Use Characters.
Then a textual correction for the problematic current text in Section 16.5
could be:
No private agreement can change which character codes are reserved for
private use. However, many Unicode algorithms use the General_Category
property or properties which are derived by reference to the General_Category
property. Private agreements may override the General_Category or
derivations based on it, except where overriding is expressly disallowed in
the conformance statement for a specific algorithm. In other words,
private agreements may define which private-use characters should be
treated like spaces, digits, letters, punctuation, and so on, by all
parties to those private agreements. In particular, when a private agreement
overrides the General_Category of a private-use character from the default
value of gc=Co to some other value such as gc=Lu or gc=Nd, such a change
does not change its inherent identity as a private-use character, but
merely specifies its intended behavior according to the private agreement.
For all other properties, the Unicode Character Database also provides
default values for private-use characters. Except for normalization-related
properties, these default property values should be considered informative...
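To make the intent of the suggested Section 16.5 wording concrete, here is a
minimal sketch of how a higher-level protocol might layer General_Category
overrides on top of the UCD defaults. The class PrivateAgreement and the
function is_private_use are hypothetical names, not part of any Unicode or
Python API. The sketch assumes that private-use identity is decided purely by
code point range, and it does not model algorithms whose conformance
statements expressly disallow overriding.

```python
import unicodedata

# Private-use ranges fixed by the Unicode Standard. No private agreement
# can add to or remove from these ranges.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_private_use(cp):
    """Identity test based on code point range only, never on overrides."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


class PrivateAgreement:
    """Hypothetical overlay of General_Category values agreed between parties."""

    def __init__(self, gc_overrides):
        # Assumption of this sketch: overrides are accepted only for
        # private-use code points.
        for cp in gc_overrides:
            if not is_private_use(cp):
                raise ValueError(f"U+{cp:04X} is not a private-use code point")
        self._gc = dict(gc_overrides)

    def general_category(self, cp):
        """Return the agreed value if one exists, else the UCD default."""
        return self._gc.get(cp, unicodedata.category(chr(cp)))


# Example agreement: U+E000 behaves as an uppercase letter, U+E001 as a digit.
agreement = PrivateAgreement({0xE000: "Lu", 0xE001: "Nd"})
for cp in (0xE000, 0xE001, 0xE002):
    print(f"U+{cp:04X}", agreement.general_category(cp), is_private_use(cp))
```

Keeping is_private_use independent of the overlay mirrors the distinction
drawn in the suggested text: the override specifies intended behavior under
the agreement, while the character's identity as private-use is unchanged.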
Questions
Do those minimal text changes correctly reflect the intention of the UTC
regarding this issue of overriding default property values for PUA characters?
If so, should the editorial committee proceed with those text changes for the
Unicode 6.1 Core Specification text?
Are there other textual suggestions for how to solve these issues in a
different way?