1. What happened?
2. Proposals
In L2/03-043, Mark Davis observes a discrepancy between the Decimal_Digit_Number property (defined as characters having the Nd general category), and the Decimal_Digit property (defined as the characters having a non-empty value in field 6 of UnicodeData.txt).
It turns out that the problem is slightly more complicated.
In a nutshell, we managed to develop two definitions for the decimal digit property:
Unicode Version | Definition 1 | Definition 2 |
3.0 | Table 4-6 | |
3.1, 3.2 | Table 4-6 | DerivedProperties.html |
4.0 | | DerivedProperties.html |
The definition in Table 4-6 does not give the decimal digit property to the superscripts and subscripts, while the definition in DerivedProperties.html does give that property to those characters. The discrepancy is limited to the superscript and subscript characters (U+00B2, U+00B3, U+00B9, U+2070, U+2074..U+2079, U+2080 .. U+2089).
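To make the affected set concrete, the following Python sketch enumerates the code points listed above and asks the interpreter's bundled `unicodedata` module for their decimal and digit values. One assumption matters here: that module reflects whatever UCD version ships with the interpreter, which is later than the versions discussed in this document, so the output shows the data as it stands there and not the 3.x data. `DISCREPANCY` and `describe` are names chosen for this illustration.

```python
import unicodedata

# Code points named in the text: U+00B2, U+00B3, U+00B9, U+2070,
# U+2074..U+2079 and U+2080..U+2089.
DISCREPANCY = (
    [0x00B2, 0x00B3, 0x00B9, 0x2070]
    + list(range(0x2074, 0x207A))
    + list(range(0x2080, 0x208A))
)


def describe(cp):
    """Return (label, name, decimal value or None, digit value or None)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.decimal(ch, None),  # None when no decimal value is recorded
        unicodedata.digit(ch, None),    # None when no digit value is recorded
    )


if __name__ == "__main__":
    for row in map(describe, DISCREPANCY):
        print(row)
```

The third and fourth elements of each row roughly correspond to fields 6 and 7 of UnicodeData.txt, as exposed by the interpreter's database.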
The following subsections explain this in more detail.
Section 4.6, page 89, states:
Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, not characters such as Roman numerals ([...]), subscripts, or superscripts. [...]
Table 4-6 provides the lists of characters having the decimal digit property, and does not include the superscripts and subscripts.
UnicodeData.html states for field 6, Decimal digit value:
This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field.
And for field 7, Digit value:
Digit value. This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits.
PropList.txt lists under Decimal Digit the same set of characters as Table 4-6.
So far, everything is consistent, and the superscripts and subscripts do not have the Decimal Digit property.
The only potential problem is that in UnicodeData.txt, the superscripts and subscripts have a non-empty value in field 6 (and those are the only such characters without the Decimal Digit property). However, this can be accommodated by interpreting the description of field 6 literally, and in particular by reading “if the character does not have the decimal digit property, the value of this field is undefined”.
Unicode 3.1 (via UAX 27) does not mention any relevant change to chapter 4.
UnicodeData.html gives the same description of fields 6 and 7 in UnicodeData.txt.
PropList.txt was extensively changed, and DerivedNumericType.txt was introduced. UAX 27 describes those changes as affecting the form of the UCD but not as affecting the content of the UCD (at least for the existing characters).
Thus, one would be justified in believing that the superscripts and subscripts still do not have the Decimal Digit property, and in fact, that none of the new characters in 3.1 have that property (since Table 4-6, which is defining the property, was not modified).
However, DerivedProperties.html provides a contradicting definition of Decimal Digits: here, the characters having that property are defined as those having non-empty fields 6, 7 and 8 in UnicodeData.txt. Under that definition, the superscripts and subscripts have the Decimal Digit property, as well as the new characters U+1D7CE .. U+1D7FF MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE. DerivedNumericType.txt is built following that definition.
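The second definition is mechanical enough to sketch in a few lines of Python. The records below are illustrative only: they follow the semicolon-separated layout of UnicodeData.txt, and the superscript record is written with a non-empty field 6 to match the 3.x situation described in this document; they are not copied from any particular release. The names `numeric_fields`, `has_decimal_digit_def2` and `SAMPLES` are hypothetical.

```python
def numeric_fields(record):
    """Split a UnicodeData.txt-style record and return fields 6, 7 and 8."""
    fields = record.split(";")
    return fields[6], fields[7], fields[8]


def has_decimal_digit_def2(record):
    """Second definition: fields 6, 7 and 8 are all non-empty."""
    return all(numeric_fields(record))


# Illustrative records (assumed, modeled on the 3.x state described in the text).
SAMPLES = {
    "DIGIT TWO": "0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;",
    "SUPERSCRIPT TWO": (
        "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;2;2;2;N;"
        "SUPERSCRIPT DIGIT TWO;;;;"
    ),
    "ROMAN NUMERAL TWO": "2161;ROMAN NUMERAL TWO;Nl;0;L;<compat> 0049 0049;;;2;N;;;;;",
}

if __name__ == "__main__":
    for name, record in SAMPLES.items():
        print(name, numeric_fields(record), has_decimal_digit_def2(record))
```

With these sample records, the superscript is classified exactly like the ASCII digit, which is the behavior that conflicts with Table 4-6.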
The bottom line is that Unicode 3.1 is ill-formed: we have two non-equal definitions for Decimal Digit.
The situation in Unicode 3.2 is fundamentally the same as in Unicode 3.1.
The main change (so far) in Unicode 4.0 is in Chapter 4. The October 2, 2002 draft uses the same words to describe Decimal Digit, but removes Table 4-6 altogether.
UCD.html, which replaces UnicodeData.html, still describes field 6 as:
If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field.
However, we no longer have a definition in Chapter 4!
UCD.html, which also replaces DerivedProperties.html, still provides the alternative definition “non-empty fields 6, 7, and 8 in UnicodeData.txt”.
The conflict between the two definitions is now “resolved”, but at the cost of a change of the status of the superscripts, subscripts and mathematical digits.
The first proposal is to restore the 3.0 situation concerning superscripts and subscripts for the Decimal_Digit_Number property. Given our current definition of the property, this is best achieved by making field 6 of UnicodeData.txt empty for those characters.
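As a rough sketch of what this proposal amounts to at the data level, the following Python function empties field 6 for the superscript and subscript code points and leaves fields 7 and 8 alone. The input record is illustrative (written with a non-empty field 6, as the text describes for 3.x), and `TARGETS`, `apply_first_proposal` and `has_decimal_digit` are hypothetical names, not part of any UCD tooling.

```python
# Superscript and subscript code points named earlier in this document.
TARGETS = (
    {0x00B2, 0x00B3, 0x00B9, 0x2070}
    | set(range(0x2074, 0x207A))
    | set(range(0x2080, 0x208A))
)


def apply_first_proposal(record):
    """Empty field 6 if the record describes one of the target characters."""
    fields = record.split(";")
    if int(fields[0], 16) in TARGETS:
        fields[6] = ""
    return ";".join(fields)


def has_decimal_digit(record):
    """Field 6 non-empty, the definition of the property used in the text."""
    return record.split(";")[6] != ""


# Illustrative 3.x-style record (assumed, not copied from a release).
BEFORE = "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;2;2;2;N;SUPERSCRIPT DIGIT TWO;;;;"
AFTER = apply_first_proposal(BEFORE)

if __name__ == "__main__":
    print(has_decimal_digit(BEFORE), has_decimal_digit(AFTER))
    print(AFTER)
```

The digit value (field 7) and numeric value (field 8) survive the change, so the characters would still be recognizable as digits that do not form decimal-radix numbers.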
While researching this problem, I discovered that the characters U+1D7CE .. U+1D7FF MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE have the decimal digit property.
It seems to me that this choice is less than optimal. For example, a likely use of U+1D7CE MATHEMATICAL BOLD DIGIT ZERO is to denote the neutral element of a binary function that forms a group, denoted by the infix “+”, rather than the numeric value 0 itself. In general, those characters act more as symbols than digits, much like the MATHEMATICAL LETTERS act more like symbols than like letters (and for example have no case mapping).
The proposal is to remove the decimal digit property from U+1D7CE .. U+1D7FF.
Decimal Digit has no impact on U+06DD ARABIC END OF AYAH, nor on the year and number signs: all those have their scope defined by general category Nd.
However, the scope of FRACTION SLASH is defined by “The standard form of a fraction built using fraction slash is defined as follows: Any sequence of one or more decimal digits, followed by the fraction slash, followed by any sequence of one or more decimal digits.” Here “decimal digit” is a little bit ambiguous: it is not clearly “those characters with the Decimal Digit property”, but there is no other good interpretation.
The proposal is to define the scope of U+2044 as “adjacent Nd” instead, to match the Arabic characters. After all SUBSCRIPT ONE, FRACTION SLASH, SUPERSCRIPT TWO is hardly interesting.
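The “adjacent Nd” reading can be illustrated with a small Python sketch. It relies on the documented behavior of Python's `re` module, where `\d` in a `str` pattern matches characters of general category Nd, as recorded in the interpreter's own Unicode database (a later version than those discussed here). `FRACTION` and `is_standard_fraction` are names invented for this example.

```python
import re

FRACTION_SLASH = "\u2044"

# One or more Nd characters, the fraction slash, one or more Nd characters.
FRACTION = re.compile(r"\d+" + FRACTION_SLASH + r"\d+")


def is_standard_fraction(text):
    """True if the whole string is Nd+ FRACTION SLASH Nd+."""
    return FRACTION.fullmatch(text) is not None


if __name__ == "__main__":
    candidates = [
        "1\u20442",            # DIGIT ONE, FRACTION SLASH, DIGIT TWO
        "\u0661\u2044\u0662",  # Arabic-Indic digits around the fraction slash
        "\u2081\u2044\u00b2",  # SUBSCRIPT ONE, FRACTION SLASH, SUPERSCRIPT TWO
    ]
    for text in candidates:
        print([f"U+{ord(c):04X}" for c in text], is_standard_fraction(text))
```

Under this reading the subscript/superscript sequence is simply not a fraction, because neither neighbor of the fraction slash has general category Nd.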
Back to the original observation, we still need to decide if we want the properties Decimal_Digit_Number (i.e. general category Nd) and Decimal_Digit (i.e. field 6 non-empty) to be one and the same, or to happen to cover the same characters in 4.0, or to happen to cover different characters in 4.0. With the first proposal above, the difference is removed for the superscripts and subscripts. With the second proposal, we would introduce a difference for the mathematical digits; we could compensate for it by changing the general category of those characters to No.
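Whether the two properties cover the same characters in a given data set can be checked by brute force. The Python sketch below does so against the interpreter's bundled `unicodedata` module, under the assumption that `unicodedata.decimal` reflects field 6 and `unicodedata.category` reflects the general category; the result describes that bundled UCD version only, not the 3.x or 4.0 drafts. `mismatches` is a hypothetical helper name.

```python
import sys
import unicodedata


def mismatches():
    """Code points where 'category is Nd' and 'has a decimal value' disagree."""
    out = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        has_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != has_decimal:
            out.append(cp)
    return out


if __name__ == "__main__":
    found = mismatches()
    print(len(found), [f"U+{cp:04X}" for cp in found[:20]])
```

An empty result would mean the two properties happen to coincide in that data set; it would not by itself say whether they are defined to be one and the same.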
The proposal is to make the mathematical digits No, to describe Numeric_Type/Numeric_Value in UCD.html like this:
(6) if the character has the general category Nd, the value of that digit is represented with an integer value in this field.
and to define DerivedNumericType in UCD.html like this:
The property value is based on the general category of the character and the contents of UnicodeData.txt, fields 6 through 8:
Author: Eric Muller
Revision | Date | Comments |
March 4, 2003 | First version |