Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii

From: John Armstrong <john.armstrong.rn3_at_gmail.com>
Date: Tue, 9 Sep 2014 10:02:26 -0400

[Apologies if this issue has already been resolved. I searched the
Unicode.org site for discussions but found only one document, dating from
2003, which touches on the issue: andrewcwest_at_alumni.princeton.edu RE: Unicode
4.0.1 Beta Review 1. kRSUnicode Field (
http://www.unicode.org/L2/L2003/03311-errata4.txt)]

A CJK Han character is conventionally viewed as consisting of a radical
plus a residual part or "phonetic". (For a character which is a radical
the residual part is nothing. The term "phonetic", indicating that the
residual part of the character points to the pronunciation of the character,
properly only applies to 90-95% of characters, but it applies in the
examples below.)

The two parts of a character each consist of a specific arrangement of strokes,
and together account for all the strokes in the character. In particular,
the number of strokes in the radical portion plus the number of strokes in
the residual portion equals the total number of strokes in the character.
The stroke count of a radical combined with a residual part is not always
the same as the stroke count of the radical appearing on its own, but may
be slightly or significantly less due to a minor or major abbreviation. (A
radical may have several forms which are used in different positions of the
whole character, say left or right side vs. top or bottom. These variants
may have the same or different stroke counts.)

Because of abbreviated variants the total stroke count for a character
cannot always be obtained by adding the stroke count of the radical in its
standalone form to the stroke count of the residual portion. However, the
stroke count of the radical as it appears in the character can always be
obtained by subtracting the stroke count of the residual portion from the
total stroke count of the character. The Unihan database provides the exact
data needed to make this calculation:

kTotalStrokes: stroke count for full character

kRSUnicode: radical number and residual stroke count, in the format
<rad_num>['].<res_strokes>, where the optional ' (apostrophe) indicates a
widely used abbreviation of the radical with a significantly different
appearance and a significantly lower stroke count (lower by 3 or more). (But
not all such forms are so marked - examples are forms with radical numbers
140, 162, 163 and 170. It may be that the marker is limited to abbreviations
used in Simplified as opposed to Traditional Chinese characters.)

The formula is simply:

radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes
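
A minimal Python sketch of this calculation follows. The names parse_rs and
implied_radical_strokes are illustrative only, and the pattern assumes the
simple <rad_num>['].<res_strokes> format described above; any other variants
of the field are not handled.

```python
import re

# One kRSUnicode value: radical number, optional apostrophe, a dot, and
# the residual stroke count, e.g. "7.5" or "120'.3".
# Assumption: only the simple format described in the post is supported.
_RS_PATTERN = re.compile(r"^(\d+)('?)\.(\d+)$")


def parse_rs(value):
    """Return (radical_number, has_apostrophe, residual_strokes)."""
    first = value.split()[0]  # if several values are listed, use the first
    match = _RS_PATTERN.match(first)
    if match is None:
        raise ValueError("unrecognized kRSUnicode value: %r" % value)
    return int(match.group(1)), match.group(2) == "'", int(match.group(3))


def implied_radical_strokes(total_strokes, rs_value):
    """radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes"""
    _, _, residual = parse_rs(rs_value)
    return total_strokes - residual


print(implied_radical_strokes(8, "7.5"))   # 3
print(implied_radical_strokes(10, "9.9"))  # 1
```

The two calls use the kTotalStrokes and kRSUnicode values quoted in this
post for U+4E9B and U+5040.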

This formula generally gives correct results, but not always. In fact,
according to a reasonably accurate heuristic test I ran, it produces incorrect
(or at least "suspicious") results in 2236 of the total 74911, or 3%, of
characters in the database that have both kTotalStrokes and kRSUnicode
data. Moreover the rate is significantly higher for the characters in the
BMP than in the SIP - in fact it is really negligible in the latter. Most
importantly it is 8.2% in the block containing all the most widely used
characters, the base CJK Unified Ideographs block. The numbers for all
the blocks are as follows:

RANGE    TOTAL*   SUSPICIOUS    PCT

BMP
BASE      20941         1727    8.2
CMP         302           29    9.6
CMPS          4            0    0.0
EXTA       6582          469    7.1

SIP
EXTB      42711            6    0.0
EXTC       4149            5    0.1
EXTD        222            0    0.0

TOTAL     74911         2236    3.0

*with both kTotalStrokes and kRSUnicode
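
The percentages and totals in the table can be re-derived from the raw
counts. The short Python sketch below does only that arithmetic; the counts
are copied from the table, and the names ROWS and pct are illustrative.

```python
# (range, characters with both fields, suspicious characters),
# copied from the table above.
ROWS = [
    ("BASE", 20941, 1727),
    ("CMP", 302, 29),
    ("CMPS", 4, 0),
    ("EXTA", 6582, 469),
    ("EXTB", 42711, 6),
    ("EXTC", 4149, 5),
    ("EXTD", 222, 0),
]


def pct(suspicious, total):
    """Suspicious share as a percentage, rounded to one decimal place."""
    return round(100.0 * suspicious / total, 1)


for name, total, suspicious in ROWS:
    print("%-5s %6d %5d %5.1f" % (name, total, suspicious, pct(suspicious, total)))

grand_total = sum(row[1] for row in ROWS)
grand_suspicious = sum(row[2] for row in ROWS)
print("%-5s %6d %5d %5.1f" % ("TOTAL", grand_total, grand_suspicious,
                              pct(grand_suspicious, grand_total)))
```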

Some of the suspicious cases are actually valid, but I believe that the vast
majority are truly incorrect, and that the rate of incorrect radical stroke
counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the
base CJK Unified Ideographs block.

Here are a couple of examples where the stroke counts are fairly small and
the radicals and the residual parts ("phonetics") are widely occurring. The
first illustrates the situation where the radical stroke count implied by
kTotalStrokes and kRSUnicode is greater than the correct value, and the
second that where the implied radical stroke count is less than the correct
value. (The second situation is much more common than the first,
accounting for at least 80% of the "suspicious" items.)

Example 1: character U+4E9B 'a few'

kTotalStrokes = 8
kRSUnicode = 7.5
radical number = 7 'two'
residual strokes = 5
implied radical stroke count = 3 (8 - 5)
correct radical stroke count = 2
diff = 1 (implied count one too high)

The residual portion of the character occurs as an independent character
U+6B64 'this, these'. Its kTotalStrokes is 6 and its kRSUnicode = 77.2.
The radical #77 'stop' has 4 strokes in its standalone form, so the
residual stroke count of 2 is consistent with a total count of 6. In the
main character U+4E9B, therefore, the residual part has effectively lost a
stroke in composition, being reduced from 6 to 5.

(This actually seems to be the norm with this phonetic. Other examples
are U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725 and, I'm
sure, more.)

Example 2: character U+5040 'distinguished person; English person'

kTotalStrokes = 10
kRSUnicode = 9.9
radical number = 9 'person'
residual strokes = 9
implied radical stroke count = 1 (10 - 9)
correct radical stroke count = 2
diff = -1 (implied count one too low)

Again the residual portion occurs as an independent character U+82F1
'distinguished; English'. Its kTotalStrokes is 8 and its kRSUnicode is
140.5. Radical #140 'grass' has 6 strokes in its standalone form but as
the radical component of a larger character is always abbreviated to a form
with 3 strokes. That is the case here. Thus the residual count of 5 in the
kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the
character. This count of 8 agrees with the residual count for the full
character U+5040 implied by its kTotalStrokes of 10 (10 - 2 = 8), but is one
less than the 9 residual strokes specified in the kRSUnicode.
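
The cross-check used in both examples (comparing the residual count of the
compound character with the total stroke count of its phonetic as a
standalone character) can be sketched in Python as follows. This is only an
illustration built from the four entries quoted above, not the heuristic test
mentioned earlier; the names UNIHAN_SAMPLE and residual_gap are made up for
the sketch.

```python
# Values quoted in the two examples above.
UNIHAN_SAMPLE = {
    "U+4E9B": {"kTotalStrokes": 8, "kRSUnicode": "7.5"},
    "U+6B64": {"kTotalStrokes": 6, "kRSUnicode": "77.2"},
    "U+5040": {"kTotalStrokes": 10, "kRSUnicode": "9.9"},
    "U+82F1": {"kTotalStrokes": 8, "kRSUnicode": "140.5"},
}


def residual_gap(compound, phonetic, data=UNIHAN_SAMPLE):
    """kRSUnicode residual of the compound minus kTotalStrokes of its phonetic.

    Zero means the two fields agree; a non-zero value means the residual
    part is counted differently in the two places.
    """
    residual = int(data[compound]["kRSUnicode"].split(".")[1])
    return residual - data[phonetic]["kTotalStrokes"]


print(residual_gap("U+4E9B", "U+6B64"))  # -1
print(residual_gap("U+5040", "U+82F1"))  # 1
```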

In both examples the discrepancy between kTotalStrokes and kRSUnicode arises
out of different residual stroke counts and has nothing to do with the
radical, be it its identity, the variant used, or the stroke count. While
there are some exceptions, this is clearly the normal situation. It also
makes sense. Most disagreements on stroke counts have to do with the
residual as opposed to the radical portion of the characters. (Questions of
radical counts usually involve cases where the radical has more than one
form in a given context, for example rad #140 'grass', which has 6 strokes
in its full form but variants in the top position context with 3 and 4
strokes. Less commonly, they involve cases where the radical is fused with
the residual portion or even lost altogether as part of a historical
simplification.)

As mentioned above, discrepancies of the type illustrated by the second
example (implied radical strokes lower than correct) are much more common
than discrepancies of the type illustrated by the first example (implied
radical strokes higher than correct). To the extent the discrepancies involve
the residual stroke counts and have nothing to do with the radical, the
situation can be reframed in terms of residual stroke counts as:

Dominant pattern: the residual stroke count specified in kRSUnicode is
greater than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 2)

Minor pattern: the residual stroke count specified in kRSUnicode is less
than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 1)
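
Stated as code, the two patterns amount to a comparison of two residual
counts. The Python sketch below assumes the correct stroke count of the
radical as it appears in the character is already known (2 in both
examples); the function name classify and its return labels are illustrative.

```python
def classify(total_strokes, rs_residual, correct_radical_strokes):
    """Compare the kRSUnicode residual with the one implied by kTotalStrokes."""
    implied_residual = total_strokes - correct_radical_strokes
    if rs_residual > implied_residual:
        return "dominant"    # kRSUnicode residual is the larger one
    if rs_residual < implied_residual:
        return "minor"       # kRSUnicode residual is the smaller one
    return "consistent"


print(classify(8, 5, 2))   # Ex. 1, U+4E9B: minor
print(classify(10, 9, 2))  # Ex. 2, U+5040: dominant
```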

The results of the heuristic test indicate that the great majority of cases
of both patterns involve differences in residual stroke counts of one or
occasionally two strokes. I believe this is in line with the variations in
stroke counting that are observed in actual practice (dictionaries etc.).
Still, the question needs to be asked, do the discrepancies (which occur in
about 8% of all characters in the base CJK Unified Ideographs block) simply
represent different, but more or less equally valid, ways of counting
strokes, or are they errors that need to be corrected or at least addressed
in some way?

In my view the answer depends on a more specific question: are
kTotalStrokes and kRSUnicode intended to be consistent? That is,
regardless of what exact count is chosen for a given character, should both
terms reflect the same count?

Here is how the two fields are described in the document Proposed Update to
Unicode Standard Annex #38 Unicode 6.0.0 draft 1 (
http://www.unicode.org/reports/tr38/tr38-8.html):

kTotalStrokes:

"The total number of strokes in the character (including the radical).
_This value is for the character as drawn in the Unicode charts_."

kRSUnicode:

"A standard radical/stroke count for this character in the form
"radical.additional strokes". The radical is indicated by a number in the
range (1..214) inclusive. An apostrophe (') after the radical indicates a
simplified version of the given radical. The "additional strokes" value is
the residual stroke-count, the count of all strokes remaining after
eliminating all strokes associated with the radical.

This field is also used for additional radical-stroke indices where either
a character may be reasonably classified under more than one radical, or
alternate stroke count algorithms may provide different stroke counts.

_The first value is intended to reflect the same radical as the kRSKangXi
field and the stroke count of the glyph used to print the character within
the Unicode Standard_."

When I talk about kRSUnicode I always mean the first value in the list.
Similarly my heuristic test always uses the first value. I mention this
because of the way the last paragraph of the description refers
specifically to this value.
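
For concreteness, here is a small Python sketch of taking the first value
when reading the data. It assumes the usual Unihan text layout of
tab-separated lines (code point, field name, value), with '#' comment lines
and multiple values separated by spaces. The function name read_fields is
illustrative, and the U+F0000 line is made-up sample data that only shows a
multi-valued field.

```python
def read_fields(lines, wanted=("kTotalStrokes", "kRSUnicode")):
    """Collect the first value of each wanted field, keyed by code point."""
    data = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        if field in wanted:
            # Keep only the first value when several are listed.
            data.setdefault(code_point, {})[field] = value.split()[0]
    return data


SAMPLE = [
    "# sample lines in Unihan layout",
    "U+4E9B\tkRSUnicode\t7.5",
    "U+4E9B\tkTotalStrokes\t8",
    "U+F0000\tkRSUnicode\t120'.3 120.3",  # made-up entry with two values
]

print(read_fields(SAMPLE))
```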

Both descriptions tie the specific values of the two fields to the specific
glyphs used to draw/print the character in the Unicode charts
(kTotalStrokes "character as drawn in the Unicode charts", kRSUnicode "the
glyph used to print the character within the Unicode Standard"). Given
this, the answer to the question of whether the two fields should be
consistent certainly seems to be yes. And this means that the cases where
they are not, i.e. where there are discrepancies, are errors.

If it's conceded that the discrepancies do reflect errors, then I think it
also needs to be conceded that they need to be addressed in some way. The
most straightforward thing would be to go through all the cases and change
either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer
values appropriate to the specific glyph used in the standard.

Given that kRSUnicode is used for ordering characters in the block (the
radical number being used to determine what radical it is listed after and
the residual count being used to determine where after the radical it
appears - except for ties, which are ordered arbitrarily), while to the
best of my knowledge kTotalStrokes is not used for anything within the
standard, the most practical thing would be to keep the existing kRSUnicode
value wherever it is not obviously incorrect and adjust the kTotalStrokes
to be consistent with it.

But this involves changing a lot of data (including data for the most
widely used characters, those in the base CJK Unified Ideographs block),
and may break systems that use the existing values.

An alternative which I would suggest is to create a new field which could
be called kRSUnicode2 or something similar and would have not two but three
subfields (not counting the apostrophe):

<rad_num>['].<rad_strokes>.<res_strokes>

where the first and third subfields are the same (same meaning, same
values, barring clear errors) as in kRSUnicode and the added second
subfield is the number of strokes in the radical as it appears in the
character.

This new field would contain all the stroke count information that's needed
for a character, including not only the residual strokes but also the
radical strokes and, via calculation (adding the two values), the total
strokes. The last can be compared with kTotalStrokes, but does not depend
on it, and may be different.
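
A sketch of how such a value could be read, in Python. kRSUnicode2 is only
the field proposed here, not an existing Unihan field, and the sample values
"7.2.5" and "9.2.9" are hypothetical illustrations for U+4E9B and U+5040
(existing residual count kept, radical count set to 2). The names RS2 and
parse_rs2 are illustrative.

```python
from typing import NamedTuple


class RS2(NamedTuple):
    radical: int
    simplified: bool        # True if the apostrophe is present
    radical_strokes: int
    residual_strokes: int

    @property
    def total_strokes(self):
        # The total is derived from the two parts, not stored separately.
        return self.radical_strokes + self.residual_strokes


def parse_rs2(value):
    """Parse <rad_num>['].<rad_strokes>.<res_strokes>."""
    rad, rad_strokes, res_strokes = value.split(".")
    return RS2(int(rad.rstrip("'")), rad.endswith("'"),
               int(rad_strokes), int(res_strokes))


print(parse_rs2("7.2.5").total_strokes)  # 7, vs. kTotalStrokes = 8
print(parse_rs2("9.2.9").total_strokes)  # 11, vs. kTotalStrokes = 10
```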

(Note that the presence of the apostrophe would become largely predictable
from a comparison of the radical stroke count in the second subfield with the
count for the radical as a standalone character. In fact it would only be
necessary to retain it if its purpose was not simply to indicate
significantly abbreviated radicals in general but specifically to indicate
forms that are used in Chinese Simplified but not the corresponding
Traditional ones.)

I see the following advantages to this approach:

(1) No constraints are placed on existing kTotalStrokes or kRSUnicode
values - they can be left as is or changed at any point without
implications for the new kRSUnicode2 values

(2) No systems that use the existing kTotalStrokes or kRSUnicode fields
will break or be affected in any way (though they could be changed to use
the self-standing kRSUnicode2 field with possibly more satisfactory results)

(3) All stroke information for a character is contained in a single field,
kRSUnicode2, and can't be inconsistent (though it can be wrong)

(4) Stroke counting differences between fields can be directly found and
quantified (particularly, by comparing the partial stroke information in
kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2)

(5) An initial version of the full set of the new kRSUnicode2 field values
could be generated algorithmically from kTotalStrokes and kRSUnicode and
then revised by human inspection focusing on the proportionally small
amount (8% in the base block, 3% overall) of "suspicious" cases detected
by a heuristic procedure (which I'm sure could be made more accurate than
the one I used, for example by bringing in more existing information
sources)
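
The generation step in (5) might look roughly like the Python sketch below.
It is not the heuristic test described earlier. The flagging rule (an
implied radical count that matches no known form of the radical) and the
small KNOWN_RADICAL_FORMS table are assumptions for illustration, filled in
only for the radicals discussed in this post.

```python
# Stroke counts of known forms of a radical. Assumption: a tiny hand-made
# table covering only radicals 7, 9, 77 and 140 (3, 4 or 6 strokes for
# 'grass', as noted above).
KNOWN_RADICAL_FORMS = {7: {2}, 9: {2}, 77: {4}, 140: {3, 4, 6}}


def draft_rs2(total_strokes, rs_value):
    """Return (draft kRSUnicode2 value, suspicious flag)."""
    rad_part, res_part = rs_value.split()[0].split(".")
    mark = "'" if rad_part.endswith("'") else ""
    rad_num = int(rad_part.rstrip("'"))
    residual = int(res_part)
    rad_strokes = total_strokes - residual  # implied radical stroke count
    allowed = KNOWN_RADICAL_FORMS.get(rad_num)
    # Flag for human review when the implied count fits no known form.
    suspicious = allowed is not None and rad_strokes not in allowed
    return "%d%s.%d.%d" % (rad_num, mark, rad_strokes, residual), suspicious


print(draft_rs2(8, "7.5"))    # ('7.3.5', True)
print(draft_rs2(10, "9.9"))   # ('9.1.9', True)
print(draft_rs2(8, "140.5"))  # ('140.3.5', False)
print(draft_rs2(6, "77.2"))   # ('77.4.2', False)
```

Radicals missing from the table are never flagged in this sketch, so a real
run would need a complete table of radical forms.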

The main disadvantages I see are:

(1) Confusion arising from the overlap between the old and new fields

(2) The work involved (though anything other than dismissing or postponing
the issue is going to involve work)

If there is interest I will be glad to share the results of my heuristic
test and the program (Python) I used to produce them.

John Armstrong
Cambridge MA
