Accumulated Feedback on PRI #421

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Sun Aug 9 01:10:36 CDT 2020
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: UNIHAN proposed update feedback

1) On the description of many fields, many book names are not italicized:
For instance, in the kCihait field description, the name of the book "Cihai" is not italicized.

  Field           |     Book name not italicized 
------------------------------------------
kCihait           |   Cihai
kDaeJaweon        |   Dae Jaweon
kGSR              |   Grammata Serica Recensa
kHanYu            |   Hanyu Da Zidian
kIRGDaeJaweon     |   Dae Jaweon
kIRGDaiKanwaZiten |   Dai Kanwa Ziten
kIRGHanyuDaZidian |   Hanyu Da Zidian
kMorohashi        |   Dai Kanwa Ziten
kNelson           |   The Modern Reader’s Japanese-English Character Dictionary
kSBGY             |   Song Ben Guang Yun 
                      (the exact spelling of the name must be reviewed)
kTGHZ2013         |   Tōngyòng Guīfàn Hànzì Zìdiǎn
kXHC1983          |   Xiàndài Hànyǔ Cídiǎn

2) It must be stated somewhere that the fields whose names start with "GB" correspond to 
the "Guobiao standards" of Mainland China (preferably in the corresponding field descriptions).

3) The kGB7 field description does not make clear that its source is made up of two lists rather 
than one. Some minor edits should clear it up:

     Old: The "General Purpose Hanzi List for Modern Chinese Language, 
   and General List of Simplified Hanzi" mapping for this character in ku/ten form.
     New: The "General Purpose Hanzi List for Modern Chinese Language," and the
   "General List of Simplified Hanzi" mapping for this character in ku/ten form.

4) The kTang field contains an anomaly in the description:

   ".... An asterisk indicates that the word or morpheme represented in toto 
   or in part by the given character with the given reading occurs more than 
   four times ..."

The word "toto" seems to be a mistake for the word "full".

Feedback above this line was reviewed during UTC #165.

Date/Time: Tue May 25 20:07:51 CDT 2021
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: A few improvements to the field descriptions of UAX#38


1. Expand the description of kCCCII: Extra information is needed, like the
meaning of the initials, the creators and the age. I propose for it to
read:

  Description | The mapping for this character in hexadecimal, in
  the "Chinese Character Code for Information Interchange" (CCCII). Created
  by the "Chinese Character Analysis Group" (CCAG), with its latest version
  coming out in 1987. Earlier versions of CCCII served as the base for the
  EACC (see kEACC) so many entries are identical between fields. 

The terms between quotes indicate the use of italics. My source is the
Wikipedia article
(https://en.wikipedia.org/wiki/Chinese_Character_Code_for_Information_Interchange)
which in turn cites the book 'CJKV Information Processing'. I lack access
to that book, so I do not know the primary source, but if it can be found, it
would be important to cite. The code scheme may have its origin in Taiwan
(ROC).

Finally, in the field description of kEACC, the complete name (East Asian
Character Code for Bibliographic Use) should be spelled with italics, so it
is clear that is what the initials stand for.

2. Specify that the romanization used by kJapaneseKun is 'Hepburn'. I do not
know whether the same applies to kJapaneseOn.

3. Make all appearances of the word 'pinyin' be consistently and correctly
spelled as 'pīnyīn' (except, of course, the names of the fields).

4. Include the number of entries for each field: This would add a new row
between 'Syntax' and 'Description'. The purpose would be to estimate the
size of the field, as well as its relative coverage of ideographs.
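
As a rough illustration of item 4, the per-field entry counts could be
derived from the data files themselves. The sketch below assumes the
tab-separated layout "U+XXXX<TAB>field<TAB>value" with '#' comment lines;
the function name and the sample records are hypothetical and only serve
the example.

```python
from collections import Counter


def count_entries_per_field(lines):
    """Count data records per field name in Unihan-style lines.

    Assumes each record is 'U+XXXX<TAB>kField<TAB>value' and that
    lines starting with '#' are comments.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) >= 3:
            counts[parts[1]] += 1  # parts[1] is the field name
    return counts


# Illustrative sample only; a real run would iterate over an opened file.
SAMPLE = [
    "# comment line",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E01\tkTotalStrokes\t2",
    "U+4E00\tkMandarin\tyī",
]

for field, total in sorted(count_entries_per_field(SAMPLE).items()):
    print(field, total)
```

Dividing each count by the number of encoded ideographs would give the
relative coverage mentioned above.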

Date/Time: Tue May 25 21:13:11 CDT 2021
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: Lack of documentation relating to other legacy East Asian encodings

Using this article as my source
(https://en.wikipedia.org/wiki/Extended_Unix_Code) I list some encoding
schemes and character sets not mentioned in UAX#38. This is merely an FYI
observation, so it should not affect the text of the annex for now.

Furthermore, I'm not sure if all elements of the list necessarily contain
ideographs or can be considered identical to other fields. In the first
case they can be dismissed, but in the latter, the equivalence must be
clearly documented. If there is another document that clarifies the
relation between different standards, it should be cited in the doc. 

The list in question is:

EUC-JP (Extended Unix Japan), Shift JIS, DEC Kanji (by Digital Equipment
Corporation), HP-15, HP-16, IKIS (by Data General), MacJapanese
(MacOS), Windows-932 (IBM-943), KEIS (by Hitachi), EUC-KR (Extended Unix
Korean/Wansung), UHC (Unified Hangul Code/Windows949/Extended Wansung),
HangulTalk (MacOS) and EUC-TW (Extended Unix Taiwan)

Big5 is only mentioned but not properly explained, and the GB fields are not
correctly attributed to the Guobiao standards.

Date/Time: Fri May 28 09:29:04 CDT 2021
Name: Michel Mariani
Report Type: Error Report
Opt Subject: Corrections for kTotalStrokes


After reporting issues on the Unihan mailing list, I am submitting the 
following corrections for the kTotalStrokes property:

U+28668	𨙨	kTotalStrokes	7
U+2F9DD	𠣞	kTotalStrokes	9

> It does not surprise me that there are some puzzlers lurking in the 
> kTotalStrokes property. The correction for U+28E0F 𨸏 was submitted 
> as public feedback by Jaemin Chung on 2020-09-03, and the correction 
> was applied to the Unihan database. 
> If you don't mind, please submit the following corrections via the 
> Contact Form so that we have a paper trail:
> [...]
> We should be able to get those corrections applied in time for 
> Unicode Version 14.0.


Date/Time: Fri Jun 11 23:46:52 CDT 2021
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: Anomalies in the spelling and format of Unihan field descriptions

I have noticed some anomalies in the spelling or format in different field 
descriptions, particularly "kIRG_GSource".

  kIRG_GSource:

Misspelled Book Name   | Corrected book name
-----------------------------------------
ZhongHua ZiHai         | Zhonghua Zihai
Chinese Encyclopedia   | Encyclopedia of China    (name also needs to be italicized)
Ci Hai                 | Cihai                    (if the "Ci Hai" spelling is preferred, then
                                                  it should be used consistently everywhere)
Ci Yuan                | Ciyuan
Hanyu Dacidian         | Hanyu Da Cidian
Hanyu Dazidian         | Hanyu Da Zidian
Hanyu Fangyan Dacidian | Hanyu Fangyan Da Cidian
Chinese Ancient Ethnic   Research on Ancient      (unless it doesn't refer to the book of the
Characters Research    | Chinese Characters       same name, it should be italicized)

Chinese book titles without pinyin or translations (if added, they should be italicized):

Chinese title          | pinyin [translation]
---------------------------------------------
汉语大字典(第二版)       | Hanyu Da Zidian (second edition)
漢文佛典疑難俗字彙釋與研究  | Hànwén Fódiǎn Yínán Sú Zìhuì Shì Yǔ Yánjiū
                         [Explanation and Research on Difficult and Vulgar Words in Chinese Buddhist Classics]
龍龕手鑑                 | Longkan Shoujian [The Handy Mirror in the Dragon Shrine or The Dragon Shrine/Niche Handbook]

Names of books not italicized:
Siku Quanshu, Yinzhou Jinwen Jicheng Yinde, Standard Telegraph Codebook (revised) 
(the last one should also precede the Chinese name for consistency)

Also, the name of the publisher "Zhuang Liao Songs Research" appears without spaces between words.

  kHDZRadBreak:
In the sentence "Indicates that 《漢語大字典》 Hanyu Da Zidian has a radical break 
beginning at this character’s position." place the pinyin first, and italicize 
it to be consistent. Similar suggestions apply to the descriptions of kHanyuPinlu and kHanyuPinyin

  kIRG_KPSource:
Reword the sentence: "... There may, therefore, be erroneous data in the values for 
this field." to say: "... Therefore, there may be erroneous data in the values for this field."

  kIRG_SSource:
Italicize or place quotes on the name "Taishō Shinshū Daizōkyō"

  kIRG_VSource:
Italicize or place quotes on the name "Kho Chữ Hán Nôm Mã Hoá"

Date/Time: Thu Jul 8 11:13:37 CDT 2021
Name: Ken Lunde
Report Type: Public Review Issue
Opt Subject: PRI #421 Feedback

There are a small number of anomalies in earlier versions of the Unihan
database, and it may be useful to document them in UAX #38, mainly in
Section 5, "History":

1) The Version 2.0.0 Unihan database file, Unihan-1.txt, is truncated in the 
middle of the records for U+8BC1 证:

https://www.unicode.org/Public/2.0-Update/Unihan-1.txt 

While this is already documented in Section 5 of UAX #38, it may be helpful
to add that the CD that is included with the Unicode Version 2.0 book has
the same issue, specifically that the files at
{DOS,MAC,UNIX}/MAPPINGS/EASTASIA/UNIHAN.TXT are truncated at the same
position. This would save those who have the Unicode Version 2.0 book
from having to check the CD on their own (like I did).

2) The Version 3.0.0 Unihan database file, Unihan-3.txt, includes 3,898 records 
for the undocumented kJHJ property:

https://www.unicode.org/Public/3.0-Update/Unihan-3.txt 

I suggest that appropriate entries be added to the table in Section 4.2 of 
UAX #38, specifically the following:

Version 3.0 row: Add "kJHJ" to the "Fields Added" column
Version 3.1 row: Add "kJHJ" to the "Fields Dropped" column

It may also be useful to document this property in Section 5 for the benefit
of those who parse older versions of the Unihan database.

3) The Version 3.1.1 Unihan database file, Unihan-3.1.1.txt, includes the 
following anomalous record at line 246,442:

U+64AC 297

See:

https://www.unicode.org/Public/3.1-Update1/Unihan-3.1.1.txt 

It may be useful to document this in Section 5 for the benefit of those 
who parse older versions of the Unihan database.

4) The Versions 2.0.0, 2.1.2, 3.0.0, and 3.1.0 Unihan database files are not encoded in UTF-8:

https://www.unicode.org/Public/2.0-Update/Unihan-1.txt 
https://www.unicode.org/Public/2.1-Update/Unihan-2.txt 
https://www.unicode.org/Public/3.0-Update/Unihan-3.txt 
https://www.unicode.org/Public/3.1-Update/Unihan-3.1.txt 

It may be useful to document this in Section 5 for the benefit of those 
who parse older versions of the Unihan database.
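
For those who parse older versions, items 3 and 4 above could be checked
for mechanically along the following lines. This is only a sketch: it
assumes the three-field, tab-separated record layout (which may not hold
for every early release), and the function name, the regular expression,
and the sample bytes are hypothetical.

```python
import re

# Assumed record shape: U+XXXX <TAB> field name <TAB> value (bytes pattern).
RECORD = re.compile(rb"U\+[0-9A-F]{4,5}\t\w+\t.+")


def scan_unihan_bytes(data):
    """Return (is_utf8, malformed) for the raw bytes of a Unihan-style file.

    malformed is a list of (line_number, raw_line) pairs for non-comment
    lines that do not look like a three-field record.
    """
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    malformed = []
    # Work on bytes so that line numbers do not depend on the encoding.
    for number, raw in enumerate(data.split(b"\n"), start=1):
        raw = raw.rstrip(b"\r")
        if not raw.strip() or raw.startswith(b"#"):
            continue
        if RECORD.fullmatch(raw) is None:
            malformed.append((number, raw))
    return is_utf8, malformed


# Illustrative bytes, not copied from any Unihan release. The third line
# mimics the two-field record reported in item 3; the stray 0xB7 byte
# makes the data invalid as UTF-8.
SAMPLE_BYTES = (
    b"# header\n"
    b"U+64AB\tkTotalStrokes\t15\n"
    b"U+64AC\t297\n"
    b"U+4E00\tkDefinition\tone\xb7\n"
)

print(scan_unihan_bytes(SAMPLE_BYTES))
```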

That is all.

Date/Time: Thu Jul 15 13:24:42 CDT 2021
Name: Ben Scarborough
Report Type: Public Review Issue
Opt Subject: Proposed change for UAX #38

In the current proposed update for UAX #38, the syntax for the kIRG_VSource property is:

V[0-4N]-[023F]?[0-9A-F]{4}

To keep it in line with how regexes are laid out for other IRG source properties, 
a more accurate regex would be:

V[0-4]-[0-9A-F]{4}
| VN-[023F][0-9A-F]{4}

because the V[0-4] sources are always 4 hex digits and the VN sources are always 5.
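
The difference between the two patterns can be seen with a small check
such as the one below. It is only a sketch: the sample values are
hypothetical and chosen to exercise the patterns, not taken from the
Unihan database, and whole-string matching is assumed.

```python
import re

# Syntax in the current proposed update.
OLD_SYNTAX = re.compile(r"V[0-4N]-[023F]?[0-9A-F]{4}")
# Suggested replacement: 4 hex digits for V0-V4, 5 for VN.
NEW_SYNTAX = re.compile(r"V[0-4]-[0-9A-F]{4}|VN-[023F][0-9A-F]{4}")


def accepted(pattern, value):
    """True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


# Hypothetical values: the first two have the documented shapes,
# the last two mix the prefix with the wrong number of digits.
SAMPLES = ["V0-3021", "VN-2A6D7", "V0-23021", "VN-4E00"]

for value in SAMPLES:
    print(value, accepted(OLD_SYNTAX, value), accepted(NEW_SYNTAX, value))
```

Both patterns accept the first two values, but only the current syntax
accepts the last two, which is the looseness the suggested regex removes.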