Emmanuel Vallois
PRI 198 Proposed Update
UAX #42: Unicode Character Database in XML
PRI 196 Proposed Update
UAX #38: Unicode Han Database (Unihan)
Feedback
UAX 38
First, some small errors I spotted here and there (mostly editorial):
4.1 Alphabetical Listing: The last sentence before the tables explaining the fields in detail reads “The fields covered in the table are: kAccountingNumeric, kAlternateJEF, …”, but the table in 4.2 Listing by Date of Addition to the Unicode Standard says kAlternateJEF was dropped in Unicode 3.1 (and I have not been able to find traces of it in archived versions), and it is certainly not covered in the table.
UAX #42, 4.4.21
Unihan properties
This concerns some oddities in the patterns and their discrepancies with the patterns of UAX #38. I also verified that one attribute marked as a list {} is indeed indicated as being space-separated in UAX #38.
UAX 42 | UAX 38 | Comment | Proposed resolution |
code-point-properties &= attribute kAlternateHanYu { text }? #old |
N/A, kHanYu: ^[1-8][0-9]{4}\.[0-3][0-9][0-3]$ |
These are no longer included in the regular Unihan database, but they should probably be given the same regular expressions as their non-“Alternate” counterparts (viz. kHanYu, kKangXi, kMorohashi). |
Copy
Syntax from kHanYu in UAX 38: |
code-point-properties &= attribute kAlternateJEF { text }? #old |
N/A; I have no idea what this field was for, as there is no kJEF field |
|
|
code-point-properties &= attribute kAlternateKangXi { text }? |
N/A,
kKangXi: [0-9]{4}\.[0-9]{2}[01] |
Copy Syntax from kKangXi in UAX 38: |
|
code-point-properties
&= attribute kAlternateMorohashi { text }? |
N/A,
kMorohashi: [0-9]{5}\'? |
Copy Syntax from kMorohashi in UAX 38: |
|
code-point-properties &= attribute kCNS1992 { xsd:string
{pattern="[123]-[0-9A-F]{4}"} }? |
^[1-9]-[0-9A-F]{4}$ |
The regular expression in UAX 42 is more restrictive than in UAX 38 |
Copy pattern
to UAX 38 Syntax: |
code-point-properties &= attribute kCantonese { list { xsd:string
{pattern="[a-z]+[1-6]"} +}}? |
^[a-z]{1,6}[1-6]$ |
The
regular expression is more restrictive in UAX 38 |
Copy Syntax from kCantonese in UAX 38: |
code-point-properties &= attribute kCCCII { xsd:string
{pattern="[0-9A-F]{6}"} }? |
Delimiter:
space |
This field never has multiple values |
Change
UAX 38 to |
code-point-properties &= attribute kCheungBauer { text }? |
^[0-9]{3}\/[0-9]{2};[A-Z]*;[a-z1-6\[\]\/,]+$ Delimiter:
space |
The regular expression is more restrictive in UAX 38, but it is still not very detailed on the last part; I therefore suggest an improvement |
To better show the structure of the last part of the field, change both UAX 38 to and UAX 42 to { list { xsd:string {pattern="^[0-9]{3}\/[0-9]{2};[A-Z]*;(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*(,(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*)*"} +}}? |
code-point-properties &= attribute kCheungBauerIndex { list { xsd:string
{pattern="[0-9]{3}\.[0-9]{2}"} +}}? |
^[0-9]{3}\.[01][0-9]$ |
The
regular expression is more restrictive in UAX 38 |
Copy
Syntax from kCheungBauerIndex in UAX 38: |
code-point-properties
&= attribute kCompatibilityVariant { "" | xsd:string
{pattern="U\+2?[0-9A-F]{4}"} }? |
|
The empty string is due to the presence of the attribute on every character, although only 1.3% of all characters have a value for that attribute. |
Don’t output this attribute by default; it needlessly contributes more than 5% of the size of the file ucd.unihan.grouped.xml. Change UAX 42 to |
code-point-properties
&= attribute kDaeJaweon { xsd:string
{pattern="[0-9]{4}\.[0-9]{2}[0158]"} }? |
Syntax ^[0-9]{4}\.[0-9]{2}[01]$ |
The
regular expression is more restrictive in UAX 38 |
Copy
Syntax from kDaeJaweon in UAX 38: |
code-point-properties &= attribute kIRG_JSource { ""
| xsd:string {pattern="(0|1|3|(3A)|4|A|(ARIB)|K)-[0-9A-F]{4,5}"} | xsd:string
{pattern="J0-[0-9A-F]{4}"} | xsd:string
{pattern="J1-[0-9A-F]{4}"} | xsd:string
{pattern="J3-[0-9A-F]{4}"} | xsd:string
{pattern="J3A-[0-9A-F]{4}"} | xsd:string
{pattern="J4-[0-9A-F]{4}"} | xsd:string
{pattern="JA-[0-9A-F]{4}"} | xsd:string
{pattern="JH-[0-9A-Z]{6,7}"} | xsd:string
{pattern="JK-[0-9]{5}"} | xsd:string
{pattern="JARIB-[0-9A-F]{4}"} }? |
^J((([0134AK]|3A|ARIB)-[0-9A-F]{4,5})|(H-(((IB|JT|[0-9]{2})[0-9A-F]{4}S?))))$ |
The regular expression is more compact and precise in UAX 38. Moreover, the first pattern alternative is incorrect as it lacks the leading J, and all the others are already covered by the UAX 38 syntax. This remark applies to all kIRG_?Source fields. |
Change UAX 42 to use the UAX 38 regular expression. |
code-point-properties &= attribute kIRG_MSource { "" | xsd:string
{pattern="MAC[0-9]{5}"} | xsd:string
{pattern="MAC-[0-9]{5}"} }? |
^MAC-[0-9]{5}$ |
The first alternative in UAX 42, which lacks the hyphen, is incorrect. |
Change UAX 42 to { "" | xsd:string {pattern="MAC-[0-9]{5}"} }? |
code-point-properties
&= attribute kMandarin { list { xsd:string
{pattern="[A-ZÜ̈]+[1-5]"} <-- There is Ü + \u308 | xsd:string
{pattern="[a-z̀́̄̈̌]+"} }}? |
^[a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+$ |
In UAX 42
the old expression seems to survive in the first pattern alternative. |
Delete the first pattern alternative in UAX 42, i.e. conform to the UAX 38 syntax: { list
{ xsd:string
{pattern="[a-z̀́̄̈̌]+"} }}? |
code-point-properties
&= attribute kVietnamese { list { xsd:string
{pattern="[A-Za-zŕ-ừ-̛̣̆̉ạ-ỹ]+"} +}}? code-point-properties &= attribute
kXHC1983 { list { xsd:string
{pattern="[0-9,.*]+:[a-zǜ́̄̈̌]+"} +}} ? |
^[A-Za-z\x{110}\x{111}\x{300}-\x{303}\x{306}\x{309}\x{31B}\x{323}]+$ ^[0-9]{4}\.[0-9]{3}\*?(,[0-9]{4}\.[0-9]{3}\*?)*:[a-z\x{300}\x{301}\x{304}\x{308}\x{30C}]+$ |
The
regular expression is more restrictive in UAX 38 |
Change UAX
42 to use UAX 38 regular expression. |
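The discrepancies listed in the table can be checked mechanically. The following is only a minimal sketch in Python, not part of either annex: the patterns are copied from the rows above, while the sample values are hypothetical strings invented to fall between the two patterns (they are not taken from the Unihan database), and the names PATTERNS, CHEUNG_BAUER, SAMPLES and compare are illustrative. Since XSD patterns are implicitly anchored, both columns are tested as whole-string matches.

```python
import re

# (UAX #42 pattern, UAX #38 pattern) pairs copied from the table above.
# XSD patterns match the whole value, hence re.fullmatch for both.
PATTERNS = {
    "kCNS1992": (r"[123]-[0-9A-F]{4}", r"^[1-9]-[0-9A-F]{4}$"),
    "kCantonese": (r"[a-z]+[1-6]", r"^[a-z]{1,6}[1-6]$"),
    "kCheungBauerIndex": (r"[0-9]{3}\.[0-9]{2}", r"^[0-9]{3}\.[01][0-9]$"),
    "kDaeJaweon": (r"[0-9]{4}\.[0-9]{2}[0158]", r"^[0-9]{4}\.[0-9]{2}[01]$"),
}

# Pattern proposed above for one list item of kCheungBauer.
CHEUNG_BAUER = (
    r"^[0-9]{3}\/[0-9]{2};[A-Z]*;"
    r"(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*"
    r"(,(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*)*"
)

# Hypothetical values (assumptions, not real Unihan data), chosen so that
# exactly one of the two patterns accepts each of them.
SAMPLES = {
    "kCNS1992": "4-2A3F",
    "kCantonese": "abcdefgh1",
    "kCheungBauerIndex": "123.45",
    "kDaeJaweon": "1234.565",
}


def compare(field, value):
    """Return (accepted by UAX #42 pattern, accepted by UAX #38 pattern)."""
    uax42, uax38 = PATTERNS[field]
    return (
        re.fullmatch(uax42, value) is not None,
        re.fullmatch(uax38, value) is not None,
    )


if __name__ == "__main__":
    for field, value in SAMPLES.items():
        print(field, value, compare(field, value))
```

Running the same comparison over the actual field values of a Unihan release, rather than over invented samples, would show which of the two patterns the data really requires.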