L2/11-383

Unicode 6.1 Beta feedback

Emmanuel Vallois

PRI 198         Proposed Update UAX #42: Unicode Character Database in XML

PRI 196         Proposed Update UAX #38: Unicode Han Database (Unihan)

Feedback

UAX 38

First, some small errors I spotted here and there (mostly editorial):

4.1 Alphabetical Listing: The last sentence before the tables that explain the fields in detail reads

“The fields covered in the table are: kAccountingNumeric, kAlternateJEF, …”

but the table in 4.2 Listing by Date of Addition to the Unicode Standard says kAlternateJEF was dropped in Unicode 3.1 (I have not been able to find any trace of it in archived versions), and it is certainly not covered in the table.

  

UAX #42, 4.4.21 Unihan properties

This concerns some oddities in the patterns and their discrepancies with those of UAX #38. I also verified that an attribute marked as a list {} is indeed indicated as being space-separated in UAX #38.

UAX 42 | UAX 38 | Comment | Proposed resolution

code-point-properties &= attribute kAlternateHanYu

     { text }?  #old

N/A, kHanYu: ^[1-8][0-9]{4}\.[0-3][0-9][0-3]$

These are no longer included in the regular Unihan database, but they should probably be given the same regular expression as their non-“Alternate” counterparts (viz. kHanYu, kKangXi, kMorohashi).

Copy Syntax from kHanYu in UAX 38:
{ xsd:string {pattern="[1-8][0-9]{4}\.[0-3][0-9][0-3]"} }?
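The proposed pattern drops the ^ and $ of the UAX 38 syntax because an XSD pattern must match the whole value. The following is a minimal sketch of that equivalence in Python, assuming re.fullmatch is a fair stand-in for XSD matching (the two regular expression dialects are not identical). The sample values are hypothetical and chosen only to exercise the syntax.

```python
import re

# UAX 38 syntax for kHanYu, Perl-style, with explicit anchors.
UAX38_KHANYU = r"^[1-8][0-9]{4}\.[0-3][0-9][0-3]$"
# Proposed UAX 42 pattern: the same expression without the anchors.
UAX42_KHANYU = r"[1-8][0-9]{4}\.[0-3][0-9][0-3]"


def uax38_accepts(value: str) -> bool:
    """Search with the anchored UAX 38 expression."""
    return re.search(UAX38_KHANYU, value) is not None


def uax42_accepts(value: str) -> bool:
    """Approximate XSD matching: the pattern must cover the whole value."""
    return re.fullmatch(UAX42_KHANYU, value) is not None


# Hypothetical values, not taken from the Unihan database.
samples = ["10019.020", "90019.020", "10019.0200", "x10019.020"]
results = {s: (uax38_accepts(s), uax42_accepts(s)) for s in samples}
for value, (in_uax38, in_uax42) in results.items():
    print(value, in_uax38, in_uax42)
```

On these samples both forms agree: only the first value is accepted.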

code-point-properties &= attribute kAlternateJEF

     { text }?  #old

N/A. I have no idea what this field was for; there is no kJEF field.

 

code-point-properties &= attribute kAlternateKangXi

     { text }?

N/A, kKangXi: [0-9]{4}\.[0-9]{2}[01]

Copy Syntax from kKangXi in UAX 38:
{ xsd:string {pattern="[0-9]{4}\.[0-9]{2}[01]"} }?

code-point-properties &= attribute kAlternateMorohashi

     { text }?

N/A, kMorohashi: [0-9]{5}\'?

Copy Syntax from kMorohashi in UAX 38:
{ xsd:string {pattern="[0-9]{5}'?"} }?

code-point-properties &= attribute kCNS1992

     { xsd:string {pattern="[123]-[0-9A-F]{4}"} }?

^[1-9]-[0-9A-F]{4}$

The regular expression is more restrictive than in UAX 38

Copy pattern to UAX 38 Syntax:
Syntax:                ^[123]-[0-9A-F]{4}$

code-point-properties &= attribute kCantonese

     { list { xsd:string {pattern="[a-z]+[1-6]"} +}}?

^[a-z]{1,6}[1-6]$

The regular expression is more restrictive in UAX 38

Copy Syntax from kCantonese in UAX 38:
{ list { xsd:string {pattern="[a-z]{1,6}[1-6]"} +}}?

code-point-properties &= attribute kCCCII

     { xsd:string {pattern="[0-9A-F]{6}"} }?

Delimiter:           space

This field never has multiple values.

Change UAX 38 to
Delimiter:           N/A

code-point-properties &= attribute kCheungBauer

     { text }?

^[0-9]{3}\/[0-9]{2};[A-Z]*;[a-z1-6\[\]\/,]+$

Delimiter:           space

The regular expression is more restrictive in UAX 38, but it is still not very detailed on the last part, so I suggest an improvement.

To better show the structure of the last part of the field, change UAX 38 to
Syntax:                ^[0-9]{3}\/[0-9]{2};[A-Z]*;(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*(,(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*)*$

And UAX 42 to

{ list { xsd:string {pattern="[0-9]{3}\/[0-9]{2};[A-Z]*;(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*(,(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*)*"} +}}?
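To illustrate what the more detailed kCheungBauer syntax buys, here is a small Python sketch that compares the current UAX 38 expression with the proposed one. The sample values are made up to fit the field layout and are not taken from the Unihan data; re.fullmatch stands in for the anchors.

```python
import re

# Current UAX 38 syntax: the last part is a single loose character class.
CURRENT = r"[0-9]{3}\/[0-9]{2};[A-Z]*;[a-z1-6\[\]\/,]+"

# Proposed syntax: comma-separated readings, each an optional "[ng]" prefix,
# 1 to 6 letters, a tone digit, and optional "/tone" alternates.
READING = r"(\[ng\])?[a-z]{1,6}[1-6](\/[1-6])*"
PROPOSED = r"[0-9]{3}\/[0-9]{2};[A-Z]*;" + READING + "(," + READING + ")*"


def accepts(pattern: str, value: str) -> bool:
    return re.fullmatch(pattern, value) is not None


# Hypothetical values: two well-formed entries and one malformed entry.
well_formed = ["055/08;TLBO;mang4", "001/01;;[ng]aa1/3,ngaa1"]
malformed = "055/08;TLBO;,,//]]"

for value in well_formed + [malformed]:
    print(value, accepts(CURRENT, value), accepts(PROPOSED, value))
```

The malformed value passes the current expression because every character belongs to the class, while the proposed expression rejects it.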

code-point-properties &= attribute kCheungBauerIndex

     { list { xsd:string {pattern="[0-9]{3}\.[0-9]{2}"} +}}?

^[0-9]{3}\.[01][0-9]$

The regular expression is more restrictive in UAX 38

Copy Syntax from kCheungBauerIndex in UAX 38:
{ list { xsd:string {pattern="[0-9]{3}\.[01][0-9]"} +}}?

code-point-properties &= attribute kCompatibilityVariant

     { "" | xsd:string {pattern="U\+2?[0-9A-F]{4}"} }?

 

The empty string is due to the presence of the attribute on every character, even though only 1.3% of all characters have a value for that attribute.

Do not output this attribute by default: it needlessly accounts for more than 5% of the size of the file ucd.unihan.grouped.xml.

Change UAX 42 to
{ xsd:string {pattern="U\+2?[0-9A-F]{4}"} }?
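The size estimate can be reproduced with a short script. The following Python sketch assumes that empty attributes are serialized as name="" preceded by a single space; the helper name and the two-line sample are hypothetical, and a real measurement would read ucd.unihan.grouped.xml instead.

```python
def empty_attribute_share(xml_text: str, attribute: str) -> float:
    """Return the fraction of bytes spent on empty occurrences of `attribute`.

    Assumes the serialization ` name=""` (one leading space, double quotes).
    """
    needle = f' {attribute}=""'
    total = len(xml_text.encode("utf-8"))
    wasted = xml_text.count(needle) * len(needle.encode("utf-8"))
    return wasted / total if total else 0.0


# Hypothetical fragment in the style of the UCD XML files.
sample = (
    '<char cp="4E00" kCompatibilityVariant="" kTotalStrokes="1"/>\n'
    '<char cp="F900" kCompatibilityVariant="U+8C48" kTotalStrokes="10"/>\n'
)
share = empty_attribute_share(sample, "kCompatibilityVariant")
print(f"{share:.1%} of the sample is taken by empty attributes")
```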

code-point-properties &= attribute kDaeJaweon

     { xsd:string {pattern="[0-9]{4}\.[0-9]{2}[0158]"} }?

Syntax  ^[0-9]{4}\.[0-9]{2}[01]$

The regular expression is more restrictive in UAX 38

Copy Syntax from kDaeJaweon in UAX 38:
{ xsd:string {pattern="[0-9]{4}\.[0-9]{2}[01]"} }?

code-point-properties &= attribute kIRG_JSource

     { "" | xsd:string {pattern="(0|1|3|(3A)|4|A|(ARIB)|K)-[0-9A-F]{4,5}"}

          | xsd:string {pattern="J0-[0-9A-F]{4}"}

          | xsd:string {pattern="J1-[0-9A-F]{4}"}

          | xsd:string {pattern="J3-[0-9A-F]{4}"}

          | xsd:string {pattern="J3A-[0-9A-F]{4}"}

          | xsd:string {pattern="J4-[0-9A-F]{4}"}

          | xsd:string {pattern="JA-[0-9A-F]{4}"}

          | xsd:string {pattern="JH-[0-9A-Z]{6,7}"}

          | xsd:string {pattern="JK-[0-9]{5}"}

          | xsd:string {pattern="JARIB-[0-9A-F]{4}"} }?

^J((([0134AK]|3A|ARIB)-[0-9A-F]{4,5})|(H-(((IB|JT|[0-9]{2})[0-9A-F]{4}S?))))$

The regular expression is more compact and precise in UAX 38. Moreover, the first pattern alternative is incorrect as it lacks the leading J, and all the others are contained in the UAX 38 syntax. This remark applies to all kIRG_?Source fields.

Change UAX 42 to use UAX 38 regular expression.
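The point about the first alternative can be checked mechanically. The Python sketch below uses one made-up value per remaining UAX 42 alternative (none is taken from the Unihan data) and re.fullmatch in place of the anchors. It only tests these samples and does not prove containment in general.

```python
import re

# First UAX 42 alternative, which lacks the leading "J".
UAX42_FIRST = r"(0|1|3|(3A)|4|A|(ARIB)|K)-[0-9A-F]{4,5}"
# Remaining UAX 42 alternatives, in the order they appear.
UAX42_OTHERS = [
    r"J0-[0-9A-F]{4}", r"J1-[0-9A-F]{4}", r"J3-[0-9A-F]{4}",
    r"J3A-[0-9A-F]{4}", r"J4-[0-9A-F]{4}", r"JA-[0-9A-F]{4}",
    r"JH-[0-9A-Z]{6,7}", r"JK-[0-9]{5}", r"JARIB-[0-9A-F]{4}",
]
# UAX 38 syntax for kIRG_JSource, anchors dropped.
UAX38 = (r"J((([0134AK]|3A|ARIB)-[0-9A-F]{4,5})"
         r"|(H-(((IB|JT|[0-9]{2})[0-9A-F]{4}S?))))")

# One hypothetical value per alternative in UAX42_OTHERS.
samples = ["J0-3021", "J1-3021", "J3-2E21", "J3A-2121", "J4-2121",
           "JA-2121", "JH-IB1234", "JK-65001", "JARIB-7521"]


def accepts(pattern: str, value: str) -> bool:
    return re.fullmatch(pattern, value) is not None


for pattern, value in zip(UAX42_OTHERS, samples):
    print(value,
          accepts(pattern, value),      # its own UAX 42 alternative
          accepts(UAX38, value),        # the UAX 38 expression
          accepts(UAX42_FIRST, value))  # the alternative without "J"
```

Every sample is accepted by its own alternative and by the UAX 38 expression, and none by the first alternative, which would only accept a value such as "0-3021" with no source prefix.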

code-point-properties &= attribute kIRG_MSource

     { "" | xsd:string {pattern="MAC[0-9]{5}"}

          | xsd:string {pattern="MAC-[0-9]{5}"} }?

^MAC-[0-9]{5}$

Incorrect alternative in UAX 42.

Change UAX 42 to

{ "" | xsd:string {pattern="MAC-[0-9]{5}"} }?

code-point-properties &= attribute kMandarin

     { list {   xsd:string {pattern="[A-ZÜ̈]+[1-5]"} <-- the class contains Ü followed by U+0308

              |  xsd:string {pattern="[a-z̀́̄̈̌]+"}  }}?

^[a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+$

In UAX 42 the old expression seems to survive in the first pattern alternative.

In UAX 42, delete the first pattern alternative, i.e. conform to the UAX 38 syntax:

{ list {   xsd:string {pattern="[a-z̀́̄̈̌]+"}  }}?
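As a rough check of the two kMandarin alternatives, the Python sketch below rewrites the \x{...} escapes as \u escapes and tests one reading in each style. It assumes values are stored decomposed (NFD), since the UAX 38 class lists combining marks; both sample readings are illustrative only.

```python
import re
import unicodedata

# UAX 38 syntax for kMandarin, with \x{...} rewritten as Python \u escapes.
UAX38_KMANDARIN = r"[a-z\u0300-\u0302\u0304\u0308\u030C]+"
# First UAX 42 alternative: capitals, U with diaeresis, U+0308, then a tone digit.
UAX42_FIRST_ALTERNATIVE = r"[A-Z\u00DC\u0308]+[1-5]"


def accepts(pattern: str, value: str) -> bool:
    return re.fullmatch(pattern, value) is not None


# "l" + U+01DC decomposes to l, u, U+0308 (diaeresis), U+0300 (grave).
decomposed = unicodedata.normalize("NFD", "l\u01DC")
# Tone-numbered capitals, in the style the first alternative describes.
old_style = "L\u00DC4"

print(accepts(UAX38_KMANDARIN, decomposed))         # new-style reading
print(accepts(UAX38_KMANDARIN, old_style))          # old-style reading
print(accepts(UAX42_FIRST_ALTERNATIVE, old_style))  # only the old pattern takes it
```

Only the first alternative accepts the old-style reading, which supports the view that it is a leftover of the earlier format.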

code-point-properties &= attribute kVietnamese

     { list { xsd:string {pattern="[A-Za-zŕ-ự̛̀̉-]+"} +}}?

^[A-Za-z\x{110}\x{111}\x{300}-\x{303}\x{306}\x{309}\x{31B}\x{323}]+$

code-point-properties &= attribute kXHC1983

     { list { xsd:string {pattern="[0-9,.*]+:[a-zǜ́̄̈̌]+"} +}} ?

^[0-9]{4}\.[0-9]{3}\*?(,[0-9]{4}\.[0-9]{3}\*?)*:[a-z\x{300}\x{301}\x{304}\x{308}\x{30C}]+$

The regular expression is more restrictive in UAX 38

Change UAX 42 to use UAX 38 regular expression.
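For kXHC1983, the difference lies mainly in the location part before the colon. The Python sketch below compares the two expressions on made-up values. It assumes the UAX 42 reading class contains the same combining marks as UAX 38, so that only the location part differs; this is an assumption, as the UAX 42 class is hard to read in the draft.

```python
import re

# Combining marks allowed in the reading (from the UAX 38 syntax).
MARKS = r"\u0300\u0301\u0304\u0308\u030C"

# UAX 38: one or more "pppp.nnn" locations, each optionally starred.
UAX38_KXHC1983 = (r"[0-9]{4}\.[0-9]{3}\*?(,[0-9]{4}\.[0-9]{3}\*?)*"
                  r":[a-z" + MARKS + r"]+")
# UAX 42: any run of digits, commas, periods and asterisks before the colon.
UAX42_KXHC1983 = r"[0-9,.*]+:[a-z" + MARKS + r"]+"


def accepts(pattern: str, value: str) -> bool:
    return re.fullmatch(pattern, value) is not None


# Hypothetical values: two well-formed, two with a malformed location part.
samples = [
    "0295.011:fan",
    "0295.011*,0300.020:fa\u0300n",
    "..,,**:fan",
    "295.11:fan",
]
for value in samples:
    print(repr(value), accepts(UAX42_KXHC1983, value), accepts(UAX38_KXHC1983, value))
```

The UAX 42 pattern accepts all four values, while the UAX 38 expression rejects the last two.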