L2/02-079

February 12, 2002

Eric Muller

PDUTR: UCD in XML

Table of Contents

1.  Introduction
2.  DTD
 
2.1.  General principles
2.2.  Collections
2.3.  Aliases, cross-references and comments
2.4.  Code points
2.5.  Ranges of code points and characters
2.6.  Properties
2.7.  Name Properties
2.8.  General Category
2.9.  Combining Properties
2.10.  Bidirectionality Properties
2.11.  Decomposition Properties
2.12.  Numeric Properties
2.13.  Joining Properties
2.14.  Linebreak Properties
2.15.  East Asian Width Properties
2.16.  Case Properties
2.17.  Script Properties
2.18.  ISO Comment Properties
2.19.  Unihan properties
2.20.  Complete DTD
3.  Examples
4.  UCD to XML
 
4.1.  Internal data structures
4.2.  Utilities
4.3.  UnicodeData.txt
4.4.  NamesList.txt
4.5.  Unihan.txt
4.6.  BidiMirroring.txt
4.7.  ArabicShaping.txt
4.8.  Linebreak.txt
4.9.  EastAsianWidth.txt
4.10.  SpecialCasing.txt properties
4.11.  Scripts.txt
4.12.  DerivedAge.txt
4.13.  Generating XML
4.14.  Complete program
5.  Using the XML version

1. Introduction

In working on Unicode implementations, it is often useful to access the full content of the Unicode character database (UCD). For example, in establishing mappings from characters to glyphs in fonts, it is convenient to see the character scalar value, the character name, the character cross-references, the character east asian width, along with the shape and metrics of the proposed glyph to map to; looking at all this data simultaneously helps in evaluating the mapping.

Accessing the data files that constitute the UCD directly is sometimes a daunting proposition. The data is dispersed in a number of files of various formats, and there are just enough peculiarities (all justified by the processing power available at the time the UCD was designed) to require a fairly intimate knowledge of the data format itself, in addition to the meaning of the data.

Many programming environments (e.g. Java or ICU) do give access to the UCD. However, those environments tend to lag behind releases of the standard, or support only some of the UCD content.

Unibook is a wonderful tool to explore the UCD and in many cases is just the ticket; however, it is difficult to use when the task at hand has not been built in, or when non-UCD data is to be displayed alongside.

This paper presents an alternative representation of the UCD, which is meant to overcome these difficulties. We have chosen an XML representation, because parsing becomes a non-issue: there are a number of XML parsers freely available, and using them is often fairly easy. In addition, there are freely available tools that can perform powerful operations on XML data; for example, XPath and XQuery engines can be thought of as a "grep" for XML data and XSLT engines can be thought of as an "awk" for XML data.

It is important to note that we are interested in exploring the content of the UCD, rather than applying UCD-based processing to character streams. Thus, we are not much concerned with the speed of processing or the size of our representation.

Our representation supports the creation of documents that represent only parts of the UCD, either by not representing all the characters, or by not representing all the properties. This can be useful when only some of the data is needed.

2. DTD

2.1. General principles

Of course, characters are pervasive in the UCD, and will need to be represented. Representing characters directly by themselves would seem the most obvious choice; for example, we could express that the decomposition of U+00E8 is “è”, i.e. have exactly the two characters U+0065 U+0300 in (the infoset of) the XML document. However, the current XML specification limits the set of characters that can be part of a document. Another problem is that the various tools (XML parser, XPath engine, etc.) may equate U+00E8 with U+0065 U+0300, thus making it difficult to figure out which of the two sequences is contained in the database (which is somewhat important for our purposes). Therefore, we chose instead to represent characters by their code points; we follow the usual convention of four to six uppercase hexadecimal digits, with the code points in a sequence separated by a space; e.g., the decomposition of U+00E8 will be represented by the nine characters “0065 0300” in the infoset.
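The convention just described can be sketched in a few lines of Java. This is only an illustration, not part of the program presented later; the class name CodePointText and its two methods are hypothetical.

```java
final class CodePointText {
  // Formats code points as four to six uppercase hexadecimal digits,
  // separated by single spaces.
  static String format(int[] codePoints) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < codePoints.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      // %04X pads to at least four digits; larger values keep all their digits.
      sb.append(String.format("%04X", codePoints[i]));
    }
    return sb.toString();
  }

  // Parses the textual form back into code point values.
  static int[] parse(String text) {
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return new int[0];
    }
    String[] parts = trimmed.split("\\s+");
    int[] result = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      result[i] = Integer.parseInt(parts[i], 16);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(format(new int[] {0x0065, 0x0300}));
  }
}
```

Because only ASCII digits, letters and spaces appear in the attribute values, no tool in the processing chain has an opportunity to normalize the sequence.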

2.2. Collections

A collection is a set of code point descriptions. As we will see shortly, each code point will be represented by one of four elements:

<!ELEMENT collection (char | notachar | reserved | unassigned)*>
<!ATTLIST collection
      
[collection.attributes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]>

A collection can be partial in two ways.

First, it can describe only some of the Unicode code points. The cps attribute lists those code points. It is a sequence of space-separated ranges, where each range is either a single code point or a pair of code points separated by “-”. If a code point which is part of cps is not explicitly represented by a child of the collection, it is implicitly unassigned.

[collection attribute for the code points] ==
      
   cps   CDATA #REQUIRED
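As an illustration of the cps syntax, here is a minimal sketch of a parser for it. The class name CpsRanges and the sample value used in the comment are ours, not part of the DTD or of the program below.

```java
import java.util.ArrayList;
import java.util.List;

final class CpsRanges {
  // Parses a cps value such as "0000-007F 00A0 20000-2A6D6" into
  // {first, last} pairs; a single code point yields first == last.
  static List<int[]> parse(String cps) {
    List<int[]> ranges = new ArrayList<>();
    for (String token : cps.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // an empty attribute value describes no code points
      }
      int dash = token.indexOf('-');
      int first = Integer.parseInt(dash < 0 ? token : token.substring(0, dash), 16);
      int last = dash < 0 ? first : Integer.parseInt(token.substring(dash + 1), 16);
      ranges.add(new int[] {first, last});
    }
    return ranges;
  }

  // True if the code point falls in one of the ranges, i.e. the collection
  // describes it (explicitly, or implicitly as unassigned).
  static boolean contains(List<int[]> ranges, int cp) {
    for (int[] r : ranges) {
      if (r[0] <= cp && cp <= r[1]) {
        return true;
      }
    }
    return false;
  }
}
```

A consumer could use contains to tell apart a code point that is implicitly unassigned from one that the collection simply does not cover.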

Second, it can describe only some of the Unicode properties. For each property, there is one attribute on the collection element, which can take one of the values “Y” and “N”. If an attribute has the value “Y”, then all the char elements in the collection do specify the value of the property (either explicitly or implicitly via a default value in the DTD). Conversely, if an attribute has the value “N”, then none of the char elements specify the value of the property, even if the DTD gives a default value; in this case, consumers should be careful to discard the value reported by the XML parser. Those collection attributes will be introduced along with the elements and attributes used to represent the values of the properties.
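The rule for consumers can be sketched as follows. This is a hypothetical helper, assuming the attributes of the collection element and of a char element have already been collected into maps by whatever parser is in use; the class and method names are ours.

```java
import java.util.Map;

final class ExpressedProperties {
  // Returns the value of a property for a char element, or null when the
  // collection does not express that property.
  static String value(Map<String, String> collectionAttrs,
                      Map<String, String> charAttrs,
                      String property) {
    // The collection attributes shown in this document all default to "N".
    String expressed = collectionAttrs.getOrDefault(property, "N");
    if (!"Y".equals(expressed)) {
      // Discard whatever the parser reported, including a DTD default.
      return null;
    }
    return charAttrs.get(property);
  }
}
```

For example, with a collection that carries only gc="Y", a validating parser may still report ccc="0" on each char; the helper returns null for ccc so that the DTD default is not mistaken for data.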

2.3. Aliases, cross-references and comments

The aliases, cross-references and comments out of NamesList.txt are represented as elements, since there can be more than one of each in any context where they occur. We take advantage of the fact that they have no internal structure to record their values as attributes rather than text child nodes.

The value of a cross reference contains only the code point of the target character.

[alias, comment and crossRef] ==
      

<!ELEMENT alias EMPTY>
<!ATTLIST alias v CDATA #REQUIRED>

<!ELEMENT comment EMPTY>
<!ATTLIST comment v CDATA #REQUIRED>

<!ELEMENT crossRef EMPTY>
<!ATTLIST crossRef v CDATA #REQUIRED>

Here is a fairly extensive example, for U+0027:

      <alias v="APOSTROPHE-QUOTE"/>
      <alias v="APL quote"/>
      <comment v="neutral (vertical) glyph having mixed usage"/>
      <comment v="preferred character for apostrophe is 2019"/>
      <comment v="preferred characters in English for paired quotation marks are 2018 &amp; 2019"/>
      <crossRef v="02B9"/>
      <crossRef v="02BC"/>
      <crossRef v="02C8"/>
      <crossRef v="0301"/>
      <crossRef v="2032"/>

2.4. Code points

If a code point has been designated as a noncharacter, we represent it by a notachar element; we record the code point itself, potentially the version of the standard in which it was designated so, as well as the comments and cross references it may have.

<!ELEMENT notachar (comment | crossRef)*>
<!ATTLIST notachar
  cp      CDATA   #REQUIRED
  age     CDATA   #IMPLIED>

Here is an example:

   <notachar cp="FFFE" age="1.1">
      <comment v="the value FFFE is guaranteed not to be a Unicode character at all"/>
      <comment v="may be used to detect byte order by contrast with FEFF which is a character"/>
      <crossRef v="FEFF"/>
   </notachar>

If a code point has been reserved, we represent it by a reserved element. We record the code point itself, potentially the version of the standard in which it was designated so, as well as the cross-references it may have:

<!ELEMENT reserved (crossRef*)>
<!ATTLIST reserved
  cp      CDATA   #REQUIRED
  age     CDATA   #IMPLIED>

Here is an example:

   <reserved cp="0B35">
      <crossRef v="0B2C"/>
   </reserved>

If a code point has been assigned a character, we represent it by a char element. As before we record the code point itself, and potentially the version of the standard in which it was assigned. In addition, we record the cross references, aliases, comments and Unicode properties for that character. In general, we use attributes when the property has a single occurrence, and elements when the property can have multiple occurrences (currently, kPhonetic from Unihan.txt):

<!ELEMENT char (crossRef | alias | comment 
                   
[Unihan elements for char])*>
<!ATTLIST char
  cp      CDATA   #REQUIRED
  age     CDATA   #IMPLIED
  
[character.attributes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]>

Here is an example (the attributes will be described later):

   <char cp="0041" age="1.1" na="LATIN CAPITAL LETTER A" gc="Lu"
         slc="0061" ea="Na" sc="Latn"/>

The final case to handle is that of unassigned code points; the only thing we have is the code point itself:

<!ELEMENT unassigned EMPTY>
<!ATTLIST unassigned
  cp      CDATA   #REQUIRED>

Here is an example:

   <unassigned cp="1060"/>
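To show how a consumer might distinguish the four kinds of code point elements, here is a small sketch using the SAX API of the Java platform. The class name CodePointKinds is ours, and the approach of recognizing the elements by the presence of a cp attribute is merely a convenience of this example.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CodePointKinds {
  // Counts the elements that describe a code point, keyed by element name
  // (char, notachar, reserved or unassigned).
  static Map<String, Integer> count(String xml) {
    final Map<String, Integer> counts = new TreeMap<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName,
                               String qName, Attributes atts) {
        // Only the four code point elements carry a cp attribute;
        // collection, alias, comment and crossRef do not.
        if (atts.getValue("cp") != null) {
          counts.merge(qName, 1, Integer::sum);
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (Exception e) {
      throw new IllegalStateException("cannot parse collection", e);
    }
    return counts;
  }
}
```

Applied to a collection that contains the reserved, notachar and unassigned examples above, the resulting map has one entry for each of those three element names.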

Two things to note here. First, our DTD does not capture the allocation of code points (that is 0000 to FFFF in Unicode 1.0, and 10000 to 10FFFF in Unicode 2.0); since this is very stable and simple, it does not seem worth recording. Second, the status of surrogate code points is a little bit awkward, owing to their history. The current standard treats those as ordinary code points with somewhat ordinary characters; another point of view is that surrogates are designations of code points, and that there are no characters assigned to them. This distinction is not essential to our purposes, so we will follow the current wording of the standard.

2.5. Ranges of code points and characters

It is often the case that many successive code points have the same description. At this stage, our DTD does not support that; however, should it be extended to do so, the preferred approach would be to have elements for notacharacterRange, reservedRange, characterRange and unassignedRange.

2.6. Properties

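As we described earlier, each property will contribute two pieces to the DTD: an attribute on the collection element, which indicates whether the property is expressed in the document, and an attribute (or, for a property that can occur multiple times, an element) on the char element, which holds the value of the property.

We will describe both pieces for each property in turn.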

2.7. Name Properties

There are two name properties: the name given by the current version of the standard (na), and possibly the name this character had in version 1.0 of the standard (na1). na is required if expressed; na1 has a default value of “”.

[Names attributes for collection] ==
      
  na              (Y | N)      "N"
  na1             (Y | N)      "N"

[Names attributes for char] ==
      
  na               CDATA #IMPLIED
  na1              CDATA ""

2.8. General Category

The general category is represented by the gc attribute. There is no default for this property. The possible values are those listed in table 4.5 of the standard.

[General Category attributes for collection] ==
      
  gc              (Y | N)      "N"

[General Category attributes for char] ==
      
  gc      ( Lu | Ll | Lt | Lm | Lo
          | Mn | Mc | Me 
          | Nd | Nl | No
          | Pc | Pd | Ps | Pe | Pi | Pf | Po
          | Sm | Sc | Sk | So
          | Zs | Zl | Zp
          | Cc | Cf | Cs | Co | Cn)             #IMPLIED

2.9. Combining Properties

The combining class is represented by the ccc attribute, which holds the decimal representation of the combining class. The default value is the very common value 0.

[Combining attributes for collection] ==
      
  ccc             (Y | N)      "N"

[Combining attributes for char] ==
      
  ccc     CDATA                "0"

2.10. Bidirectionality Properties

The bidirectional category is represented by the bc attribute. The possible values are those listed in table 3.8 of the standard. Since L (left to right) is the most common case, we make it the default value.

The mirrored property is represented by the Bidi_M attribute, which can take the values “Y” or “N”, the latter being the default.

If the mirrored property is true, then the bmg attribute may be present. Its value is the code point of a character whose glyph is typically a mirrored image of the glyph for the current character. The bmg attribute may be absent either because this property is not expressed in the document, or because there is no appropriate character.

Note that we do not express the “Best Fit” element recorded in BidiMirroring.txt. For one thing, it is not meant to be machine readable. More importantly, the idea underlying the mirrored glyph is delicate to use, since it makes assumptions about the design of the fonts, and the best fit goes even farther.

[Bidirectionality attributes for collection] ==
      
  bc              (Y | N)      "N"
  Bidi_M          (Y | N)      "N"
  bmg             (Y | N)      "N"

[Bidirectionality attributes for char] ==
      
  bc          ( AL | AN 
              | B  | BN
              | CS
              | EN | ES  | ET
              | L  | LRE | LRO
              | NSM 
              | ON
              | PDF
              | R  | RLE | RLO
              | S
              | WS)                    "L"

  Bidi_M      ( Y | N )                "N"
  bmg         CDATA                    #IMPLIED

2.11. Decomposition Properties

The decomposition type is represented by the dt attribute. The possible values are can for characters with a canonical decomposition, no for characters without a decomposition (either canonical or compatibility) or the tag of a compatibility decomposition (using the values defined by PropertyAliases). The most common case, no, is the default.

If the decomposition type is not no, then the decomposition mapping, recorded by the dm attribute, is meaningful, and must be present if this property is expressed. The value of this attribute is the code point sequence into which this character decomposes.

[Decomposition attributes for collection] ==
      
  dt              (Y | N)      "N"
  dm              (Y | N)      "N"

[Decomposition attributes for char] ==
      
  dt      ( can | com | enc | fin | font | fra
          | init | iso | med | nar | nb | no | sml 
          | sqr | sub | sup | vert | wide)            "no"
  dm      CDATA #IMPLIED
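When the actual characters are needed, a dm value can be turned back into a Java string. The following sketch assumes the space-separated hexadecimal convention of section 2.1; the class and method names are hypothetical.

```java
final class Decomposition {
  // Converts a dm value such as "0065 0300" into the Java string made of
  // those code points. Supplementary code points become surrogate pairs.
  static String toJavaString(String dm) {
    String[] parts = dm.trim().split("\\s+");
    int[] codePoints = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      codePoints[i] = Integer.parseInt(parts[i], 16);
    }
    return new String(codePoints, 0, codePoints.length);
  }
}
```

Doing this conversion only at the point of use keeps the XML document itself free of characters that tools might normalize.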

2.12. Numeric Properties

The numeric type is represented by the nt attribute. The possible values are de (decimal digit), di (digit), nu (numeric) and no (not numeric).

The most common case, no, is the default.

If the numeric type is not no, then the numeric value must be present, if represented in the data file. It is represented by the nv attribute, which holds the value of the corresponding field in the UnicodeData.txt database file.

[Numeric attributes for collection] ==
      
  nt              (Y | N)      "N"
  nv              (Y | N)      "N"

[Numeric attributes for char] ==
      
  nt      ( de | di | no | nu )             "no"
  nv      CDATA                              #IMPLIED
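A consumer that needs an actual number has to interpret the nv string. The sketch below assumes that nv carries the numeric field of UnicodeData.txt unchanged, which is either an integer or a fraction written with a slash (such as 1/2); both the assumption and the class name NumericValue are ours.

```java
final class NumericValue {
  // Interprets an nv value, written either as an integer ("7") or as a
  // fraction ("1/2"), as a double.
  static double parse(String nv) {
    int slash = nv.indexOf('/');
    if (slash < 0) {
      return Double.parseDouble(nv);
    }
    double numerator = Double.parseDouble(nv.substring(0, slash));
    double denominator = Double.parseDouble(nv.substring(slash + 1));
    return numerator / denominator;
  }
}
```

Keeping the string form in the XML document and converting on demand avoids committing the representation to a particular numeric type.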

2.13. Joining Properties

The joining class of a character is represented by the jt attribute. The possible values are those listed in table 8.2 of the standard. The most common value, U, is the default.

If the joining class is neither “U”, “C” nor “T”, then the jg attribute must be present, if represented in the data file, and its value is the joining group of the character. There is no default value for this attribute.

[Joining attributes for collection] ==
      
  jt              (Y | N)      "N"
  jg              (Y | N)      "N"

[Joining attributes for char] ==
      
  jt          ( C | D | L | R | T | U )            "U"
  jg          CDATA                                #IMPLIED
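The dependency between jt and jg is not something a DTD can enforce, so a consumer may want to check it. A minimal sketch, with hypothetical names, for documents in which both properties are expressed:

```java
final class JoiningCheck {
  // True when the rule above calls for a jg attribute, that is, when the
  // joining class is neither U, C nor T.
  static boolean needsJoiningGroup(String jt) {
    return !("U".equals(jt) || "C".equals(jt) || "T".equals(jt));
  }

  // True when a (jt, jg) pair satisfies the rule; jg is null if absent.
  static boolean isConsistent(String jt, String jg) {
    return !needsJoiningGroup(jt) || jg != null;
  }
}
```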

2.14. Linebreak Properties

The linebreak property is represented by the lb attribute. The possible values are those listed in Table 1 of UTR 14. The most common value, AL, is the default.

[Linebreak attributes for collection] ==
      
  lb              (Y | N)      "N"

[Linebreak attributes for char] ==
      
  lb      ( AI | AL | B2 | BA | BB | BK | CB 
          | CL | CM | CR | EX | GL | HY | ID
          | IN | IS | LF | NS | NU | OP | PO
          | PR | QU | SA | SG | SP | SY | XX 
          | ZW)                                     "AL"

2.15. East Asian Width Properties

The east asian width property is represented by the ea attribute. The possible values are the abbreviated names listed in section 4 of UTR 11. The most common value, N, is the default.

[East Asian Width attributes for collection] ==
      
  ea              (Y | N)      "N"

[East Asian Width attributes for char] ==
      
  ea      ( A | F | H | N | Na | W )            "N"

2.16. Case Properties

If a character is cased (that is, its general category is Lu, Ll or Lt), then simple case mappings (if expressed in the data file) must be present and they are recorded using the suc, slc and stc attributes. The values of these attributes are character sequences, represented by their code points, and there are no default values.

[Case mapping attributes for collection] ==
      
  suc              (Y | N)      "N"
  slc              (Y | N)      "N"
  stc              (Y | N)      "N"

[Case mapping attributes for char] ==
      
  suc      CDATA #IMPLIED
  slc      CDATA #IMPLIED
  stc      CDATA #IMPLIED

If the character has non-simple casing, this is captured by the uc, lc and tc attributes:

[Case mapping attributes for collection] ==
      
  uc              (Y | N)      "N"
  lc              (Y | N)      "N"
  tc              (Y | N)      "N"

[Case mapping attributes for char] ==
      
  uc      CDATA #IMPLIED
  lc      CDATA #IMPLIED
  tc      CDATA #IMPLIED

2.17. Script Properties

The script property is represented by the sc attribute, using the values specified by PropertyAliases. There is no default value.

[Script attributes for collection] ==
      
  sc              (Y | N)      "N"

[Script attributes for char] ==
      
  sc      ( Arab | Armn | Beng | Bopo | Cans | Cher | Cyrl | Deva 
          | Dsrt | Ethi | Geor | Goth | Grek | Gujr | Guru | Hang
          | Hani | Hebr | Hira | Ital | Kana | Khmr | Knda | Laoo
          | Latn | Mlym | Mong | Mymr | Ogam | Orya | Qaai | Runr
          | Sinh | Syrc | Taml | Telu | Thaa | Thai | Tibt | Yiii
          | Zyyy)          #IMPLIED

2.18. ISO Comment Properties

The ISO 10646 comment field is represented by the iso attribute. The default value is the empty string.

[ISO comment attributes for collection] ==
      
  iso              (Y | N)      "N"

[ISO comment attributes for char] ==
      
  iso  CDATA ""

2.19. Unihan properties

The Unihan properties (from Unihan.txt) are represented as attributes, except the kPhonetic property, which can occur multiple times and is therefore represented by an element. No property has a default value.

[Unihan attributes for collection] ==
      
  kAlternateHanYu             (Y | N)    "N"
  kAlternateKangXi            (Y | N)    "N"
  kAlternateMorohashi         (Y | N)    "N"
  kBigFive                    (Y | N)    "N"
  kCCCII                      (Y | N)    "N"
  kCNS1986                    (Y | N)    "N"
  kCNS1992                    (Y | N)    "N"
  kCangjie                    (Y | N)    "N"
  kCantonese                  (Y | N)    "N"
  kCowles                     (Y | N)    "N"
  kDaeJaweon                  (Y | N)    "N"
  kDefinition                 (Y | N)    "N"
  kEACC                       (Y | N)    "N"
  kFenn                       (Y | N)    "N"
  kGB0                        (Y | N)    "N"
  kGB1                        (Y | N)    "N"
  kGB3                        (Y | N)    "N"
  kGB5                        (Y | N)    "N"
  kGB7                        (Y | N)    "N"
  kGB8                        (Y | N)    "N"
  kHKGlyph                    (Y | N)    "N"
  kHKSCS                      (Y | N)    "N"
  kHanYu                      (Y | N)    "N"
  kIBMJapan                   (Y | N)    "N"
  kIRGDaeJaweon               (Y | N)    "N"
  kIRGDaiKanwaZiten           (Y | N)    "N"
  kIRGHanyuDaZidian           (Y | N)    "N"
  kIRGKangXi                  (Y | N)    "N"
  kIRG_GSource                (Y | N)    "N"
  kIRG_HSource                (Y | N)    "N"
  kIRG_JSource                (Y | N)    "N"
  kIRG_KPSource               (Y | N)    "N"
  kIRG_KSource                (Y | N)    "N"
  kIRG_TSource                (Y | N)    "N"
  kIRG_VSource                (Y | N)    "N"
  kJIS0213                    (Y | N)    "N"
  kJapaneseKun                (Y | N)    "N"
  kJapaneseOn                 (Y | N)    "N"
  kJis0                       (Y | N)    "N"
  kJis1                       (Y | N)    "N"
  kKPS0                       (Y | N)    "N"
  kKPS1                       (Y | N)    "N"
  kKSC0                       (Y | N)    "N"
  kKSC1                       (Y | N)    "N"
  kKangXi                     (Y | N)    "N"
  kKarlgren                   (Y | N)    "N"
  kKorean                     (Y | N)    "N"
  kLau                        (Y | N)    "N"
  kMainlandTelegraph          (Y | N)    "N"
  kMandarin                   (Y | N)    "N"
  kMatthews                   (Y | N)    "N"
  kMeyerWempe                 (Y | N)    "N"
  kMorohashi                  (Y | N)    "N"
  kNelson                     (Y | N)    "N"
  kPhonetic                   (Y | N)    "N"
  kPseudoGB1                  (Y | N)    "N"
  kRSJapanese                 (Y | N)    "N"
  kRSKanWa                    (Y | N)    "N"
  kRSKangXi                   (Y | N)    "N"
  kRSKorean                   (Y | N)    "N"
  kRSUnicode                  (Y | N)    "N"
  kSemanticVariant            (Y | N)    "N"
  kSimplifiedVariant          (Y | N)    "N"
  kSpecializedSemanticVariant (Y | N)    "N"
  kTaiwanTelegraph            (Y | N)    "N"
  kTang                       (Y | N)    "N"
  kTotalStrokes               (Y | N)    "N"
  kTraditionalVariant         (Y | N)    "N"
  kVietnamese                 (Y | N)    "N"
  kXerox                      (Y | N)    "N"
  kZVariant                   (Y | N)    "N"

[Unihan attributes for char] ==
      
  kAlternateHanYu             CDATA  #IMPLIED
  kAlternateKangXi            CDATA  #IMPLIED
  kAlternateMorohashi         CDATA  #IMPLIED
  kBigFive                    CDATA  #IMPLIED
  kCCCII                      CDATA  #IMPLIED
  kCNS1986                    CDATA  #IMPLIED
  kCNS1992                    CDATA  #IMPLIED
  kCangjie                    CDATA  #IMPLIED
  kCantonese                  CDATA  #IMPLIED
  kCowles                     CDATA  #IMPLIED
  kDaeJaweon                  CDATA  #IMPLIED
  kDefinition                 CDATA  #IMPLIED
  kEACC                       CDATA  #IMPLIED
  kFenn                       CDATA  #IMPLIED
  kGB0                        CDATA  #IMPLIED
  kGB1                        CDATA  #IMPLIED
  kGB3                        CDATA  #IMPLIED
  kGB5                        CDATA  #IMPLIED
  kGB7                        CDATA  #IMPLIED
  kGB8                        CDATA  #IMPLIED
  kHKGlyph                    CDATA  #IMPLIED
  kHKSCS                      CDATA  #IMPLIED
  kHanYu                      CDATA  #IMPLIED
  kIBMJapan                   CDATA  #IMPLIED
  kIRGDaeJaweon               CDATA  #IMPLIED
  kIRGDaiKanwaZiten           CDATA  #IMPLIED
  kIRGHanyuDaZidian           CDATA  #IMPLIED
  kIRGKangXi                  CDATA  #IMPLIED
  kIRG_GSource                CDATA  #IMPLIED
  kIRG_HSource                CDATA  #IMPLIED
  kIRG_JSource                CDATA  #IMPLIED
  kIRG_KPSource               CDATA  #IMPLIED
  kIRG_KSource                CDATA  #IMPLIED
  kIRG_TSource                CDATA  #IMPLIED
  kIRG_VSource                CDATA  #IMPLIED
  kJIS0213                    CDATA  #IMPLIED
  kJapaneseKun                CDATA  #IMPLIED
  kJapaneseOn                 CDATA  #IMPLIED
  kJis0                       CDATA  #IMPLIED
  kJis1                       CDATA  #IMPLIED
  kKPS0                       CDATA  #IMPLIED
  kKPS1                       CDATA  #IMPLIED
  kKSC0                       CDATA  #IMPLIED
  kKSC1                       CDATA  #IMPLIED
  kKangXi                     CDATA  #IMPLIED
  kKarlgren                   CDATA  #IMPLIED
  kKorean                     CDATA  #IMPLIED
  kLau                        CDATA  #IMPLIED
  kMainlandTelegraph          CDATA  #IMPLIED
  kMandarin                   CDATA  #IMPLIED
  kMatthews                   CDATA  #IMPLIED
  kMeyerWempe                 CDATA  #IMPLIED
  kMorohashi                  CDATA  #IMPLIED
  kNelson                     CDATA  #IMPLIED
  kPseudoGB1                  CDATA  #IMPLIED
  kRSJapanese                 CDATA  #IMPLIED
  kRSKanWa                    CDATA  #IMPLIED
  kRSKangXi                   CDATA  #IMPLIED
  kRSKorean                   CDATA  #IMPLIED
  kRSUnicode                  CDATA  #IMPLIED
  kSemanticVariant            CDATA  #IMPLIED
  kSimplifiedVariant          CDATA  #IMPLIED
  kSpecializedSemanticVariant CDATA  #IMPLIED
  kTaiwanTelegraph            CDATA  #IMPLIED
  kTang                       CDATA  #IMPLIED
  kTotalStrokes               CDATA  #IMPLIED
  kTraditionalVariant         CDATA  #IMPLIED
  kVietnamese                 CDATA  #IMPLIED
  kXerox                      CDATA  #IMPLIED
  kZVariant                   CDATA  #IMPLIED

[kPhonetic element] ==
      

<!ELEMENT kPhonetic EMPTY>
<!ATTLIST kPhonetic v CDATA #REQUIRED>

[Unihan elements for char] ==
      
| kPhonetic

2.20. Complete DTD

Finally, we can put our DTD together:

[UCD DTD] ==
      
  
[dtd: 1, 2, 3, 4, 5, 6, 7]

3. Examples

Here is a fragment of the full UCD (i.e. all properties expressed), for a few representative characters:

[Examples] ==
      

   <char cp="001F" age="1.1" na="&lt;control&gt;" na1="UNIT SEPARATOR" 
         gc="Cc" bc="S" lb="CM">
      <alias v="UNIT SEPARATOR"/>
   </char>

   <char cp="0020" age="1.1" na="SPACE" gc="Zs" bc="WS" ea="Na" lb="SP">
      <comment v="sometimes considered a control code"/>
      <comment v="other space characters: 2000-200A"/>
      <crossRef v="00A0"/>
      <crossRef v="200B"/>
      <crossRef v="3000"/>
      <crossRef v="FEFF"/>
   </char>

   <char cp="0026" age="1.1" na="AMPERSAND" gc="Po" bc="ON" ea="Na"/>

   <char cp="0028" age="1.1" na="LEFT PARENTHESIS" na1="OPENING PARENTHESIS"
         gc="Ps" bc="ON" Bidi_M="Y" bmg="0029" ea="Na" lb="OP">
      <alias v="OPENING PARENTHESIS"/>
   </char>

   <char cp="0041" age="1.1" na="LATIN CAPITAL LETTER A"
         gc="Lu" slc="0061" ea="Na" sc="Latn"/>

   <char cp="AC00" age="2.0" na="HANGUL SYLLABLE GA" gc="Lo"
         dt="can" dm="1100 1161" ea="W" lb="ID" sc="Hang"/>

   <char cp="20094" age="3.1" na="CJK UNIFIED IDEOGRAPH-20094"
         gc="Lo" ea="W" lb="ID" sc="Hani" kIRG_GSource="KX"
         kIRGHanyuDaZidian="10036.060" kIRG_TSource="5-214E"
         kRSUnicode="4.3" kIRGKangXi="0082.090">
      <kPhonetic v="0139"/>
   </char>
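As an illustration of the "grep" style of use mentioned in the introduction, the following sketch runs an XPath query over a small collection with the javax.xml.xpath API of the Java platform. The class name XPathGrep and the two-character sample are ours. Since the sample is parsed without the DTD, no default attribute values are supplied by the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathGrep {
  // A tiny collection, abridged from the examples above.
  static final String SAMPLE =
      "<collection cps='0028 0041'>"
      + "<char cp='0028' na='LEFT PARENTHESIS' gc='Ps' bc='ON' Bidi_M='Y' bmg='0029'/>"
      + "<char cp='0041' na='LATIN CAPITAL LETTER A' gc='Lu' slc='0061'/>"
      + "</collection>";

  // Returns the cp attribute of every char whose gc attribute has the given
  // value. The gc argument is assumed to be a plain two-letter category.
  static List<String> codePointsWithGc(String xml, String gc) {
    List<String> result = new ArrayList<>();
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList nodes = (NodeList) xpath.evaluate(
          "/collection/char[@gc='" + gc + "']/@cp",
          new InputSource(new StringReader(xml)),
          XPathConstants.NODESET);
      for (int i = 0; i < nodes.getLength(); i++) {
        result.add(nodes.item(i).getNodeValue());
      }
    } catch (Exception e) {
      throw new IllegalStateException("XPath evaluation failed", e);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(codePointsWithGc(SAMPLE, "Lu"));
  }
}
```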

4. UCD to XML

This section contains a program that creates the XML representation of the UCD from the current set of files.

Our general strategy is to first parse UnicodeData.txt. For each code point listed in it, we create a CodePoint object to record its properties, and start to fill them in. Next, we parse NamesList.txt; this provides us with the assigned code points of non-characters, as well as all the comments, aliases and cross references. We then parse the other data files to record more properties or override the properties from UnicodeData.txt.

4.1. Internal data structures

Here is the class that captures all the properties of a code point.

[CodePoint class] ==
      
  public static final int CHAR = 1;
  public static final int NOTACHAR = 2;
  public static final int RESERVED = 3;

  public class CodePoint {
    int type;  // CHAR, NOTACHAR or RESERVED
    int cp;
    String age;

    String[] aliases = null;
    String[] comments = null;
    String[] crossRefs = null;

    String na;
    String na1;
    String gc;
    String ccc;
    String bc;
    String Bidi_M;
    String bmg;
    String dt;
    String dm;
    String nt;
    String nv;
    String jt;
    String jg;
    String lb;
    String ea;

    String suc;
    String slc;
    String stc;

    String uc;
    String lc;
    String tc;

    String sc;
    String iso;  

    String[] kPhonetic;
    java.util.Map unihan = null;

    
[codepoint.methods: 1, 2]
  }

We collect all the code points in a map, indexed by the code point (represented as an Integer):

[All the code points] ==
      
  public java.util.Map codePoints = new java.util.TreeMap ();

The constructor is given the code point value and type of the code point, and inserts itself in the collection of all code points. Some of the properties, such as the east asian width, are not expressed explicitly for all code points. In the constructor, we assign default values for those properties as appropriate; these default values are described as we explore the data files.

[CodePoint constructor] ==
      
  public CodePoint (int type, int cp) {
    this.type = type;
    this.cp = cp;
    codePoints.put (new Integer (cp), this);

    
[properties.initialize: 1, 2, 3, 4, 5]
  }

Here is a method to emit the XML element corresponding to a code point. Although it is a bit long, it is rather straightforward:

[Emitting the XML element of a code point] ==
      
  public void toXML (ContentHandlerPlus ch) 
      throws SAXException {

    AttributesImpl at = new AttributesImpl ();
    at.addAttribute ("", "cp", "cp", "CDATA", toU (cp));

    String elt = null;

    // The DTD allows age on char, notachar and reserved elements alike.
    if (null != age) {
      at.addAttribute ("", "age", "age", "CDATA", age); }

    if (type == RESERVED) {
      elt = "reserved"; }

    else if (type == NOTACHAR) {
      elt = "notachar"; }

    else {

      if (! "".equals (na)) {
        at.addAttribute ("", "na", "na", "CDATA", na); }

      if (! "".equals (na1)) {
        at.addAttribute ("", "na1", "na1", "CDATA", na1); }

      if (! "".equals (iso)) {
        at.addAttribute ("", "iso", "iso", "CDATA", iso); }

      at.addAttribute ("", "gc", "gc", "CDATA", gc);

      if (! "0".equals (ccc)) {
        at.addAttribute ("", "ccc", "ccc", "CDATA", ccc); }

      if (! "no".equals (nt)) {
        at.addAttribute ("", "nt", "nt", "CDATA", nt);
        at.addAttribute ("", "nv", "nv", "CDATA", nv); }

      if (! "L".equals (bc)) {
        at.addAttribute ("", "bc", "bc", "CDATA", bc); }

      if (! "N".equals (Bidi_M)) {
        at.addAttribute ("", "Bidi_M", "Bidi_M", "CDATA", "Y"); }

      if (null != bmg) {
        at.addAttribute ("", "bmg", "bmg", "CDATA", bmg); }

      if (! "".equals (suc)) {
        at.addAttribute ("", "suc", "suc", "CDATA", toU (suc)); }
      if (! "".equals (slc)) {
        at.addAttribute ("", "slc", "slc", "CDATA", toU (slc)); }
      if (! "".equals (stc)) {
        at.addAttribute ("", "stc", "stc", "CDATA", toU (stc)); }

      if (uc != null) {
        at.addAttribute ("", "uc", "uc", "CDATA", toU (uc)); }
      if (lc != null) {
        at.addAttribute ("", "lc", "lc", "CDATA", toU (lc)); }
      if (tc != null) {
        at.addAttribute ("", "tc", "tc", "CDATA", toU (tc)); }

      if (! "".equals (dt)) {
        at.addAttribute ("", "dt", "dt", "CDATA", dt); }

      if (! "".equals (dm)) {
        at.addAttribute ("", "dm", "dm", "CDATA", dm); }

      if (! "U".equals (jt)) {
        at.addAttribute ("", "jt", "jt", "CDATA", jt); }

      if (! "U".equals (jt) && ! "C".equals (jt) && ! "T".equals (jt)) {
        at.addAttribute ("", "jg", "jg", "CDATA", jg); }

      if (! "N".equals (ea)) {
        at.addAttribute ("", "ea", "ea", "CDATA", ea); }

      if (! "AL".equals (lb)) {
        at.addAttribute ("", "lb", "lb", "CDATA", lb); }

      if (! "Zyyy".equals (sc)) {
        at.addAttribute ("", "sc", "sc", "CDATA", sc); }

      if (unihan != null) {
        for (Iterator it = unihan.keySet ().iterator (); it.hasNext (); ) {
          String key = (String) it.next ();
          String val = (String) unihan.get (key);
          at.addAttribute ("", key, key, "CDATA", val); }}

      elt = "char"; }

    ch.startElement (elt, at); 

    if (aliases != null) {
      for (int i = 0; i < aliases.length; i++) {
        at = new AttributesImpl ();
        at.addAttribute ("", "v", "v", "CDATA", aliases[i]);
        ch.element ("alias", at); }}

    if (comments != null) {
      for (int i = 0; i < comments.length; i++) {
        at = new AttributesImpl ();
        at.addAttribute ("", "v", "v", "CDATA", comments[i]);
        ch.element ("comment", at); }}

    if (crossRefs != null) {
      for (int i = 0; i < crossRefs.length; i++) {
        at = new AttributesImpl ();
        at.addAttribute ("", "v", "v", "CDATA", toU (crossRefs[i]));
        ch.element ("crossRef", at); }}

    if (kPhonetic != null) {
      for (int i = 0; i < kPhonetic.length; i++) {
        at = new AttributesImpl ();
        at.addAttribute ("", "v", "v", "CDATA", toU (kPhonetic[i]));
        ch.element ("kPhonetic", at); }}
     
    ch.endElement (elt);
  }

4.2. Utilities

Here are two methods to normalize the representation of a code point, starting either from an integer or a string:

[Formatting of code points] ==
      
  public String toU (int n) {
    return toU (Integer.toHexString (n));
  }

  public String toU (String s) {
    while (s.length () < 4) {
      s = "0" + s; }
    return s.toUpperCase ();
  }
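
As a quick check of the padding behavior, here is a stand-alone sketch that copies the two methods into a hypothetical class ToUDemo (the class name and the static modifiers are assumptions made for the demo, not part of the program) and formats a few values:

```java
public class ToUDemo {
  // Same logic as the two toU methods, made static for a stand-alone demo.
  public static String toU (int n) {
    return toU (Integer.toHexString (n));
  }

  public static String toU (String s) {
    // Pad on the left with zeros up to four digits, then upper-case.
    while (s.length () < 4) {
      s = "0" + s; }
    return s.toUpperCase ();
  }

  public static void main (String[] args) {
    System.out.println (toU (0x41));     // prints 0041
    System.out.println (toU ("e9"));     // prints 00E9
    System.out.println (toU (0x10330));  // prints 10330
  }
}
```

Code points beyond the BMP already have five or six digits, so they are only upper-cased.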

Adding a string to an array of strings:

[Adding a String to a String[]] ==
      
  public String[] addString (String[] a, String s) {
    if (a == null) {
      a = new String [1];
      a [0] = s;
      return a; }
    else {
      String[] b = new String [a.length + 1];
      System.arraycopy (a, 0, b, 0, a.length);
      b [a.length] = s;
      return b; }
  }

The data files are all record oriented. After removing comments, we are fundamentally left with a collection of (code point, fields), where the meaning of the fields depends on the specific data file. Here is an abstract class to receive this data, one code point at a time:

[Loader class] ==
      
  public abstract class  Loader {
    public abstract void process (CodePoint p, String[] fields);
  }

Many of the data files share the same format: each line represents one code point or a group of consecutive code points; a line is organized into fields, separated by semicolons; # starts a comment. This method unwinds such a file, calling the process method on the Loader for each code point. Note that we filter out the surrogates and the private use areas:

[Parsing method for most data files] ==
      
  public void parseSemiFile (String filename, Loader l)
        throws java.io.IOException {

    java.io.LineNumberReader rd
        = new java.io.LineNumberReader 
           (new java.io.FileReader (filename));

    do {
      String s = rd.readLine ();
      if (s == null) {
        break; }

      int comment = s.indexOf ('#');
      if (comment != -1) {
        s = s.substring (0, comment); }

      if (s.length () < 2) {
        continue; }

      java.util.StringTokenizer st = new java.util.StringTokenizer (s, ";");
      int nFields = st.countTokens ();
      String[] fields = new String [nFields];
      for (int i = 0; i < nFields; i++) {
        fields [i] = st.nextToken ().trim (); }

      int first, last;
      int dotdot = fields[0].indexOf ("..");
      if (dotdot != -1) {
        first = Integer.parseInt (fields [0].substring (0, dotdot), 16);
        last = Integer.parseInt (fields [0].substring (dotdot+2), 16); }
      else {
        first = Integer.parseInt (fields[0], 16);
        last = first; }

      for (int cp = first; cp <= last; cp++) {
        if (   0xd800 <= cp && cp <=0xf8ff
            || 0xf0000 <= cp) {
          continue; }
        CodePoint p = (CodePoint) codePoints.get (new Integer (cp));
        l.process (p, fields); }}
        
    while (true);
  }
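
To make the line format concrete, here is a minimal sketch of the two steps applied to each line: stripping the comment and splitting into fields, then expanding the first field into a range. The class SemiLineDemo and its method names are hypothetical, and the sample line is only in the style of the data files. The sketch uses String.split where the method above uses a StringTokenizer, so it keeps empty fields in the middle of a line.

```java
import java.util.Arrays;

public class SemiLineDemo {
  /** Strips the # comment and splits the rest into trimmed fields. */
  public static String[] splitFields (String line) {
    int comment = line.indexOf ('#');
    if (comment != -1) {
      line = line.substring (0, comment); }
    String[] fields = line.split (";");
    for (int i = 0; i < fields.length; i++) {
      fields [i] = fields [i].trim (); }
    return fields;
  }

  /** Returns {first, last} for a field such as "0041" or "0041..005A". */
  public static int[] parseRange (String field) {
    int dotdot = field.indexOf ("..");
    if (dotdot != -1) {
      return new int[] {
        Integer.parseInt (field.substring (0, dotdot), 16),
        Integer.parseInt (field.substring (dotdot + 2), 16) }; }
    int cp = Integer.parseInt (field, 16);
    return new int[] { cp, cp };
  }

  public static void main (String[] args) {
    String[] fields = splitFields ("0041..005A ; LATIN # uppercase letters");
    int[] range = parseRange (fields [0]);
    System.out.println (Arrays.toString (fields));
    System.out.println (range [0] + " to " + range [1]);
  }
}
```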

4.3. UnicodeData.txt

UnicodeData.txt is mostly one line per character, but there are also a few ranges, each given as one line for the first character in the range and one line for the last. We unwind the Hangul syllable and CJK ideograph ranges, gathering most of the data from the file but computing a couple of fields (the name and, for Hangul syllables, the decomposition). We simply ignore the Private Use and Surrogate ranges.

[Parser for UnicodeData.txt] ==
      
  public void parseUnicodeData (String filename, Loader loader)
      throws java.io.IOException {

    java.io.LineNumberReader rd  
       = new java.io.LineNumberReader 
          (new java.io.FileReader (filename));


    String[] choseong = {"G", "GG", "N", "D", "DD", "R", "M", "B",
                         "BB", "S", "SS", "", "J", "JJ", "C", "K",
                         "T", "P", "H"};
    String[] jungseong = {"A", "AE", "YA", "YAE", "EO", "E", "YEO", 
                          "YE", "O", "WA", "WAE", "OE", "YO", 
                          "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"};
    String[] jongseong = {"", "G", "GG", "GS", "N", "NJ", "NH", "D",
                          "L", "LG", "LM", "LB", "LS", "LT", "LP",
                          "LH", "M", "B", "BS", "S", "SS", "NG",
                          "J", "C", "K", "T", "P", "H"}; 
      
    String [] fields = new String [15];
    

    do {
      String s = rd.readLine ();
      if (s == null) {
        break; }

      int start = 0;
      for (int f = 0; f < 14; f++) {
        int semi = s.indexOf (';', start); 
        fields [f] = s.substring (start, semi);
        start = semi + 1; }
      fields [14] = s.substring (start);

      int cp = Integer.parseInt (fields [0], 16);

      if ("<Hangul Syllable, First>".equals (fields [1])) {
        s = rd.readLine ();  // skip "..., Last"
        for (int l = 0; l < 19; l++) {
          for (int v = 0; v < 21; v++) {
            for (int t = 0; t < 28; t++) {
              CodePoint p = new CodePoint (CHAR, 0xac00 + (l * 21 + v) * 28 + t);

              loader.process (p, fields);

              p.na = "HANGUL SYLLABLE " 
                        + choseong [l] + jungseong [v] + jongseong [t];
              p.dt = "can";
              p.dm = toU (0x1100 + l) + " " + toU (0x1161 + v);
              if (t != 0) {
                p.dm += " " + toU (0x11a7 + t); }}}}}

      else if ("<CJK Ideograph Extension A, First>".equals (fields [1])
            || "<CJK Ideograph, First>".equals (fields [1])
            || "<CJK Ideograph Extension B, First>".equals (fields [1])) {
        s = rd.readLine ();
        int lastCp = Integer.parseInt (s.substring (0, s.indexOf (';', 0)), 16);
        for (int i = cp; i <= lastCp; i++) {
          CodePoint p = new CodePoint (CHAR, i);
          loader.process (p, fields);
          p.na = "CJK UNIFIED IDEOGRAPH-" + toU (i); }}

      else if ("<Non Private Use High Surrogate, First>".equals (fields [1])
            || "<Private Use High Surrogate, First>".equals (fields [1])
            || "<Low Surrogate, First>".equals (fields [1])
            || "<Private Use, First>".equals (fields [1])
            || "<Plane 15 Private Use, First>".equals (fields [1])
            || "<Plane 16 Private Use, First>".equals (fields [1])) {
        // ignore those
        s = rd.readLine (); }

      else {
        CodePoint p = new CodePoint (CHAR, cp);
        loader.process (p, fields); }}
    while (true);
  }  
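
The three nested loops enumerate every (l, v, t) triple. The same arithmetic can be run in the other direction, from a single code point to its name and decomposition. The following is a sketch under the assumption that the argument lies in the range AC00..D7A3; the class HangulDemo and its method names are hypothetical.

```java
public class HangulDemo {
  static final String[] CHOSEONG = {
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
    "SS", "", "J", "JJ", "C", "K", "T", "P", "H"};
  static final String[] JUNGSEONG = {
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
    "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"};
  static final String[] JONGSEONG = {
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
    "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
    "K", "T", "P", "H"};

  static String hex (int n) {
    return Integer.toHexString (n).toUpperCase ();
  }

  /** Name of the syllable at cp, assumed to be in AC00..D7A3. */
  public static String name (int cp) {
    int s = cp - 0xAC00;
    int l = s / (21 * 28);          // leading consonant index
    int v = (s % (21 * 28)) / 28;   // vowel index
    int t = s % 28;                 // trailing consonant index, 0 if none
    return "HANGUL SYLLABLE " + CHOSEONG [l] + JUNGSEONG [v] + JONGSEONG [t];
  }

  /** Full canonical decomposition of the syllable at cp, as jamo code points. */
  public static String decomposition (int cp) {
    int s = cp - 0xAC00;
    int l = s / (21 * 28);
    int v = (s % (21 * 28)) / 28;
    int t = s % 28;
    String dm = hex (0x1100 + l) + " " + hex (0x1161 + v);
    if (t != 0) {
      dm += " " + hex (0x11A7 + t); }
    return dm;
  }

  public static void main (String[] args) {
    System.out.println (name (0xD4DB));           // prints HANGUL SYLLABLE PWILH
    System.out.println (decomposition (0xD4DB));  // prints 1111 1171 11B6
  }
}
```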

Given the data for one code point, storing the properties is mostly straightforward; only the decomposition and numeric properties require a bit of processing.

[Process UnicodeData.txt] ==
      
  parseUnicodeData ("UnicodeData.txt",
                    new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.na = fields [1];
      p.gc = fields [2].intern ();
      p.ccc = fields [3];
      p.bc = fields [4];

      if ("".equals (fields [5])) {
        p.dt = "";
        p.dm = ""; }
      else if (fields [5].indexOf (">") >= 0) {
        String tag = fields [5].substring (1, fields [5].indexOf ('>'));
        p.dm = fields [5].substring (fields [5].indexOf ('>') + 2);
    
             if ("compat".equals (tag)) {     p.dt = "com"; }
        else if ("circle".equals (tag)) {     p.dt = "enc"; }
        else if ("final".equals (tag)) {      p.dt = "fin"; }
        else if ("font".equals (tag)) {       p.dt = "font"; }
        else if ("fraction".equals (tag)) {   p.dt = "fra"; }
        else if ("initial".equals (tag)) {    p.dt = "init"; }
        else if ("isolated".equals (tag)) {   p.dt = "iso"; }
        else if ("medial".equals (tag)) {     p.dt = "med"; }
        else if ("narrow".equals (tag)) {     p.dt = "nar"; }
        else if ("noBreak".equals (tag)) {    p.dt = "nb"; }
        else if ("small".equals (tag)) {      p.dt = "sml"; }
        else if ("square".equals (tag)) {     p.dt = "sqr"; }
        else if ("super".equals (tag)) {      p.dt = "sup"; }
        else if ("sub".equals (tag)) {        p.dt = "sub"; }
        else if ("vertical".equals (tag)) {   p.dt = "vert"; }
        else if ("wide".equals (tag))     {   p.dt = "wide"; }
        else {
          System.err.println ("Unknown compatibility tag: '" + tag + "'"); }}
      else {
        p.dt  = "can";
        p.dm = fields [5]; }

      if (! "".equals (fields [6])) {
        p.nt = "de";
        p.nv = fields [6]; }
      else if (! "".equals (fields [7])) {
        p.nt = "di";
        p.nv = fields [7]; }
      else if (! "".equals (fields [8])) {
        p.nt = "nu";
        p.nv = fields [8]; }
      else {
        p.nt = "no";
        p.nv = ""; }

      p.Bidi_M = fields [9];
      p.na1 = fields [10];
      p.iso = fields [11];
      p.suc = fields [12];
      p.slc = fields [13];
      p.stc = fields [14]; }});
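
The handling of the decomposition field can be tried in isolation. The sketch below is table-driven rather than a chain of tests, which is a presentation choice made for the demo and not how the program does it; the class DecompositionDemo is hypothetical. An unknown tag yields a null type here, where the program prints a warning.

```java
import java.util.HashMap;
import java.util.Map;

public class DecompositionDemo {
  // Compatibility tag in UnicodeData.txt -> value of the dt attribute.
  static final Map<String, String> TAGS = new HashMap<String, String> ();
  static {
    String[][] pairs = {
      {"compat", "com"}, {"circle", "enc"}, {"final", "fin"},
      {"font", "font"}, {"fraction", "fra"}, {"initial", "init"},
      {"isolated", "iso"}, {"medial", "med"}, {"narrow", "nar"},
      {"noBreak", "nb"}, {"small", "sml"}, {"square", "sqr"},
      {"super", "sup"}, {"sub", "sub"}, {"vertical", "vert"},
      {"wide", "wide"}};
    for (String[] pair : pairs) {
      TAGS.put (pair [0], pair [1]); }
  }

  /** Returns {dt, dm} for the decomposition field of a UnicodeData.txt line. */
  public static String[] split (String field) {
    if (field.isEmpty ()) {
      return new String[] {"", ""}; }            // no decomposition
    if (field.startsWith ("<")) {
      int close = field.indexOf ('>');
      String tag = field.substring (1, close);
      // skip the '>' and the space that follows it
      return new String[] {TAGS.get (tag), field.substring (close + 2)}; }
    return new String[] {"can", field};          // canonical decomposition
  }

  public static void main (String[] args) {
    String[] r = split ("<noBreak> 0020");
    System.out.println (r [0] + " / " + r [1]);  // prints nb / 0020
  }
}
```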

4.4. NamesList.txt

Although we are discouraged from using NamesList.txt, it is the only published source for the comments, aliases and cross references. It is also the only published source for reserved and notacharacter code points.

Cross references include both the name and code point of the target character. We care only about the code point:

[Extracting the code point from a cross reference in NamesList.txt] ==
      
  public String parseCrossRef (String s) {
    String result = "";
    int c = 0;
    while (c < s.length ()) {
      int charStart = c;
      int count = 0;
      while (c < s.length ()
             && (('0' <= s.charAt (c) && s.charAt (c) <= '9')
                 || ('A' <= s.charAt (c) && s.charAt (c) <= 'F'))) {
        c++;
        count++; }
      if (4 <= count && count <= 6) {
        result = s.substring (charStart, charStart+count); }
      c++; }
    return result; 
  }
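
For comparison, the same extraction can be written with a regular expression: a maximal run of four to six uppercase hexadecimal digits, keeping the last such run on the line. This is an alternative sketch, not the method used by the program; the class CrossRefDemo is hypothetical and the sample strings are only in the style of NamesList.txt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossRefDemo {
  // A run of 4 to 6 uppercase hex digits that is not part of a longer run.
  private static final Pattern CODE_POINT
      = Pattern.compile ("(?<![0-9A-F])[0-9A-F]{4,6}(?![0-9A-F])");

  /** Returns the last code point found in s, or "" if there is none. */
  public static String parseCrossRef (String s) {
    String result = "";
    Matcher m = CODE_POINT.matcher (s);
    while (m.find ()) {
      result = m.group (); }
    return result;
  }

  public static void main (String[] args) {
    System.out.println (parseCrossRef (" (latin small letter a - 0061)"));
    System.out.println (parseCrossRef (" (gothic letter ahsa - 10330)"));
  }
}
```

Because the names in cross references are written in lowercase, only the code point is made of uppercase hexadecimal digits, which is what both versions rely on.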

Parsing NamesList.txt is mostly straightforward, but one detail needs care:

Since we are interested only in code points, we need to maintain some state from line to line to detect the appropriate case; specifically, we keep p pointing to the CodePoint for the current code point, and are careful to set it to null when encountering the first line of a block.

[Process NamesList.txt] ==
      
  { java.io.LineNumberReader rd 
       = new java.io.LineNumberReader 
          (new java.io.FileReader ("NamesList.txt"));

    CodePoint p = null;

    do {
      String s = rd.readLine ();
      if (s == null) {
        break; }

      if (s.startsWith ("@@@\t")) { // ignore
        p = null; }

      else if (s.startsWith ("@@@+\t")) { // ignore
        p = null; }

      else if (s.startsWith ("@@\t")) {
        p = null; }

      else if (s.startsWith ("@\t")) {
        p = null; }

      else if (s.startsWith ("@+\t")) { // comment
        if (p != null) {
          p.comments = addString (p.comments, s.substring (3)); }}

      else if (! s.startsWith ("\t")) {
        int tab = s.indexOf ('\t');
        int cp = Integer.parseInt (s.substring (0, tab), 16);
        String name = s.substring (tab + 1);

        if ("<reserved>".equals (name)) {
          p = new CodePoint (RESERVED, cp); }

        else if ("<not a character>".equals (name)) {
          p = new CodePoint (NOTACHAR, cp); }

        else {
          p = (CodePoint) codePoints.get (new Integer (cp)); }}

      else if (s.startsWith ("\t=")) {
        if (p != null) {
          p.aliases = addString (p.aliases, s.substring (3)); }}

      else if (s.startsWith ("\t*")) { // comment
        if (p != null) {
          p.comments = addString (p.comments, s.substring (3)); }}

      else if (s.startsWith ("\tx")) {  // crossref
        if (p != null) {
          p.crossRefs = addString (p.crossRefs, parseCrossRef (s.substring (2))); }}}

     while (true); }
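
The role of the line-to-line state can be seen on a reduced version of the loop that only collects aliases. This is a sketch: the class NamesListDemo is hypothetical, it stores aliases in a map rather than in CodePoint objects, and the input lines are only in the style of NamesList.txt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NamesListDemo {
  /** Collects the "= alias" lines found under each code point line. */
  public static Map<Integer, List<String>> aliases (String[] lines) {
    Map<Integer, List<String>> result
        = new LinkedHashMap<Integer, List<String>> ();
    List<String> current = null;  // aliases of the current code point, if any

    for (String s : lines) {
      if (s.startsWith ("@+\t")) {
        continue; }                // notice line: keep the current code point
      else if (s.startsWith ("@")) {
        current = null; }          // header line: no current code point
      else if (! s.startsWith ("\t")) {
        int tab = s.indexOf ('\t');
        int cp = Integer.parseInt (s.substring (0, tab), 16);
        current = new ArrayList<String> ();
        result.put (cp, current); }
      else if (s.startsWith ("\t= ") && current != null) {
        current.add (s.substring (3)); }}

    return result;
  }

  public static void main (String[] args) {
    String[] lines = {
      "@@\t0000\tC0 Controls and Basic Latin\t007F",
      "0021\tEXCLAMATION MARK",
      "\t= factorial",
      "\t= bang",
      "0022\tQUOTATION MARK"};
    System.out.println (aliases (lines));
  }
}
```

Without the reset on header lines, an annotation that follows a block header would be attached to the last code point of the previous block.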

4.5. Unihan.txt

[Process Unihan.txt] ==
      
  { java.io.LineNumberReader rd
       = new java.io.LineNumberReader 
          (new java.io.FileReader ("Unihan.txt"));

    CodePoint p = null;

    do {
      String s = rd.readLine ();
      if (s == null) {
        break; }

      if (s.length () == 0 || s.charAt (0) == '#') {
        continue; }

      int t1 = s.indexOf ('\t');
      int t2 = s.indexOf ('\t', t1 + 1);

      int codePoint = Integer.parseInt (s.substring (2, t1), 16);
      String prop = s.substring (t1 + 1, t2);
      String value = s.substring (t2 + 1);

      if (p == null || p.cp != codePoint) {
        p = (CodePoint) codePoints.get (new Integer (codePoint)); }

      if ("kPhonetic".equals (prop)) {
        p.kPhonetic = addString (p.kPhonetic, value); }

      else {
        if (p.unihan == null) {
          p.unihan = new java.util.HashMap (); }

        p.unihan.put (prop, value); }}
    while (true); }
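
Each data line of Unihan.txt has three tab-separated parts: the code point in U+ notation, the property name and the value. Here is a minimal sketch of the splitting step alone; the class UnihanLineDemo is hypothetical and the sample line is illustrative.

```java
public class UnihanLineDemo {
  /** Splits "U+XXXX<tab>property<tab>value" into its three parts. */
  public static String[] split (String line) {
    int t1 = line.indexOf ('\t');
    int t2 = line.indexOf ('\t', t1 + 1);
    return new String[] {
      line.substring (2, t1),       // code point, without the "U+" prefix
      line.substring (t1 + 1, t2),  // property name
      line.substring (t2 + 1)};     // value, which may itself contain spaces
  }

  public static void main (String[] args) {
    String[] parts = split ("U+4E00\tkTotalStrokes\t1");
    int cp = Integer.parseInt (parts [0], 16);
    System.out.println (cp + " " + parts [1] + " " + parts [2]);
  }
}
```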

4.6. BidiMirroring.txt

BidiMirroring.txt provides the value for bmg, for some of the characters.

[BidiMirroring.txt] ==
      
  parseSemiFile ("BidiMirroring.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.bmg = toU (fields [1]); }});

Other characters do not have a mirrored glyph.

[Initial value for bmg] ==
      
  bmg = null;

4.7. ArabicShaping.txt

ArabicShaping.txt lists all the characters in classes R, L, D, and C, as well as some of the characters in class U.

[ArabicShaping.txt] ==
      
  parseSemiFile ("ArabicShaping.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.jt = fields [2];
      if (! "U".equals (p.jt) && ! "C".equals (p.jt)) {
        p.jg = fields [3]; }}});

The characters in class T are defined to be those with general category Mn or Cf, excluding U+200C and U+200D:

[Rule-based] ==
      
  for (Iterator it = codePoints.values ().iterator (); it.hasNext (); ) {
    CodePoint p = (CodePoint) it.next ();
    if (   (   "Mn".equals (p.gc) || "Cf".equals (p.gc))
        && p.cp != 0x200c // ZWNJ
        && p.cp != 0x200d) { // ZWJ
      p.jt = "T"; }}
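
The rule can be stated as a function of the code point and its general category. This is a sketch for characters that receive no joining type from ArabicShaping.txt; the class JoiningTypeDemo and the method name are hypothetical.

```java
public class JoiningTypeDemo {
  /** Joining type derived by rule from the general category. */
  public static String derivedJoiningType (int cp, String gc) {
    boolean transparent = ("Mn".equals (gc) || "Cf".equals (gc))
                          && cp != 0x200C    // ZWNJ
                          && cp != 0x200D;   // ZWJ
    return transparent ? "T" : "U";
  }

  public static void main (String[] args) {
    System.out.println (derivedJoiningType (0x0300, "Mn"));  // prints T
    System.out.println (derivedJoiningType (0x200D, "Cf"));  // prints U
    System.out.println (derivedJoiningType (0x0041, "Lu"));  // prints U
  }
}
```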

The remaining characters are in class U.

[Default value] ==
      
  jt = "U";

4.8. LineBreak.txt

LineBreak.txt provides the line break property for some of the characters:

[LineBreak.txt] ==
      
  parseSemiFile ("LineBreak.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.lb = fields [1]; }});

Remaining characters have the value AL.

[Representation of the linebreak property] ==
      
  lb = "AL";

4.9. EastAsianWidth.txt

EastAsianWidth.txt gives the value of the east asian width property for some characters.

[EastAsianWidth.txt] ==
      
  parseSemiFile ("EastAsianWidth.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.ea = fields [1]; }});

Remaining characters have the value N.

[Representation] ==
      
  ea = "N";

4.10. SpecialCasing.txt properties

The bulk of the case information is provided by UnicodeData.txt, which covers all the cases where the mappings are unconditional and target a single character.

The exceptions are provided in SpecialCasing.txt. At this point, we capture only the unconditional mappings.

[SpecialCasing.txt] ==
      
  parseSemiFile ("SpecialCasing.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      if ("".equals (fields [4])) {
        p.lc = fields [1];
        p.tc = fields [2];
        p.uc = fields [3]; }}});

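Characters that SpecialCasing.txt does not list have no special case mappings:

[Initial values for uc, lc and tc] ==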
      
  uc = null;
  lc = null;
  tc = null;
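
To illustrate the distinction between unconditional and conditional mappings, the following sketch tests whether the condition field of a line is empty. The class SpecialCasingDemo is hypothetical, and the two sample lines are only in the style of SpecialCasing.txt; the field order (lower, title, upper, conditions) is the one the loader above relies on.

```java
public class SpecialCasingDemo {
  /** Returns {lc, tc, uc} for an unconditional line, or null otherwise. */
  public static String[] unconditional (String line) {
    int comment = line.indexOf ('#');
    if (comment != -1) {
      line = line.substring (0, comment); }

    // limit -1 keeps trailing empty fields, so an empty condition is visible
    String[] fields = line.split (";", -1);
    for (int i = 0; i < fields.length; i++) {
      fields [i] = fields [i].trim (); }

    if (fields.length < 5 || ! fields [4].isEmpty ()) {
      return null; }               // malformed or conditional mapping
    return new String[] {fields [1], fields [2], fields [3]};
  }

  public static void main (String[] args) {
    String[] m = unconditional (
        "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S");
    System.out.println ("lc=" + m [0] + " tc=" + m [1] + " uc=" + m [2]);
    System.out.println (unconditional (
        "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA"));
  }
}
```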

4.11. Scripts.txt

Scripts.txt gives the script for some characters.

[Scripts.txt] ==
      
  parseSemiFile ("Scripts.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

           if ("ARABIC".equals (fields [1])) {              p.sc = "Arab"; }
      else if ("ARMENIAN".equals (fields [1])) {            p.sc = "Armn"; }
      else if ("BENGALI".equals (fields [1])) {             p.sc = "Beng"; }
      else if ("BOPOMOFO".equals (fields [1])) {            p.sc = "Bopo"; }
      else if ("CANADIAN-ABORIGINAL".equals (fields [1])) { p.sc = "Cans"; }
      else if ("CHEROKEE".equals (fields [1])) {            p.sc = "Cher"; }
      else if ("CYRILLIC".equals (fields [1])) {            p.sc = "Cyrl"; }
      else if ("DEVANAGARI".equals (fields [1])) {          p.sc = "Deva"; }
      else if ("DESERET".equals (fields [1])) {             p.sc = "Dsrt"; }
      else if ("ETHIOPIC".equals (fields [1])) {            p.sc = "Ethi"; }
      else if ("GEORGIAN".equals (fields [1])) {            p.sc = "Geor"; }
      else if ("GOTHIC".equals (fields [1])) {              p.sc = "Goth"; }
      else if ("GREEK".equals (fields [1])) {               p.sc = "Grek"; }
      else if ("GUJARATI".equals (fields [1])) {            p.sc = "Gujr"; }
      else if ("GURMUKHI".equals (fields [1])) {            p.sc = "Guru"; }
      else if ("HANGUL".equals (fields [1])) {              p.sc = "Hang"; }
      else if ("HAN".equals (fields [1])) {                 p.sc = "Hani"; }
      else if ("HEBREW".equals (fields [1])) {              p.sc = "Hebr"; }
      else if ("HIRAGANA".equals (fields [1])) {            p.sc = "Hira"; }
      else if ("OLD-ITALIC".equals (fields [1])) {          p.sc = "Ital"; }
      else if ("KATAKANA".equals (fields [1])) {            p.sc = "Kana"; }
      else if ("KHMER".equals (fields [1])) {               p.sc = "Khmr"; }
      else if ("KANNADA".equals (fields [1])) {             p.sc = "Knda"; }
      else if ("LAO".equals (fields [1])) {                 p.sc = "Laoo"; }
      else if ("LATIN".equals (fields [1])) {               p.sc = "Latn"; }
      else if ("MALAYALAM".equals (fields [1])) {           p.sc = "Mlym"; }
      else if ("MONGOLIAN".equals (fields [1])) {           p.sc = "Mong"; }
      else if ("MYANMAR".equals (fields [1])) {             p.sc = "Mymr"; }
      else if ("OGHAM".equals (fields [1])) {               p.sc = "Ogam"; }
      else if ("ORIYA".equals (fields [1])) {               p.sc = "Orya"; }
      else if ("INHERITED".equals (fields [1])) {           p.sc = "Qaai"; }
      else if ("RUNIC".equals (fields [1])) {               p.sc = "Runr"; }
      else if ("SINHALA".equals (fields [1])) {             p.sc = "Sinh"; }
      else if ("SYRIAC".equals (fields [1])) {              p.sc = "Syrc"; }
      else if ("TAMIL".equals (fields [1])) {               p.sc = "Taml"; }
      else if ("TELUGU".equals (fields [1])) {              p.sc = "Telu"; }
      else if ("THAANA".equals (fields [1])) {              p.sc = "Thaa"; }
      else if ("THAI".equals (fields [1])) {                p.sc = "Thai"; }
      else if ("TIBETAN".equals (fields [1])) {             p.sc = "Tibt"; }
      else if ("YI".equals (fields [1])) {                  p.sc = "Yiii"; }
      else if ("COMMON".equals (fields [1])) {              p.sc = "Zyyy"; }}});

Remaining characters are in the COMMON script:

[Representation of the script property] ==
      
  sc = "Zyyy";
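
The mapping from script names to four-letter codes, together with the COMMON default, can also be expressed as a table lookup. This is a sketch of that alternative, showing only a handful of scripts; the class ScriptCodeDemo is hypothetical and is not how the program is written.

```java
import java.util.HashMap;
import java.util.Map;

public class ScriptCodeDemo {
  // Script name as it appears in Scripts.txt -> four-letter code.
  // Only a few entries are shown; the full table has one per script.
  static final Map<String, String> CODES = new HashMap<String, String> ();
  static {
    CODES.put ("LATIN", "Latn");
    CODES.put ("GREEK", "Grek");
    CODES.put ("CYRILLIC", "Cyrl");
    CODES.put ("CHEROKEE", "Cher");
    CODES.put ("HAN", "Hani");
    CODES.put ("INHERITED", "Qaai");
    CODES.put ("COMMON", "Zyyy");
  }

  /** Returns the code for a script name, or "Zyyy" if the name is not in the table. */
  public static String code (String name) {
    String c = CODES.get (name);
    return c != null ? c : "Zyyy";
  }

  public static void main (String[] args) {
    System.out.println (code ("CHEROKEE"));  // prints Cher
    System.out.println (code ("UNKNOWN"));   // prints Zyyy
  }
}
```

With a table, a misspelled script name shows up as a missing key rather than as a silently skipped branch.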

4.12. DerivedAge.txt

DerivedAge.txt provides the age of all characters:

[DerivedAge.txt] ==
      
  parseSemiFile ("DerivedAge.txt",
                 new Loader () {
    public void process (CodePoint p, String[] fields) {

      p.age = fields [1]; }});

4.13. Generating XML

[XML Generator] ==
      
  public void toXML () 
      throws SAXException, javax.xml.transform.TransformerConfigurationException {
    TransformerFactory tfactory = TransformerFactory.newInstance ();
    
    if (tfactory.getFeature (SAXSource.FEATURE)) {
      SAXTransformerFactory sfactory = (SAXTransformerFactory) tfactory;

      // no transform; we just want a serializer
      TransformerHandler handler = sfactory.newTransformerHandler ();

      // set the output properties before attaching the result: some
      // implementations configure the serializer when setResult is called
      Transformer transformer = handler.getTransformer ();
      transformer.setOutputProperty (OutputKeys.INDENT, "yes");
      transformer.setOutputProperty (OutputKeys.DOCTYPE_PUBLIC,
                   "-//Unicode Consortium//DTD Unidata V1.0//EN");
      transformer.setOutputProperty (OutputKeys.DOCTYPE_SYSTEM, "unidata.dtd");
      transformer.setOutputProperty (OutputKeys.STANDALONE, "no");

      handler.setResult (new StreamResult (System.out));

      ContentHandlerPlus ch = new ContentHandlerPlus (handler);

      ch.startDocument (); {
        AttributesImpl at = new AttributesImpl ();
        at.addAttribute ("", "chars", "chars", "CDATA", "0000-10FFFF");
        ch.startElement ("collection", at); {
          for (Iterator it = codePoints.values ().iterator (); it.hasNext (); ) {
            ((CodePoint) it.next ()).toXML (ch); }
          ch.endElement ("collection"); }
        ch.endDocument (); }}

    else {
      System.err.println ("SAXSource.FEATURE not supported"); }
  }
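
The serialization technique itself, an identity TransformerHandler fed with SAX events, can be tried on a two-element document. This sketch calls the standard javax.xml.transform and org.xml.sax interfaces directly instead of going through ContentHandlerPlus, and writes to a string rather than to System.out; the class SerializerDemo is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SerializerDemo {
  /** Serializes a collection holding a single char element. */
  public static String serialize () {
    try {
      SAXTransformerFactory factory
          = (SAXTransformerFactory) TransformerFactory.newInstance ();

      // no stylesheet: the handler only serializes the events it receives
      TransformerHandler handler = factory.newTransformerHandler ();
      handler.getTransformer ().setOutputProperty (OutputKeys.INDENT, "yes");

      StringWriter out = new StringWriter ();
      handler.setResult (new StreamResult (out));

      handler.startDocument ();
      AttributesImpl at = new AttributesImpl ();
      at.addAttribute ("", "chars", "chars", "CDATA", "0000-10FFFF");
      handler.startElement ("", "collection", "collection", at);

      at = new AttributesImpl ();
      at.addAttribute ("", "cp", "cp", "CDATA", "0041");
      at.addAttribute ("", "gc", "gc", "CDATA", "Lu");
      handler.startElement ("", "char", "char", at);
      handler.endElement ("", "char", "char");

      handler.endElement ("", "collection", "collection");
      handler.endDocument ();
      return out.toString (); }
    catch (Exception e) {
      throw new IllegalStateException (e); }
  }

  public static void main (String[] args) {
    System.out.println (serialize ());
  }
}
```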

4.14. Complete program

package com.adobe.aots.unicode.ucd;

import java.util.Iterator;

import com.adobe.aots.util.ContentHandlerPlus;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class ToXML {
  
[methods: 1, 2, 3, 4, 5, 6, 7, 8, 9]

  public void doMain (String[] args)  
      throws Exception {

    boolean unihan = false;

    for (int i = 0; i < args.length; i++) {
      if ("-unihan".equals (args [i])) {
        unihan = true; }}

    
[Process UnicodeData.txt]
    
[Process NamesList.txt]
    
[Process other files: 1, 2, 3, 4, 5, 6, 7, 8]

    if (unihan) {
      
[Process Unihan.txt] }
    
    toXML ();
  }

  public static void main (String[] args)
      throws Exception {
    new ToXML().doMain (args);
  }
}

5. Using the XML version

This section shows how some operations on the UCD can be performed using the XML representation.

Our first example selects all the cased letters (char elements with a gc attribute equal to Lu, Ll or Lt), and displays their code point, general category, case mappings and names.

[Display cased letters] ==
      

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match='collection'>
    <xsl:apply-templates select='char[@gc="Lu"]'/>
    <xsl:apply-templates select='char[@gc="Ll"]'/>
    <xsl:apply-templates select='char[@gc="Lt"]'/>
  </xsl:template>

  <xsl:template match="char">
    <xsl:value-of select="@cp"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="@gc"/>
    <xsl:text> (lc: </xsl:text>
    <xsl:value-of select='@lc'/>
    <xsl:text>) (uc: </xsl:text>
    <xsl:value-of select='@uc'/>
    <xsl:text>) (tc: </xsl:text>
    <xsl:value-of select='@tc'/>
    <xsl:text>) </xsl:text>
    <xsl:value-of select='@na'/>

    <xsl:text>&#x0a;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Our second example displays the count of characters for each bidirectional category:

[Count characters in each bidi category] ==
      

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match='collection'>
any:  <xsl:value-of select='count(char)'/> chars
AL:   <xsl:value-of select='count(char[@bc="AL"])'/> chars
AN:   <xsl:value-of select='count(char[@bc="AN"])'/> chars
B :   <xsl:value-of select='count(char[@bc="B"])'/> chars
BN:   <xsl:value-of select='count(char[@bc="BN"])'/> chars
CS:   <xsl:value-of select='count(char[@bc="CS"])'/> chars
EN:   <xsl:value-of select='count(char[@bc="EN"])'/> chars
ES:   <xsl:value-of select='count(char[@bc="ES"])'/> chars
ET:   <xsl:value-of select='count(char[@bc="ET"])'/> chars
L:    <xsl:value-of select='count(char[@bc="L"])'/> chars
LRE:  <xsl:value-of select='count(char[@bc="LRE"])'/> chars
LRO:  <xsl:value-of select='count(char[@bc="LRO"])'/> chars
NSM:  <xsl:value-of select='count(char[@bc="NSM"])'/> chars
ON:   <xsl:value-of select='count(char[@bc="ON"])'/> chars
PDF:  <xsl:value-of select='count(char[@bc="PDF"])'/> chars
R:    <xsl:value-of select='count(char[@bc="R"])'/> chars
RLE:  <xsl:value-of select='count(char[@bc="RLE"])'/> chars
RLO:  <xsl:value-of select='count(char[@bc="RLO"])'/> chars
S:    <xsl:value-of select='count(char[@bc="S"])'/> chars
WS:   <xsl:value-of select='count(char[@bc="WS"])'/> chars
</xsl:template>

</xsl:stylesheet>
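
The same counts can be obtained from Java with the standard javax.xml.xpath API. The sketch below runs on a small inline document, not on the generated file, and the class BidiCountDemo is hypothetical. Because no DTD is attached to the sample, every bc attribute is spelled out; in the generated file the value L is omitted from char elements and is expected to come from the DTD default.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class BidiCountDemo {
  // A tiny stand-in for the generated document.
  static final String SAMPLE =
      "<collection chars='0000-10FFFF'>"
    + "<char cp='0030' bc='EN'/>"
    + "<char cp='0031' bc='EN'/>"
    + "<char cp='0041' bc='L'/>"
    + "<char cp='0627' bc='AL'/>"
    + "</collection>";

  /** Number of char elements in SAMPLE with the given bidi category. */
  public static int count (String bc) {
    try {
      XPath xpath = XPathFactory.newInstance ().newXPath ();
      Double n = (Double) xpath.evaluate (
          "count(/collection/char[@bc='" + bc + "'])",
          new InputSource (new StringReader (SAMPLE)),
          XPathConstants.NUMBER);
      return n.intValue (); }
    catch (XPathExpressionException e) {
      throw new IllegalStateException (e); }
  }

  public static void main (String[] args) {
    System.out.println ("EN: " + count ("EN") + " chars");  // prints EN: 2 chars
    System.out.println ("L:  " + count ("L") + " chars");   // prints L:  1 chars
  }
}
```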

    

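Our final example lists the pairs of characters such that one decomposes into the other, with a decomposition type “com”, “font” or “nb”. The name of each character is produced. The test string-length(@dm)=4 keeps only the decompositions to a single four-digit code point, so that @dm can be used directly as a key value. Note the use of an XSLT key to quickly access the target character of a decomposition.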

[List some compatibility pairs] ==
      

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:key name='char' match='char' use='@cp'/>

  <xsl:template match='collection'>
    <xsl:apply-templates select='char[@dt="com"][string-length(@dm)=4]'/>
    <xsl:apply-templates select='char[@dt="font"][string-length(@dm)=4]'/>
    <xsl:apply-templates select='char[@dt="nb"][string-length(@dm)=4]'/>
  </xsl:template>

  <xsl:template match="char">
    <xsl:value-of select="@cp"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="@na"/>

    <xsl:text> dt=</xsl:text>
    <xsl:value-of select="@dt"/>
    <xsl:text> dm=</xsl:text>
    <xsl:value-of select='@dm'/>
    <xsl:text> </xsl:text>
    <xsl:value-of select='key("char",@dm)/@na'/>
    <xsl:text>&#x0a;</xsl:text>
  </xsl:template>


</xsl:stylesheet>