Re: On the possibility of guidance code points for the Private Use Area

From: Eric Muller (emuller@Adobe.COM)
Date: Mon Apr 23 2001 - 18:10:04 EDT

Next message: Edward Cherlin: "RE: ASCII adequacy (was: RE: benefits of unicode)"
Previous message: Peter_Constable@sil.org: "Re: On the possibility of guidance code points for the PrivateUse Area"
In reply to: Peter_Constable@sil.org: "Re: On the possibility of guidance code points for the Private Use Area"
Next in thread: Timothy Partridge: "Re: On the possibility of guidance code points for the Private Use Area"

I have attached a proposal to describe the meaning of PUA characters in a
document. The idea is that this description would be to the characters as the
DTD is to the XML elements (but it also applies to non-XML documents).

Eric.

Formalizing the Unicode Private Use Area

Table of Contents

1.	Motivation
2.	Terminology
3.	Requirements
4.	Overall Structure
5.	Characters
6.	Collections
7.	Related work

1. Motivation

The Unicode standard is a constantly evolving character collection, and there may be times when one needs a character that is not yet part of the standard. Unicode recognizes this situation:

[p23] A contiguous area of codes has been set aside for private use. Characters in this area will never be defined by the Unicode Standard. These codes can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation.

Indeed, a document that uses PUA code points does not have a meaning by itself, just like a document where the encoding is not specified has no meaning by itself.

First and foremost, this note provides a means to build those agreements. The idea is that a document could specify semantics for the Private Use Area characters it contains, at the same level as Unicode specifies semantics for the assigned characters (i.e. those that are part of the Unicode repertoire). As in Unicode, part of the semantics is formalized and represented in a machine-readable form, and part of it is informal.

2. Terminology

A gaiji character is a character that is not part of the Unicode repertoire and is encoded in the PUA. In this document, there is no intention to restrict gaiji characters to ideographs. Of course, this notion is relative to a particular version of the Unicode standard.

3. Requirements

The design goals are:

Define a syntax to describe the formal part of the Unicode semantics of characters. By describing gaiji characters in that way, they can become full participants in Unicode processing. For example, one could indicate that a new character COMBINING REVERSE SOLIDUS OVERLAY is in combining class 1, and processors that deal with combinations would do the right thing on this character. (By the way, this is not an innocent example: this character was accepted for inclusion in June 1999, but will be part of Unicode only after version 3.0; so right now, it's a gaiji.)
Make that syntax extensible, so that additional properties can be attached. For example, there could be indications for Input Method Editors on how they should let the user input those characters.
Define a syntax to organize character descriptions in collections and to combine collections. Consider the case where Alice's document uses one collection of private characters, Bob's document uses another one, and Charles creates a document that combines Alice's and Bob's documents. While this example may seem contrived, replace persons by machines and it suddenly looks a lot more real.
Make that syntax extensible, so that additional properties can be attached. For example, a collection could indicate where an appropriate font could be found.
Make these descriptions human-legible and easy to process by programs. Practically this means that descriptions can be built using simple tools such as text editors, yet they can be incorporated in sophisticated document processing systems.
Allow two collections to overlap (i.e. to assign the same value to one code point) to avoid central administration, and provide a mechanism to reconcile them. What if Alice and Bob both used U+E732 in their collections?
Support the naming and referencing of character collections, in particular over the Internet. Clearly, there will be collections of gaiji characters that will be used in a number of documents. Repeating all the character descriptions in all the documents would be a logistical nightmare. In addition, it would make it difficult to know if the code value U+E732 represents the same character in two different documents. At least, if both documents reference the same collection (or more precisely, if the code value was assigned by the same subcollection), this guarantee can be given.
Define mechanisms to incorporate or attach references to character collections to documents.

4. Overall Structure

The goals dealing with extensibility, human readability, and machine processing are easily satisfied by using XML. This document describes a DTD.

Open:
Should we go directly to XML Schema instead?

Open:
Usual questions about DTDs: what characters should we use in element names (-, _, camelCase)? Elements or attributes?

Open:
Use namespaces for the extensions?

5. Characters

The unicode-name element encloses the Unicode name of that character. It is not applicable to gaiji characters.

The name element is used for non-Unicode characters.

Exactly one of unicode-name and name must be present.

The unicode-1.0-name element encloses the Unicode 1.0 name of the character, if it exists.

The alternative-names element encloses a set of alternative-name elements, which in turn enclose alternative names for this character.

<!ELEMENT unicode-name (#PCDATA)>
<!ELEMENT unicode-1.0-name (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT alternative-names (alternative-name)*>
<!ELEMENT alternative-name (#PCDATA)>

The code element encloses the Unicode code value of the character, using the U+xxxx syntax.

The char element contains a single character, which is the character itself.

<!ELEMENT code (#PCDATA)>
<!ELEMENT char (#PCDATA)>
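As an illustration of the U+xxxx syntax used by the code element, here is a minimal Python sketch. The helper names (parse_code, format_code, is_bmp_private_use) are hypothetical and not part of the proposal; accepting 4 to 6 hexadecimal digits is also an assumption, since the text only says "U+xxxx".

```python
import re

# Assumption: 4 to 6 hex digits are accepted; the proposal only says "U+xxxx".
_CODE_RE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_code(text: str) -> int:
    """Turn the content of a code element, e.g. 'U+E000', into an integer."""
    match = _CODE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a U+xxxx code value: {text!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {text!r}")
    return value


def format_code(value: int) -> str:
    """Inverse of parse_code: 0x24 becomes 'U+0024'."""
    return f"U+{value:04X}"


def is_bmp_private_use(value: int) -> bool:
    """True for the BMP Private Use Area, U+E000 through U+F8FF."""
    return 0xE000 <= value <= 0xF8FF
```

A processor could use such a check to decide whether a described character is a gaiji candidate or an assigned Unicode character.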

The cross-references element encloses a set of cross-ref elements. Each cross-ref element contains a code element and a name element for the character which is referenced. The cross-ref element has a role attribute which can take the values inequal or other. The default value for that attribute is other.

<!ELEMENT cross-references (cross-ref)*>
<!ELEMENT cross-ref (code, name)>
<!ATTLIST cross-ref role (inequal | other) "other">

The compatibility-decomposition element contains a sequence of characters into which the character being described can be compatibly decomposed.

The canonical-decomposition element contains the characters into which the character being described is canonically decomposed.

<!ELEMENT compatibility-decomposition (#PCDATA)>
<!ELEMENT canonical-decomposition (#PCDATA)>

case can have the values UPPERCASE, TitleCase or lowercase.

combining-class encloses the combining class (in its numeric form).

directionality encloses the directionality property.

jamo-short-name encloses the Jamo short name property. It can be present only for Unicode conjoining Hangul jamo characters.

general-category encloses the general category property.

numeric-values is present if the character is a number. It encloses the numeric value as recorded in section 4.6. In addition, the attribute value is the numeric value represented as a decimal number, without ',' to separate the character groups. The attribute decimal can take the values yes or no.

mirrored is present for those characters that have the mirrored property.

mathematical is present for characters that have the mathematical property.

Open:
Look at the other properties on the Unicode cdrom in proplist.txt.

<!ELEMENT case (#PCDATA)>
<!ELEMENT combining-class (#PCDATA)>
<!ELEMENT directionality (#PCDATA)>
<!ELEMENT jamo-short-name (#PCDATA)>
<!ELEMENT general-category (#PCDATA)>
<!ELEMENT numeric-values (#PCDATA)>
<!ATTLIST numeric-values
  value CDATA #IMPLIED
  decimal CDATA #IMPLIED>
<!ELEMENT mirrored EMPTY>
<!ELEMENT mathematical EMPTY>

The informative-note element contains an informative note.

Open:
What should be the DTD in there? A fragment of docbook? The itsy bitsy dtd?

<!ELEMENT informative-note ?>

Finally, these elements are assembled in a character element:

<!ELEMENT character ((unicode-name | name), unicode-1.0-name?,
  alternative-names?, code, char, cross-references?,
  compatibility-decomposition?, canonical-decomposition?, case?,
  combining-class?, directionality?, jamo-short-name?,
  general-category?, numeric-values?, mirrored?, mathematical?,
  informative-note?)>

Here are some examples:

<character>
  <unicode-name>LATIN CAPITAL LETTER A</unicode-name>
  <code>U+0041</code>
  <char>A</char>
  <directionality>LR</directionality>
</character>

<character>
  <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
  <code>U+E000</code>
  <char>&#xE000;</char>
  <combining-class>1</combining-class>
</character>

<character>
  <unicode-name>DOLLAR SIGN</unicode-name>
  <alternative-names>
    <alternative-name>milreis</alternative-name>
    <alternative-name>escudo</alternative-name>
  </alternative-names>
  <code>U+0024</code>
  <char>$</char>
  <cross-references>
    <cross-ref><code>U+00A4</code><name>currency sign</name></cross-ref>
  </cross-references>
  <directionality>LR</directionality>
  <informative-note>Glyph may have one or two vertical bars. Other
  currency symbol characters: U+20A0 ₠ to U+20AF ₯</informative-note>
</character>
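To show how a processor might read such a description, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It enforces the rule that exactly one of unicode-name and name is present. The function name character_name and the sample document are illustrative assumptions, and the sketch does not validate against the full DTD.

```python
import xml.etree.ElementTree as ET


def character_name(character: ET.Element) -> str:
    """Return the name of a character element.

    Raises ValueError unless exactly one of the direct children
    unicode-name and name is present.
    """
    unicode_name = character.find("unicode-name")
    private_name = character.find("name")
    if (unicode_name is None) == (private_name is None):
        raise ValueError("exactly one of unicode-name and name must be present")
    chosen = unicode_name if unicode_name is not None else private_name
    return (chosen.text or "").strip()


# Sample gaiji description; the char content is written as a character
# reference so that the private use code point survives in plain text.
SAMPLE = """<character>
  <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
  <code>U+E000</code>
  <char>&#xE000;</char>
  <combining-class>1</combining-class>
</character>"""

element = ET.fromstring(SAMPLE)
name = character_name(element)
# A missing combining-class is treated as 0 here, which is an assumption.
combining_class = int(element.findtext("combining-class", default="0"))
print(name, combining_class)
```

Note that find only looks at direct children, so a name nested inside a cross-ref does not interfere with the check.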

6. Collections

Collections are formed by grouping characters and by combining collections. A collection is well-formed iff:

No two characters have the same name, where the name of a character is defined as the value of the unicode-name element or the value of the name element, whichever is present.
No two characters have the same code.
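The two conditions above can be checked mechanically. The following Python sketch is one possible way to do it; the Character tuple and the function well_formedness_errors are hypothetical names, and a character is reduced to its name and code for the purpose of the check.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Character(NamedTuple):
    """Reduced view of a character element: its name and its code value."""
    name: str
    code: int


def well_formedness_errors(characters: Iterable[Character]) -> List[str]:
    """List every violation of the two well-formedness conditions."""
    # A collection is a set, so identical descriptions count only once.
    distinct = set(characters)
    name_counts = Counter(c.name for c in distinct)
    code_counts = Counter(c.code for c in distinct)
    errors = [
        f"duplicate name: {name}"
        for name, count in sorted(name_counts.items())
        if count > 1
    ]
    errors += [
        f"duplicate code: U+{code:04X}"
        for code, count in sorted(code_counts.items())
        if count > 1
    ]
    return errors


clash = [
    Character("COMBINING REVERSE SOLIDUS OVERLAY", 0xE000),
    Character("Adobe Logo", 0xE000),
]
print(well_formedness_errors(clash))
```

An empty list means the collection is well-formed.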

An enumerated-collection is just a set of character elements.

<!ELEMENT enumerated-collection (character)*>

A ref-collection references an external collection (that is, external to the resource in which this reference occurs). It must have a system identifier, a URI, which may be used to retrieve the referenced collection. Relative URIs are relative to the location of the resource within which the ref-collection occurs. In addition, there may be a public identifier. A processor attempting to retrieve the referenced collection may use the public identifier to try to generate an alternative URI. If the processor is unable to do so, it must use the URI specified in the system identifier.

<!ELEMENT ref-collection EMPTY>
<!ATTLIST ref-collection
  systemid CDATA #REQUIRED
  publicid CDATA #IMPLIED>
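The retrieval rule for ref-collection could be sketched as follows in Python, using urljoin from the standard library for relative URIs. The function locate_collection and the idea of a catalog dictionary that maps public identifiers to alternative URIs are assumptions made for illustration; the proposal does not say how a processor derives an alternative URI. The example.org addresses are placeholders.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin


def locate_collection(
    base_uri: str,
    system_id: str,
    public_id: Optional[str] = None,
    catalog: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the URI from which to retrieve a referenced collection.

    base_uri is the location of the resource containing the ref-collection.
    catalog is a hypothetical local table from public identifiers to URIs.
    """
    if public_id is not None and catalog is not None and public_id in catalog:
        # The processor was able to generate an alternative URI.
        return catalog[public_id]
    # Otherwise the system identifier is used, resolved against the base.
    return urljoin(base_uri, system_id)


base = "http://example.org/chc/eric.chc"
relative = locate_collection(base, "adobecorp.chc")
absolute = locate_collection(base, "ftp://ftp.unicode.org/data/chc/v3.0")
print(relative)
print(absolute)
```

An absolute system identifier is returned unchanged, while a relative one is resolved against the location of the referencing resource.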

A union-collection groups the characters of multiple collections. If the set-wise union of those collections is not well-formed, the conflicting characters of the later collections are removed from the union.

<!ELEMENT union-collection (%collection;)*>
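One possible reading of the union rule, sketched in Python: collections are processed in order, and a character is dropped when its name or code is already taken by a different character from an earlier collection. The representation of a character as a (name, code) pair and the function name union are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Char = Tuple[str, int]  # (name, code value)


def union(*collections: Sequence[Char]) -> List[Char]:
    """Ordered union that keeps the result well-formed.

    Assumption: a later character conflicting with an earlier one
    (same name or same code, but not identical) is the one removed.
    """
    result: List[Char] = []
    seen = set()
    names = set()
    codes = set()
    for collection in collections:
        for char in collection:
            if char in seen:
                continue  # identical description, nothing to add
            name, code = char
            if name in names or code in codes:
                continue  # conflict: the later character is dropped
            result.append(char)
            seen.add(char)
            names.add(name)
            codes.add(code)
    return result


first = [("LATIN CAPITAL LETTER A", 0x0041),
         ("COMBINING REVERSE SOLIDUS OVERLAY", 0xE000)]
second = [("LATIN CAPITAL LETTER A", 0x0041),
          ("Adobe Logo", 0xE000)]
print(union(first, second))
```

With these inputs the shared character appears once and the second claimant of U+E000 is left out, so the order of the operands matters.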

A subsetted-collection removes some of the characters of a base collection. The characters to remove are identified by their code value.

<!ELEMENT subsetted-collection (%collection;, code*)>
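Subsetting is the simplest of the operations. A short Python sketch, again with characters reduced to hypothetical (name, code) pairs and an illustrative function name:

```python
from typing import Iterable, List, Sequence, Tuple

Char = Tuple[str, int]  # (name, code value)


def subset(base: Sequence[Char], codes_to_remove: Iterable[int]) -> List[Char]:
    """Return the base collection without the characters at the given codes."""
    removed = set(codes_to_remove)
    return [(name, code) for name, code in base if code not in removed]


base = [("LATIN CAPITAL LETTER A", 0x0041),
        ("COMBINING REVERSE SOLIDUS OVERLAY", 0xE000)]
print(subset(base, [0xE000]))
```

Removing characters cannot introduce a duplicate name or code, so a subset of a well-formed collection stays well-formed.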

A remapped-collection reassigns new code points to the characters of a base collection.

<!ELEMENT remapped-collection (%collection;, %map;)>

A simple-map just lists pairs of code points. Characters which are not listed as the source of a pair are mapped to their original code point. No two pairs should map from the same character. The map should not assign two different characters to the same code point.

<!ELEMENT simple-map (replace)*>
<!ELEMENT replace EMPTY>
<!ATTLIST replace
  from CDATA #REQUIRED
  to CDATA #REQUIRED>
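The constraints on a simple-map can be enforced while applying it. The following Python sketch assumes characters are (name, code) pairs, that the base collection is already well-formed, and that each replace element has been parsed into a (from, to) pair of integers; the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Char = Tuple[str, int]  # (name, code value)


def apply_simple_map(base: Sequence[Char],
                     pairs: Iterable[Tuple[int, int]]) -> List[Char]:
    """Remap code points of a well-formed base collection."""
    mapping: Dict[int, int] = {}
    for source, target in pairs:
        if source in mapping:
            raise ValueError(f"two pairs map from U+{source:04X}")
        mapping[source] = target
    # Characters not listed as a source keep their original code point.
    remapped = [(name, mapping.get(code, code)) for name, code in base]
    codes = [code for _, code in remapped]
    if len(codes) != len(set(codes)):
        raise ValueError("two different characters mapped to the same code point")
    return remapped


base = [("X", 0xE000), ("Y", 0xE001)]
print(apply_simple_map(base, [(0xE000, 0xE002)]))
```

Mapping U+E000 onto U+E001 in this example would be rejected, because U+E001 is already occupied by an unmapped character.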

A shift-map adds an offset (positive or negative) to each code point. By construction, it preserves well-formedness.

<!ELEMENT shift-map (#PCDATA)> <!-- really, an integer -->

These are the only maps:

<!ENTITY % map "(simple-map|shift-map)">
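For comparison with the simple-map, a shift-map could be applied as in this Python sketch. Characters are again hypothetical (name, code) pairs, and the range check against the Unicode code space is an added assumption that the proposal does not mention.

```python
from typing import List, Sequence, Tuple

Char = Tuple[str, int]  # (name, code value)


def apply_shift_map(base: Sequence[Char], offset: int) -> List[Char]:
    """Add the same offset, positive or negative, to every code point."""
    shifted = []
    for name, code in base:
        new_code = code + offset
        if not 0 <= new_code <= 0x10FFFF:
            raise ValueError(f"U+{code:04X} shifted outside the code space")
        shifted.append((name, new_code))
    return shifted


base = [("X", 0xE000), ("Y", 0xE001)]
print(apply_shift_map(base, 0x100))
```

Adding a constant never makes two distinct code points equal, which is why no collision check is needed here.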

And this completes the means of constructing collections:

<!ENTITY % collection "(enumerated-collection | union-collection | ref-collection | subsetted-collection | remapped-collection)">

Here are some examples.

<collection>
  <union-collection>
    <ref-collection
      publicid="-//Unicode Consortium//CHC Unicode v3.0"
      systemid="ftp://ftp.unicode.org/data/chc/v3.0"/>

    <enumerated-collection>
      <character>
        <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
        <code>U+E000</code>
        <char>&#xE000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Here is another collection that uses the same PUA code point, but defines it differently:

<collection>
  <union-collection>
    <ref-collection
      publicid="-//Unicode Consortium//CHC Unicode v3.0"
      systemid="ftp://ftp.unicode.org/data/chc/v3.0"/>

    <enumerated-collection>
      <character>
        <name>Adobe Logo</name>
        <code>U+E000</code>
        <char>&#xE000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Let's assume that our first collection is accessible via the URI http://atm.corp.adobe.com/chc/eric.chc and the second is accessible via the URI http://oranda.corp.adobe.com/chc/adobecorp.chc. Just forming the union of those collections will drop one of the two PUA characters (the one in the collection mentioned second). The following collection can be built for documents that need both PUA characters:

In documents that use this collection, the code point U+E000 refers to the Adobe Logo character, and the code point U+E001 refers to the COMBINING REVERSE SOLIDUS OVERLAY character.
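The reconciliation can be simulated with a small Python sketch: the private character of the first collection is moved to U+E001 before the union is formed. Only the two private characters are modelled, as hypothetical (name, code) pairs; the helper names and the choice of moving the first collection rather than the second are assumptions consistent with the result stated above.

```python
from typing import Dict, List, Sequence, Tuple

Char = Tuple[str, int]  # (name, code value)


def remap(base: Sequence[Char], mapping: Dict[int, int]) -> List[Char]:
    """Reassign code points; unlisted code points are left unchanged."""
    return [(name, mapping.get(code, code)) for name, code in base]


def union(*collections: Sequence[Char]) -> List[Char]:
    """Ordered union; a later character reusing a name or code is dropped."""
    result: List[Char] = []
    names, codes = set(), set()
    for collection in collections:
        for name, code in collection:
            if name in names or code in codes:
                continue
            result.append((name, code))
            names.add(name)
            codes.add(code)
    return result


eric = [("COMBINING REVERSE SOLIDUS OVERLAY", 0xE000)]
adobecorp = [("Adobe Logo", 0xE000)]

# A plain union loses the character of the collection mentioned second.
naive = union(eric, adobecorp)

# Remapping the first collection before the union keeps both characters.
combined = union(adobecorp, remap(eric, {0xE000: 0xE001}))
by_code = {code: name for name, code in combined}
print(by_code)
```

After the remapping, U+E000 resolves to Adobe Logo and U+E001 to COMBINING REVERSE SOLIDUS OVERLAY, matching the description above.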

7. Related work

The first source of inspiration is the XML world. In an XML document, the element names that are used have no particular meaning by themselves, just like the PUA code points have no meaning. But in the XML world, this is the norm rather than the exception, and mechanisms have been designed to cope with that. In fact, these were a major source of inspiration: DTDs and XML schemas are similar to character collections, namespaces correspond to the collection bases, and the collection naming and referencing is based on DTD naming and referencing.

The W3C NOTE A Notation for Character Collections for the WWW by Martin Dürst is an XML DTD to describe sets of character code values. The main objective is to be able to answer the question "Is this character code in this collection?". Particular attention is paid to support efficient implementation when the set descriptions are resources on the network. While this is useful when the sets are made of standard characters, it's really not enough to deal with private use characters, as it does not attach a meaning to them.

The ConScript Unicode Registry by John Cowan and Michael Everson is a registry of Private Use Area uses. The goal of this effort is really to have a centralized allocation of the private use area. It does not attempt to record semantics of the characters.


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:16 EDT