L2/04-050

Ideograph Variation Selector and Variation Collection Identifier

Summary:	Proposal to reserve an Ideograph Variation Selector block in plane 14 for registration.
Version:	0.8
Last Updated:	2004-1-30
Editor:	Hideki Hiura
Authors:	Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
Key Contributors:	Rick McGowan, Kenneth Whistler, Richard Cook, Tom Bishop, Michel Suignard, Takayuki K Sato
Latest Version:	0.8 2004-1-30 http://www.openi18n.org/spec/ivs/
Older Versions:	0.7 2003-10-31 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.6 2003-09-20 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.5 2003-09-17 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.4 2003-09-01 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.3 2003-08-28 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
	0.2 1998-07-29 L2/98-??? Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
	0.1 1997-12-01 L2/97-260 Hideki Hiura, Tatsuo Kobayashi
Related Contributions:	2003-08-23 L2/03-293 Eric Muller, Ken Lunde
Feedback:	ivs-feedback@openi18n.org (This address is only available during the open feedback period.)
Discussion List:	ivs@unicode.org (Only subscribers can post. To subscribe, send an empty email to ivs-ctl@unicode.org and follow the instructions emailed back.)
Namespace:	http://www.openi18n.org/spec/ivs
Status:	This document is a proposed update of a part of Unicode 4.0 Standard, with a proposed addition. This is an unstable document and may not be used as reference material or cited as a normative reference by other specifications.

1. Introduction

Although the concept of a character is clearly distinct from the concept of its glyphic representation, in natural language processing the same text stream often has to be handled in different operational domains, depending on the field of operation. Two representative domains are as follows:

Character/contents-based processing domain:

Search and retrieval, sorting, natural language processing, machine translation.

Appearance-based processing domain:

A person's name, a place name, a geographical dictionary, a biographical dictionary, historical/geographical description.

Different methodologies apply to different fields. However, the appearance-based processing domain is often misunderstood as a simple font typeface issue within the character-based processing domain. The requirements of the appearance-based processing domain cannot be satisfactorily fulfilled by ordinary font typeface settings in a higher-level protocol and, even worse, there is still strong demand for appearance-based processing in plain text. The Variation Selectors of the Unicode Standard are a great step toward solving the issues of the appearance-based processing domain. However, because variations of Han ideographs are used in diverse ways as common practice, the requirements of this domain for Han ideographs call for further enhancement of the Unicode Standard.

1.1 Problems and Requirements in use of Variation Selector for Ideographs

To fulfill all the requirements of the appearance-based processing domain for ideographs collectively with the current Variation Selectors, the following problems and requirements need to be resolved:

If we have to go with a single set of variations, as currently defined in the Unicode 4.0 Standard, there needs to be a single comprehensive and definitive definition of the variation set and base characters, maintained as a standard, that serves everyone who needs to resolve problems in the appearance-based processing domain. This includes national bodies that use ideographs and include them in their national standards, as well as all the different research and study purposes.
On the other hand, a huge comprehensive variation collection does not suit those who need only a smaller, target-specific but efficient set of variations, because of system cost and effectiveness in both data processing and rendering; too much is as bad as too little.
There have been several attempts in academia and in national standards to address the issues of the appearance-based processing domain, and these studies show that developing such a comprehensive one-size-fits-all set is, if not impossible, too costly in practice.
The studies also show that taxonomies of ideograph variations are diverse because of their historical and geographical backgrounds. A taxonomy based on one particular set of historical documents is not necessarily suitable for other historical documents written in a different geographic region and/or a different era.
Requirements vary among the parties that need their own variations, and they often conflict.

To resolve the problems and requirements above, it is essential to introduce a new mechanism into the Unicode Standard.

1.3 Definition

Variation Collection

2. Ideograph Variation Selector for Registration

2.1 Ideograph variation selector

An ideograph variation selector, proposed to be allocated in plane 14, can change the visual appearance of the preceding base character, as an existing variation selector does. However, sequences involving these characters are not defined in the file StandardizedVariants.txt in the Unicode Character Database. Instead, the meaning of such a variation sequence is defined by an outside agreement between producers and consumers, called an ideograph variation registry.

2.2 Ideograph variation registry

The ideograph variation registry consists of two pieces of information:

Variation sequence (U+XXXX U+IVSNNN)

The registrant's unique ID (such as a name and URI)

For example, these two pieces of information can be maintained in two files:

                                                                               
        File 1 contents:
                Adobe;Glyph Collection A;http://www.adobe.com/...
                JustSystem;Glyph Collection N;http://www.justsystem.co.jp/...
                                                                               
        File 2 contents:
                U+4E00 U+IVS17;JustSystem
                U+4E00 U+IVS18;JustSystem
                U+4E00 U+IVS19;Adobe
                U+4E01 U+IVS17;Adobe
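
As a minimal sketch of how a consumer might read these two files, the following Python assumes the semicolon-separated layout shown above. The function names and the dictionary layout are illustrative and are not defined by this proposal. `U+IVSnn` is kept as an opaque token, because no code points have been assigned to the selectors yet.

```python
def parse_registrants(text):
    """Parse File 1: one 'registrant;collection name;URI' entry per line."""
    registrants = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, collection, uri = line.split(";", 2)
        registrants[name] = {"collection": collection, "uri": uri}
    return registrants


def parse_sequences(text):
    """Parse File 2: one 'U+XXXX U+IVSnn;registrant' entry per line."""
    sequences = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        sequence, registrant = line.split(";", 1)
        base, selector = sequence.split()
        sequences[(base, selector)] = registrant
    return sequences


# The sample data is copied from the example files above.
FILE_1 = """\
Adobe;Glyph Collection A;http://www.adobe.com/...
JustSystem;Glyph Collection N;http://www.justsystem.co.jp/...
"""

FILE_2 = """\
U+4E00 U+IVS17;JustSystem
U+4E00 U+IVS18;JustSystem
U+4E00 U+IVS19;Adobe
U+4E01 U+IVS17;Adobe
"""

registrants = parse_registrants(FILE_1)
sequences = parse_sequences(FILE_2)

# Resolve a sequence to the collection of the registrant that owns it.
owner = sequences[("U+4E00", "U+IVS19")]
print(owner, registrants[owner]["collection"])
```

Keying the second table on the (base, selector) pair reflects that the same selector can appear after different base characters with different registrants.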

2.2 Ideograph variation Registration Authority

The Registration Authority acts as an arbitrator for namespace management and as the central directory for finding registered Ideograph Variation Selector sequences.
The registration authority guarantees only that, for any given Unicode Han character NNNN, each sequence NNNN + IVSn is registered only once. That is, uniqueness of the *sequence* is guaranteed.
The registration authority does not test or guarantee that two sequences S1 and S2 have distinct or distinguishable glyphs.
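
The uniqueness rule can be illustrated with a small in-memory sketch. The class and method names below are hypothetical and are not part of this proposal; the sketch checks only that a (base character, selector) pair is not registered twice and, like the registration authority, makes no attempt to compare glyphs.

```python
class RegistrationAuthority:
    """Toy registry that enforces uniqueness of sequences and nothing else."""

    def __init__(self):
        # Maps (base, selector) to the registrant that registered the pair.
        self._owner_by_sequence = {}

    def register(self, base, selector, registrant):
        key = (base, selector)
        if key in self._owner_by_sequence:
            raise ValueError(
                f"{base} {selector} is already registered "
                f"by {self._owner_by_sequence[key]}"
            )
        self._owner_by_sequence[key] = registrant

    def lookup(self, base, selector):
        """Return the registrant of a sequence, or None if it is unregistered."""
        return self._owner_by_sequence.get((base, selector))


authority = RegistrationAuthority()
authority.register("U+4E00", "U+IVS17", "JustSystem")
# The same selector after a different base character is a different sequence.
authority.register("U+4E01", "U+IVS17", "Adobe")

try:
    # A second registration of an existing sequence is rejected.
    authority.register("U+4E00", "U+IVS17", "Adobe")
except ValueError as error:
    print(error)
```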

2.3 Ideograph Variation Registrant

To make sure that piles of "garbage" don't get registered, the registration authority first qualifies the submitters in some way. Once qualified, a submitter can submit any number of sequences. The work of identifying, cataloging, and sequencing is done by the registrant, not by the registration authority. The registration authority just checks to see that the sequences are unique.
The registrant is strongly recommended to maintain, in some form, pictures of the variations it registers and any relevant reference information, so that those variations can be identified by anyone interested.

2.4 Registration Methods

An Ideograph Variation Registrant can choose between two methods of registration:

Submit IVS sequences for an entire collection, so that their uniqueness is guaranteed.

If the variation the registrant wants appears to have been registered already by another registrant, reuse the same sequence, on the condition that the registrant enters into a contract with the registrant of that sequence to guarantee that the two variations are identical. Reusing the sequence preserves uniqueness.

3. Default Ignorable

Any system that does not recognize this variation selector and variation set identifier sequence can ignore it and fall back to its default behavior.
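
A minimal sketch of this fallback follows. It rests on one loud assumption: the selectors occupy U+E0100..U+E01EF, the plane 14 range where VARIATION SELECTOR-17 through VARIATION SELECTOR-256 were encoded in later versions of the Unicode Standard. This proposal itself does not fix a range. The function names and the `supported` parameter are hypothetical.

```python
# Assumption: plane 14 range used for VS17..VS256 in later Unicode versions.
IVS_FIRST = 0xE0100
IVS_LAST = 0xE01EF


def is_ivs(ch):
    """Return True if ch falls in the assumed ideograph variation selector range."""
    return IVS_FIRST <= ord(ch) <= IVS_LAST


def strip_unsupported_selectors(text, supported=frozenset()):
    """Drop every selector whose (base, selector) pair is not in `supported`.

    With the default empty set, all selectors are ignored and only the
    base characters remain, which is the system default behavior.
    """
    out = []
    for ch in text:
        if is_ivs(ch) and (not out or (out[-1], ch) not in supported):
            continue  # unrecognized sequence: ignore the selector
        out.append(ch)
    return "".join(out)


text = "\u4e00\U000E0100"  # U+4E00 followed by the first assumed selector
plain = strip_unsupported_selectors(text)
kept = strip_unsupported_selectors(text, {("\u4e00", "\U000E0100")})
print(len(plain), len(kept))
```

A renderer that does know a sequence would pass it in `supported` and keep the selector; everything else degrades to the base character.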

4. Other Properties

The other properties for Ideograph Variation Selectors are:

 general category Mn
 canonical combining class 0 
 bidirectional category NSM 
 no decomposition mapping
 no numeric value
 not mirrored
 no case mapping
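
The listed values can be compared against a Unicode character database. The sketch below uses Python's standard `unicodedata` module and, as an assumption not made by this proposal, takes U+E0100 (VARIATION SELECTOR-17, the first plane 14 variation selector in later versions of the Unicode Standard) as a representative code point. The results reflect the database bundled with the interpreter, not this draft.

```python
import unicodedata

# Assumption: U+E0100 stands in for a proposed ideograph variation selector.
selector = "\U000E0100"

properties = {
    "general category": unicodedata.category(selector),
    "canonical combining class": unicodedata.combining(selector),
    "bidirectional category": unicodedata.bidirectional(selector),
    "decomposition mapping": unicodedata.decomposition(selector),  # "" means none
    "numeric value": unicodedata.numeric(selector, None),          # None means none
    "mirrored": bool(unicodedata.mirrored(selector)),
    "case mapping": selector.upper() != selector or selector.lower() != selector,
}

for name, value in properties.items():
    print(f"{name}: {value!r}")
```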

5. Base Character

The definition of base characters was one of the biggest problems in earlier versions of the ideograph variation selector proposal, because defining a single set of base characters is itself a controversial question that many researchers study in academia. However, it is no longer essential to identify which characters deserve to be base characters. Allowing multiple variation collections, each identified by a variation set identifier, eliminates the dependency on a single definitive set of base characters and variation collection, which would require absolute accuracy and universal validity.

The base character in a variation sequence using Ideograph Variation Selectors must be an ideograph (CJK UNIFIED IDEOGRAPH * and COMPATIBILITY IDEOGRAPH, including those in plane 2). NON-CHARACTER and RESERVED code points in plane 2 are excluded.
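
One way to approximate this rule is to test character names, as in the sketch below. It assumes that the character names reported by Python's `unicodedata` module are an acceptable stand-in for the categories named above; the function name is hypothetical. Because the bundled database is newer than Unicode 4.0, it also accepts ideographs encoded after this draft. Unassigned and noncharacter code points have no name, so they are rejected.

```python
import unicodedata

# Name prefixes corresponding to the two categories named in the text.
IDEOGRAPH_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH-",
    "CJK COMPATIBILITY IDEOGRAPH-",
)


def is_ideograph_base(ch):
    """Return True if ch may serve as the base of an ideograph variation sequence."""
    # unicodedata.name() returns the default ("") for code points without a name,
    # which covers reserved code points and noncharacters.
    name = unicodedata.name(ch, "")
    return name.startswith(IDEOGRAPH_NAME_PREFIXES)


samples = ["\u4e00", "\uf900", "\U00020000", "\U0002F800", "A", "\U0002FFFE"]
for ch in samples:
    print(f"U+{ord(ch):04X}", is_ideograph_base(ch))
```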

6. Acknowledgements and modification history

[2003-11-5] The idea of allocating variation sequences in a relatively wide code space, in order to eliminate stateful variation collections, was introduced at the IVS ad hoc meeting during UTC.

[2003-8-28] The idea of using the existing variation selectors by changing their properties, instead of creating a separate set of ideograph variation selectors, is a contribution from Rick McGowan and Kenneth Whistler.

[2003-8-29] The idea of using an end tag, instead of leaving the end of an ideograph variation collection unspecified, is a contribution from Rick McGowan.

[2003-8-29] The idea of separating the registration authority discussion from this proposal is a contribution from Mike Ksar.

[2003-9-16] The idea of including the use of URIs as a normative part of this specification, rather than in a separate specification, for the case where the registration authority is skipped, is a contribution from Takayuki K Sato.

[2003-10-23] Kenneth Whistler suggested spelling out clearly that the variation collection identifier using Tag Characters is also a higher-level protocol implemented in plain text.

[2003-10-23] Kenneth Whistler suggested splitting this proposal into two, one for the IVS semantics change only and one for the rest, in order to advance this proposal to UAX/UTS/UTR.

The Ideograph Variation Selector ad hoc group is:

Hideki Hiura, OpenI18N.org, Sun Microsystems
Tatsuo Kobayashi, Justsystem
Yasuo Kida, Apple Computer
Eric Muller, Adobe Systems
Ken Lunde, Adobe Systems
Michel Suignard, Microsoft Corp
John Jenkins, Apple Computer
Rick McGowan, Unicode Consortium
Kenneth Whistler, Sybase
Mike Ksar, Microsoft Corp
Richard Cook, UC Berkeley
Tom Bishop, Wenlin Institute, Inc.
Dirk Meyer, Adobe Systems
John Renner, Adobe Systems
Deborah Goldsmith, Apple Computer
Yasuhira Anan, Microsoft Corp
Cathy Wissink, Microsoft Corp
Jim DeLaHunt, Adobe Systems
Lee Collins, Apple Computer