Although the concept of a character is clearly distinct from the concept of its glyphic representation, in natural language processing the same text stream often requires different operational domains, depending on the field of application. Two representative domains are as follows:
Character-based processing domain: search and retrieval, sorting, natural language processing, machine translation.
Appearance-based processing domain: personal names, place names, geographical dictionaries, biographical dictionaries, historical and geographical descriptions.
Different methodologies apply to different fields. However, the appearance-based processing domain is often misunderstood as a simple font typeface issue within the character-based processing domain. The requirements of the appearance-based processing domain cannot be satisfactorily fulfilled by ordinary font typeface settings in a higher-level protocol and, even worse, there is still strong demand for appearance-based processing in plain text. The variation selectors of the Unicode Standard are a great step toward solving the issues of the appearance-based processing domain. However, the requirements of this domain for Han ideographs call for further enhancement of the Unicode Standard, because the diverse use of variations is common practice.
To resolve the problems and requirements above, it is essential to introduce a new mechanism into the Unicode Standard. If we are to stay with a single set of variations, as currently defined in the Unicode 4.0 standard, there needs to be a single comprehensive and definitive definition of the variation set and base characters, maintained as a standard, that serves everyone who needs to resolve problems in the appearance-based processing domain. That includes national bodies that use ideographs and include them in their national standards, as well as all research and study purposes. On the other hand, a huge comprehensive variation collection does not fit the needs of those who require only a smaller, target-specific but efficient set of variations, because of the system cost and the effectiveness of both data processing and rendering: too much is as bad as too little. There have been several attempts in academia and in national standards to address the issues of the appearance-based processing domain, and these studies show that developing such a comprehensive one-size-fits-all set is impractically costly, if not impossible. The studies also show that the taxonomy of ideograph variations is diverse because of its historical and geographical background. A taxonomy based on one particular set of historical documents is not necessarily suitable for other historical documents written in a different geographic region or in a different era. Requirements vary among the parties that need their own variations, and they often conflict.
Variation Collection
An ideograph variation selector, proposed to be allocated in plane 14, can also change the visual appearance of the preceding base character, as a variation selector does. However, sequences involving these characters are not defined in the file StandardizedVariants.txt in the Unicode Character Database. Instead, the meaning of such a variation sequence is defined by an outside agreement between producers and consumers, called an ideograph variation registry.
The ideograph variation registry consists of two pieces of information.
File 1 contents:
Adobe;Glyph Collection A;http://www.adobe.com/...
JustSystem;Glyph Collection N;http://www.justsystem.co.jp/...

File 2 contents:
U+4E00 U+IVS17;JustSystem
U+4E00 U+IVS18;JustSystem
U+4E00 U+IVS19;Adobe
U+4E01 U+IVS17;Adobe
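As an illustration only, the two registry files shown above could be read as follows. This is a minimal sketch: the field meanings (registrant, collection name, URL in file 1; base character, selector, registrant in file 2) are inferred from the sample and are an assumption, and the function names are hypothetical. The `U+IVSnn` tokens and the truncated URLs are kept exactly as they appear in the sample.

```python
# Sample data copied from the example above. "U+IVS17" and friends are the
# placeholder selector names used in this proposal, not real code points.
FILE1 = """\
Adobe;Glyph Collection A;http://www.adobe.com/...
JustSystem;Glyph Collection N;http://www.justsystem.co.jp/...
"""

FILE2 = """\
U+4E00 U+IVS17;JustSystem
U+4E00 U+IVS18;JustSystem
U+4E00 U+IVS19;Adobe
U+4E01 U+IVS17;Adobe
"""


def parse_collections(text):
    """Map registrant -> (collection name, URL). Field meanings are assumed."""
    collections = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        registrant, name, url = line.split(";", 2)
        collections[registrant] = (name, url)
    return collections


def parse_sequences(text):
    """Map (base, selector) -> registrant. Field meanings are assumed."""
    sequences = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sequence, registrant = line.split(";", 1)
        base, selector = sequence.split()
        sequences[(base, selector)] = registrant
    return sequences


def find_collection(base, selector, collections, sequences):
    """Return (registrant, collection name, URL) for a sequence, or None."""
    registrant = sequences.get((base, selector))
    if registrant is None:
        return None
    name, url = collections[registrant]
    return registrant, name, url


if __name__ == "__main__":
    cols = parse_collections(FILE1)
    seqs = parse_sequences(FILE2)
    print(find_collection("U+4E00", "U+IVS19", cols, seqs))
```

Keeping the two files separate means a consumer can resolve a sequence to its registrant first and fetch the glyph collection description only when needed.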
The registration authority acts as an arbitrator for namespace
management and as the central directory for looking up
registered Ideograph Variation Selectors.
The registration authority only guarantees that for any given Unicode Han
character NNNN, there is only one registered sequence NNNN + IVSn. That is,
uniqueness of the *sequence* is guaranteed.
The registration authority does not test or guarantee that two sequences
S1 and S2 have distinct or distinguishable glyphs.
To make sure that piles of "garbage" don't get registered, the
registration authority first qualifies the submitters in some way. Once
qualified, a submitter can submit any number of sequences. The work of
identifying, cataloging, and sequencing is done by the registrant, not by
the registration authority. The registration authority just checks to see
that the sequences are unique.
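The division of labor described above can be sketched roughly as follows. The class and method names are hypothetical, and how submitters are qualified is left open by the text, so qualification is modeled here as a plain set of names. The only check performed is the one the text names: uniqueness of the sequence.

```python
class SequenceRegistry:
    """Toy model of the registration authority (hypothetical API)."""

    def __init__(self, qualified_submitters):
        # Assumption: qualification is just membership in a known set.
        self.qualified = set(qualified_submitters)
        self.sequences = {}  # (base, selector) -> submitter

    def register(self, submitter, base, selector):
        """Register one sequence; only uniqueness is verified."""
        if submitter not in self.qualified:
            raise PermissionError(f"{submitter} is not a qualified submitter")
        key = (base, selector)
        if key in self.sequences:
            raise ValueError(
                f"{base} + {selector} is already registered "
                f"by {self.sequences[key]}"
            )
        # No glyph comparison happens here: two sequences may well
        # have indistinguishable glyphs.
        self.sequences[key] = submitter


if __name__ == "__main__":
    registry = SequenceRegistry({"Adobe", "JustSystem"})
    registry.register("JustSystem", "U+4E00", "U+IVS17")
    registry.register("Adobe", "U+4E00", "U+IVS19")
    try:
        registry.register("Adobe", "U+4E00", "U+IVS17")
    except ValueError as err:
        print("rejected:", err)
```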
The registrant is strongly recommended to maintain, in some form,
pictures of the variations they register, together with any relevant
reference information, so that those variations can be identified
by anyone interested in them.
An Ideograph Variation Registrant can choose between two methods of registration.
Any system that does not recognize a variation selector and variation set identifier sequence can ignore it and fall back to its default behavior.
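One possible reading of this fallback, as a minimal sketch: a system that does not support these sequences drops the selector and identifier characters and processes the remaining base characters as usual. The code point ranges below are assumptions, since the text above only says "plane 14": U+E0100..U+E01EF (VARIATION SELECTOR-17 through VARIATION SELECTOR-256) for the selectors, and the tag characters U+E0020..U+E007F for the variation set identifier.

```python
# Assumed ranges, not fixed by the proposal text:
SELECTORS = range(0xE0100, 0xE01F0)  # VS17..VS256
TAGS = range(0xE0020, 0xE0080)       # tag characters for the set identifier


def strip_variation_data(text: str) -> str:
    """Return text with selectors and tag characters removed."""
    return "".join(
        ch for ch in text
        if ord(ch) not in SELECTORS and ord(ch) not in TAGS
    )


if __name__ == "__main__":
    sample = "\u4e00\U000E0100\u4e01"
    # Prints the code points that survive stripping.
    print([f"U+{ord(ch):04X}" for ch in strip_variation_data(sample)])
```

Whether a real implementation strips these characters or simply renders them as zero-width is a design choice; either way the base characters are handled with the system default glyphs.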
The other properties for Ideograph Variation Selectors are:
general category: Mn
canonical combining class: 0
bidirectional category: NSM
decomposition mapping: none
numeric value: none
mirrored: no
case mapping: none
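The properties listed above can be compared against a local Unicode database. The sketch below uses Python's standard `unicodedata` module and assumes that U+E0100 (VARIATION SELECTOR-17) stands in for an Ideograph Variation Selector; the proposal itself does not fix that code point. The function name is hypothetical.

```python
import unicodedata

# Assumption: U+E0100 (VARIATION SELECTOR-17) is used as a stand-in.
SELECTOR = "\U000E0100"


def selector_properties(ch):
    """Collect the properties listed above from the local Unicode database."""
    return {
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_category": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),  # "" means none
        "numeric": unicodedata.numeric(ch, None),        # None means none
        "mirrored": bool(unicodedata.mirrored(ch)),
        "has_case_mapping": ch.upper() != ch or ch.lower() != ch,
    }


if __name__ == "__main__":
    for name, value in selector_properties(SELECTOR).items():
        print(f"{name}: {value!r}")
```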
The definition of the base character was one of the biggest problems in earlier versions of the ideograph variation selector proposal, because defining a single set of base characters is itself a controversial question that many researchers study in academia. However, it is no longer essential to identify which characters are worthy of being base characters. Allowing multiple variation collections, each identified by a variation set identifier, eliminates the dependency on a single definitive set of base characters and variation collection, which would require absolute accuracy and universal validity.
The base character in a variation sequence using Ideograph Variation Selectors must be an ideograph (CJK UNIFIED IDEOGRAPH-* or CJK COMPATIBILITY IDEOGRAPH-*, including those in plane 2). Noncharacters and reserved code points in plane 2 are excluded.
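A rough way to test this constraint is to look at the character name. The sketch below assumes that a name beginning with `CJK UNIFIED IDEOGRAPH-` or `CJK COMPATIBILITY IDEOGRAPH-` is an adequate proxy for the rule above; noncharacters and reserved code points have no name and are therefore rejected. The function name is hypothetical, and the result depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# Name prefixes treated as "ideograph" for this check (an assumption).
IDEOGRAPH_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH-",
    "CJK COMPATIBILITY IDEOGRAPH-",
)


def is_valid_base(ch: str) -> bool:
    """True if ch may serve as the base character of a variation sequence."""
    # Unassigned, reserved, and noncharacter code points have no name,
    # so the default "" is returned and the prefix test fails.
    name = unicodedata.name(ch, "")
    return name.startswith(IDEOGRAPH_NAME_PREFIXES)


if __name__ == "__main__":
    for ch in ("\u4e00", "\uf900", "\U00020000", "A", "\U0002FFFE"):
        print(f"U+{ord(ch):04X}", is_valid_base(ch))
```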
[2003-11-5] The idea of allocating variation sequences in a relatively wider code space, in order to eliminate stateful variation collections, was introduced at the IVS ad hoc meeting during UTC.
[2003-8-28] The idea of using the existing variation selectors by changing their properties, instead of creating a separate set of ideograph variation selectors, was contributed by Rick McGowan and Kenneth Whistler.
[2003-8-29] The idea of using an end tag, instead of leaving the end of an ideograph variation collection unspecified, was contributed by Rick McGowan.
[2003-8-29] The idea of separating the registration authority discussion from this proposal was contributed by Mike Ksar.
[2003-9-16] The idea of including the use of URIs as a normative part of this specification, instead of in a separate specification, for the case where the registration authority is skipped, was contributed by Takayuki K Sato.
[2003-10-23] Kenneth Whistler suggested spelling out clearly that the variation collection identifier using tag characters is also a higher-level protocol implemented in plain text.
[2003-10-23] Kenneth Whistler suggested splitting this proposal into two, one for the IVS semantics change only and one for the rest, in order to advance this proposal to UAX/UTS/UTR.
Hideki Hiura, OpenI18N.org, Sun Microsystems
Tatsuo Kobayashi, Justsystem
Yasuo Kida, Apple Computer
Eric Muller, Adobe Systems
Ken Lunde, Adobe Systems
Michel Suignard, Microsoft Corp
John Jenkins, Apple Computer
Rick McGowan, Unicode Consortium
Kenneth Whistler, Sybase
Mike Ksar, Microsoft Corp
Richard Cook, UC Berkeley
Tom Bishop, Wenlin Institute, Inc.
Dirk Meyer, Adobe Systems
John Renner, Adobe Systems
Deborah Goldsmith, Apple Computer
Yasuhira Anan, Microsoft Corp
Cathy Wissink, Microsoft Corp
Jim DeLaHunt, Adobe Systems
Lee Collins, Apple Computer