From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 26 2004 - 08:04:21 CST
From: "Doug Ewell" <dewell@adelphia.net>
> My impression is that Unicode and ISO/IEC 10646 are two distinct
> standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
> which have pledged to work together to keep the standards perfectly
> aligned and interoperable, because it would be destructive to both
> standards to do otherwise.  I don't think of it at all as the "slave and
> master" relationship Philippe describes.
Probably not under the assumptions that lead one to think of "slave and
master", but it is still true that there can be only one standards body for
the character repertoire, and one formal process for adding new characters,
even if two standards bodies are *working* (I don't say *deciding*) in
cooperation.
The alternative would have been for UTC and WG2 to each be allocated some
code space in which to make the allocations they want, but with the risk of
duplicate assignments. I really prefer to see the system as a "master and
slave" relationship, because it gives a simpler view of how characters can
be assigned in the common repertoire.
For example, Unicode has no more rights than the national standardization
bodies involved in ISO/IEC WG2. All of them make proposals, amend proposals,
suggest modifications, or negotiate to turn informal drafts into a final
specification. All I see in the Unicode standardization process is that it
will eventually approve a proposal, but Unicode cannot declare it a standard
until there has been a formal agreement at ISO/IEC WG2, which actually rules
the effective allocations in the common repertoire, even if most of the
preparation work has been heavily discussed within UTC, which finalizes the
proposal with Unicode partners or with ISO/IEC members.
At the same time, ISO/IEC WG2 also studies proposals made by other
standardization bodies, including specifications prepared by other ISO
working groups or by national standardization bodies. Unicode is not the
only approved source of proposals and specifications for ISO/IEC WG2 (and I
tend to think that Unicode best represents the interests of private
companies, whilst national bodies are most often better represented by
their permanent membership at ISO, where they have full rights to vote on
or veto proposals according to their national interests...)
The Unicode standard itself agrees to follow the ISO/IEC 10646 allocations
in the repertoire (character names, representative glyphs, code points, and
code blocks); in exchange, ISO/IEC has agreed with Unicode not to decide on
character properties or behavior (which are defined either by Unicode, or
by national standards based on the ISO/IEC 10646 coded repertoire, for
example the Chinese GB18030 standard, or by other ISO standards like ISO
646 and ISO 8859).
So, even if the UTC decides to veto a proposal submitted by Unicode
members, nothing prevents the same members from finding allies within
national standards bodies, so that they submit the (modified) proposal to
ISO/IEC WG2 instead of through Unicode, which refuses to transmit that
proposal.
Let me give a recent example: the UTC voted against the allocation of a new
invisible character with the properties of a letter, zero width, and the
same break opportunities as letters, considering that the existing NBSP was
enough, despite the various complexities caused by the normative properties
of NBSP when it is used as a base character for combining diacritics. This
proposal (previously under informal discussion) was rejected by UTC, but
this leaves Indian and Israeli standards with complex problems for which
Unicode proposes no easy solution.
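
To see why NBSP is an awkward carrier for a diacritic, consider this small
Python sketch (the property value shown is standard; the naive tokenizer is
only an illustration, not any particular implementation):

import re
import unicodedata

NBSP = "\u00A0"
print(unicodedata.category(NBSP))   # 'Zs': a space character, not a letter

# NBSP carrying a combining acute accent, embedded between two word parts.
text = "ab" + NBSP + "\u0301" + "cd"

# A naive tokenizer: in Python 3 string patterns, \s matches NBSP, so the
# NBSP is treated as a separator and the accent is orphaned onto "cd".
print(re.split(r"\s+", text))       # ['ab', '\u0301cd']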
So nothing prevents India and Israel from reformulating the proposal at
ISO/IEC WG2, which may then accept it, even if Unicode previously voted
against it. If ISO/IEC WG2 accepts the proposal, Unicode will have no
choice but to accept it in the repertoire, and thus to give the new
character correct properties. Such a proposal will be easily accepted by
ISO/IEC WG2 if India and Israel demonstrate that the allocation allows
making distinctions which are tricky, computationally difficult, or
ambiguous to resolve when using NBSP. With a new distinct character, on the
contrary, ISO/IEC 10646 members can demonstrate to Unicode that defining
its Unicode properties is not difficult, and that it simplifies the problem
of correctly representing complex cases found in large text corpora.
Unicode may consider this a duplicate allocation, because there will exist
cases where two encodings are possible, but they do not pose the same
difficulties for implementations of applications like full-text search,
collation, or determination of break opportunities, notably in the many
cases where the current Unicode rules already contradict the normative
behavior of existing national standards (like ISCII in India). My opinion
is that the two encodings will survive side by side, but text encoded with
the new preferred character will be easier to process correctly, and over
time the legacy encodings using NBSP would be deprecated by usage, making
the duplicate encodings less of a critical issue for the many applications
that, for simplicity, are written with partial implementations of the
Unicode properties... Legacy encodings will still exist, but users of these
encoded texts will be given the option of recoding their texts to match the
new preferred encoding, without changing their applications.
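
Such a recoding could be mechanical. Here is a hedged sketch in Python; the
new character does not exist, so I use a Private Use Area code point,
U+E000, purely as a placeholder for whatever ISO/IEC 10646 would assign:

import unicodedata

NBSP = "\u00A0"
INVISIBLE_LETTER = "\uE000"  # placeholder only, not a real assignment

def recode_legacy(text):
    """Replace an NBSP that immediately precedes a combining mark (i.e.
    an NBSP used as a diacritic carrier) with the new character, leaving
    real no-break spaces untouched."""
    out = []
    for i, ch in enumerate(text):
        if (ch == NBSP and i + 1 < len(text)
                and unicodedata.combining(text[i + 1]) != 0):
            out.append(INVISIBLE_LETTER)
        else:
            out.append(ch)
    return "".join(out)

print(recode_legacy("ab" + NBSP + "\u0301" + "cd"))  # carrier replaced
print(recode_legacy("12" + NBSP + "345"))            # real NBSP preserved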
Unicode already has tons of apparent duplicate encodings (see for example
the non-canonically-equivalent strings that can be created with multiple
diacritics of the same combining class, even though they cannot be made
visually distinct, for example with some Indic vowels, or with the
presentation of some diacritics like the cedilla on some Latin letters; see
also the characters that should have been defined as canonically equivalent
but are not now, because Unicode has made string equivalence classes
irrevocable, i.e. "stable", under an agreement signed with other standards
bodies). Some purists may think that adding new apparent duplicates is a
problem, but it will be less of a problem if users of the national
standards tied to some scripts are exposed to tricky problems or
ambiguities with the legacy encoding that simply do not appear with the
encoding using the new separate allocation.
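
One such duplicate pair is easy to exhibit: canonical normalization
reorders combining marks of *different* combining classes, but never marks
of the *same* class, so two orderings of same-class marks remain distinct
strings after NFC or NFD, whether or not a renderer displays them
distinctly. A minimal Python demonstration:

import unicodedata

s1 = "a\u0301\u0308"   # a + acute + diaeresis (both combining class 230)
s2 = "a\u0308\u0301"   # a + diaeresis + acute

print(unicodedata.combining("\u0301"),
      unicodedata.combining("\u0308"))   # 230 230: same class
print(unicodedata.normalize("NFC", s1) ==
      unicodedata.normalize("NFC", s2))  # False: still distinct strings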
The interests of Unicode and ISO/IEC 10646 are diverging: Unicode works so
that the common repertoire can be handled by the existing software created
by its private members, while ISO/IEC 10646 members are concerned first
with the correct representation of their national languages, without loss
of semantics.
In some cases, this correct representation conflicts with the simplest
forms of implementation in Unicode-enabled software, requiring unjustified
use of large datasets to handle many exceptions; in the absence of such a
dataset, the text is given wrong interpretations, so that text processing
loses or changes parts of its semantics. (Note that many of the ambiguities
come from the Unicode standard itself, as is the case for the normative
behavior of NBSP at the beginning of a word, or after a breakable SPACE...
sometimes because of omissions in past versions of the standard, or because
of unsuspected errors...)
The easiest solution to this problem is to make it simpler to handle, by
using separate encodings when this resolves the difficult ambiguities
(notably when there is ambiguity about which Unicode version, or which of
its addenda or corrigenda, applied when the text was encoded), and then
publishing a guide that gives clearly separate interpretations (semantics)
for texts coded with the legacy character and texts coded with the new
apparent "duplicate" character.
The complex solution is to modify the Unicode algorithms, and this may be
even more difficult if it touches the Unicode core standard or one of its
standard annexes, or involves one of the normative character properties
(like general categories or combining classes), or the script
classification of characters (script-specific versus common).