From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Dec 07 2007 - 15:09:23 CST
Karl Pentzlin asked:
> This leads to my questions:
> a.) Why U+FD3E has GC property Ps and U+FD3F has Pe, and not vice
> versa?
> b.) Why U+FD3E and U+FD3F have the Bidi_mirroring property not set?
There is lots of history for this -- some of which was
hinted at by the other responses.
The most recent round on this dates to PRI #80, which was
closed in February 2006. PRI #80 requested feedback on a
proposal to change the Bidi_Mirrored property for a bunch
of characters that had until then been Bidi_Mirrored=False,
including a number of directional quotation marks and
these two ornate parentheses.
Part of the feedback received on PRI #80 was a specific
request *not* to change the Bidi_Mirrored property for the two
parentheses -- feedback received from Roozbeh Pournader on
behalf of the High Council of Informatics of Iran -- because
doing so would invalidate all existing usage of them
in Persian. It would put them in conflict with the Iranian
standard, which had been deliberately aligned with Unicode
for bidirectional behavior.
As a result of that feedback, in the resolutions dealing
with PRI #80, the UTC explicitly decided to exempt the
two parentheses from the change proposed for the other
characters (mostly the quotation marks):
"106-M3 Motion: Drop U+FD3E ORNATE LEFT PARENTHESIS and
U+FD3F ORNATE RIGHT PARENTHESIS from the list of
characters with Bidi_Mirrored property proposed in Public
Review Issue 80."
That motion carried. And as a result those characters were
not changed to Bidi_Mirrored=True in Unicode 5.0.
The fact that (controversially) some quotation marks were
changed to Bidi_Mirrored=True as a result of PRI #80, and
that that decision was itself later reversed, is related to the
question of the Bidi_Mirrored property for the two ornate
parentheses, of course, but the actual decision paths taken
for the two different sets of characters forked as of
that February 2006 motion.
As to the *ancient* history here, the fact is that the ornate
parentheses have never changed property values in the standard.
UnicodeData-2.0.14.txt (July 1996)
FD3E;ORNATE LEFT PARENTHESIS;Ps;0;ON;;;;;N;;;;;
FD3F;ORNATE RIGHT PARENTHESIS;Pe;0;ON;;;;;N;;;;;
UnicodeData-5.0.0.txt (July 2006)
FD3E;ORNATE LEFT PARENTHESIS;Ps;0;ON;;;;;N;;;;;
FD3F;ORNATE RIGHT PARENTHESIS;Pe;0;ON;;;;;N;;;;;
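For anyone who wants to check those records mechanically, here is a minimal sketch in Python that splits the four lines quoted above on semicolons and pulls out the fields at issue. It assumes only the standard UnicodeData.txt field order (field 2 is General_Category, field 4 is Bidi_Class, field 9 is Bidi_Mirrored); the names RECORDS and parse_record are made up for illustration and are not part of any Unicode tooling.

```python
# Sketch only: RECORDS and parse_record are hypothetical names for illustration.
# The data lines are copied verbatim from the two files quoted in this message.
RECORDS = {
    "UnicodeData-2.0.14.txt": [
        "FD3E;ORNATE LEFT PARENTHESIS;Ps;0;ON;;;;;N;;;;;",
        "FD3F;ORNATE RIGHT PARENTHESIS;Pe;0;ON;;;;;N;;;;;",
    ],
    "UnicodeData-5.0.0.txt": [
        "FD3E;ORNATE LEFT PARENTHESIS;Ps;0;ON;;;;;N;;;;;",
        "FD3F;ORNATE RIGHT PARENTHESIS;Pe;0;ON;;;;;N;;;;;",
    ],
}


def parse_record(line):
    """Split one UnicodeData.txt record and keep the fields discussed here."""
    fields = line.strip().split(";")
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "gc": fields[2],                    # General_Category
        "bidi_class": fields[4],            # Bidi_Class
        "bidi_mirrored": fields[9] == "Y",  # Bidi_Mirrored is "Y" or "N"
    }


parsed = {
    version: [parse_record(line) for line in lines]
    for version, lines in RECORDS.items()
}

for version, records in parsed.items():
    for rec in records:
        print(
            f"{version}: U+{rec['code']:04X} gc={rec['gc']} "
            f"bc={rec['bidi_class']} mirrored={rec['bidi_mirrored']}"
        )

# True if the extracted values are identical in both quoted versions.
unchanged = parsed["UnicodeData-2.0.14.txt"] == parsed["UnicodeData-5.0.0.txt"]
print("unchanged between the two files:", unchanged)
```

The sketch deliberately works from the quoted text rather than from a live character database, since a library's data reflects whatever Unicode version it was built against.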
The assignment of gc=Ps and gc=Pe was not a mistake. All
left/right parenthesis and bracket pairs are given gc=Ps for
the "LEFT" of the pair and gc=Pe for the "RIGHT" of the pair.
The reason why these two were not given the Bidi_Mirrored
property in the first place is pretty simple: it was a
bit of a catch-22 in late 1995 when data files for Unicode 2.0
were being developed. UnicodeData-2.0.14.txt was synched
with what became Table 4-7, Mirrored Characters, in the printed text of
Unicode 2.0, pp. 4-22..4-25. Significantly, *that* also
had to be synchronized with Annex C in ISO/IEC 10646-1:1993,
which *also* provided an explicit list of "Mirrored characters
in Arabic bi-directional context". And *that* corresponded
to the list in Appendix G, "Symmetric Swapping Characters"
in Unicode 1.1 (1993), which was deliberately checked to
assure it was the same as Annex C in 10646-1:1993.
The machine-readable form of this goes back to June, 1994, when
the availability of what is now archived as UnicodeData-1.1.5.txt
was announced:
"We would like to announce the availability of a new updated and comprehensive
data file for Unicode characters. This file not only includes all the
information in previous names list, but also incorporates the Unicode 1.1 data
for each character: canonical ordering priority, bidirectional categories,
character decomposition, numeric values, symmetric swapping. It also adds new
                                         ^^^^^^^^^^^^^^^^^^
information on character categorization and upper/lower/title case mappings."
But the list goes back way further. The list in 10646-1:1993
was published in DIS 10646-1.2:1992 (26 December 1991), so
it was actually developed in 1991, during the period of
hectic and shall we say "stimulating" ;-) merger discussions
for Unicode and 10646.
The existence of FD3E and FD3F as a pair of parentheses in
the then-draft standard was simply overlooked by the people
working on finding all the mirrored characters
for the 10646-1 Annex listing them. In fact the *parenthesis* and
*bracket* characters were also explicitly listed in the normative
Clause 20 of the 10646 standard at the time, as well. This
was being done at a time when there were still no machine-readable
files for character properties, nor any explicitly written-out
specification of the Bidirectional Algorithm.
The whole issue was murkified at the time because there was
an argument going on whether compatibility with
implementations that controlled glyph swapping with explicit
control codes was also required.
DIS 10646-1.2:1992 did not contain symmetric swapping controls,
but I note the following comment in minutes from the
UTC New Scripts Committee meeting of May 2, 1992:
"A lively discussion of mirrored characters in Unicode BIDI ensued. The
subcommittee decided that although we believe codes to control mirroring are
at the wrong level, we would agree to support adding them if required in
order to make the merger with 10646 work. These additions were proposed by
Israel, as amended by Isai Scheinberg."
And in fact, that was the origin of the following two characters,
which ended up in 10646-1:1993:
U+206A INHIBIT SYMMETRIC SWAPPING
U+206B ACTIVATE SYMMETRIC SWAPPING
Of course those characters were always deprecated from the point
of view of the Unicode Bidirectional Algorithm, but the fact
that people were still arguing about including them as of 1992
indicates that at the time the architecture for bidirectional
behavior was still somewhat in play, and people tended to focus
on that to the detriment of detailed examination of consistency
of property assignments.
Another thing to note is that as of 1993, characters higher
than U+DFFF were still officially designated as the "R-zone"
(or "Restricted Use zone") -- a kind of toxic zone where
private use, presentation forms, and compatibility characters
lived their stunted lives. It would have been difficult at the
time to get people to take those characters as serious
participants in property lists, particularly ones listed
as "presentation forms", which were taken just to be glyphs,
basically. It was only later that implementations started
cherry-picking the R-zone for "real" characters, and eventually
the R-zone designation itself was dropped and many other
characters started getting encoded in the high parts of the BMP.
So, to make a very, very long story somewhat shorter, there
was an early convergence of effects that resulted in the
situation where FD3E/FD3F did not have Bidi_Mirrored=True
as of Unicode 2.0:
1. Oversight by those drafting the initial mirrored list for 10646.
2. The fact that these were encoded among Arabic presentation
   forms in the R-zone, which was a strike against them even
   if they had been noticed.
3. Concern by the UTC in the development of Unicode 2.0 to
   focus on consolidation and correct synchronization, rather
   than putting too many issues in play by trying to "fix" things
   all at once, particularly for Bidi, which was difficult
   and painful at the time. (When isn't it? *hehe*)
There was a window of opportunity between Unicode 2.0 and
Unicode 3.0, when the UTC *did* focus on updating and correcting
a lot of character properties, and adding lots of new ones.
And during that time, the process for synchronizing work between
WG2 and UTC started to get ironed out, so it probably would have
been possible to fix the Bidi_Mirrored property for FD3E/FD3F
then, had anybody actually noticed.
But after the publication of Unicode 3.0, implementation
stability started to matter more and more with the passing
years. Normalization was already locked down, and bidirectional
properties started to get hard to modify. By now, implementations
are widespread enough, and there is enough data that essentially
nobody wants to "fix" things any more. It is far less damaging
to live with past mistakes in properties than to try to "fix" them now --
which tends to have the paradoxical effect of breaking
implementations and invalidating data.
Or to put it very shortly, this saga is another illustration
of the Mick Jagger Principle of Character Encoding:
You can't always get what you want!
--Ken