Authors: John I. McConnell
JohnMcCo@microsoft.com,
This memo describes a proposal to change the bidirectional category of the SOLIDUS character in the Unicode 2.0 character database. Specifically, it would change the category from European Separator to Common Separator. The effect of this change is to alter the visual order of text containing SOLIDUS and text from right-to-left writing systems such as Arabic and Hebrew. The overall intent of the proposal is to better match such behavior with user expectations and existing practice.
If the Consortium accepts the proposal, it would also require changing the entries in Table 3-5 on page 3-17 and Table 4-4 on page 4-11 of the Unicode Standard. Note that there are no changes required to the Unicode bidirectional algorithm itself.
With the introduction of the first Unicode-based software in the Middle East, users now have some experience with conversion of existing data to Unicode. Although the transition has been smooth, there have been some difficulties with fractions.
This section shows the effect of the proposed changes on two important cases: fractions and dates. In each test case we follow the same conventions as the Unicode 2.0 book, that is, uppercase letters correspond to strong right-to-left characters whereas lowercase letters correspond to strong left-to-right characters. In addition, we have also included examples using Arabic and Hebrew text. In all the examples except as noted, the embedding level is right-to-left. Results that differ from the current values in Unicode 2.0 are shaded.
The proposed change affects only the resolution of weak neutrals in steps P0 through P5 of the Unicode Bidirectional Algorithm. This limits the change in behavior to cases where SOLIDUS is adjacent to numbers.
Table 2 Fractions
Logical Order | Current Visual Order | Proposed Visual Order
ADD 1/2 CUP (Arabic) | PUC 2/1 DDA | PUC 1/2 DDA
ADD 1/2 CUP (Hebrew) | PUC 1/2 DDA | PUC 1/2 DDA
There are many date formats but the proposed changes would affect one frequently used form.
Table 3 Dates
Logical Order | Current Visual Order | Proposed Visual Order
MEET ON 01/23/45 (Hebrew) | 01/23/45 NO TEEM | 01/23/45 NO TEEM
MEET ON 01/23/96 (Arabic) | 96/23/01 NO TEEM | 01/23/96 NO TEEM
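The rows of Tables 2 and 3 can be reproduced with a heavily simplified model of the algorithm. The sketch below is illustrative only and is not the normative procedure: the function name `visual_order` and its parameters are hypothetical, it assumes that digits in the "(Arabic)" rows resolve to Arabic Number and digits in the "(Hebrew)" rows stay European Number, and it models only uppercase right-to-left letters, digits, SOLIDUS, and spaces in a right-to-left paragraph. The separator rule follows the behavior described in this memo: a European Separator joins only two European numbers, whereas a Common Separator joins two numbers of the same type.

```python
def visual_order(text, solidus_class, arabic_digits):
    """Visual order of `text` in a right-to-left paragraph (simplified).

    solidus_class: "ES" (current) or "CS" (proposed) for '/'.
    arabic_digits: True if digits resolve to Arabic Number (AN),
                   False if they stay European Number (EN).
    Only uppercase letters (strong right-to-left), digits, '/' and
    spaces are modeled; left-to-right letters are out of scope.
    """
    types = []
    for ch in text:
        if ch.isupper():
            types.append("R")
        elif ch.isdigit():
            types.append("AN" if arabic_digits else "EN")
        elif ch == "/":
            types.append(solidus_class)
        else:
            types.append("ON")  # spaces and anything else: neutral

    # A single separator between two numbers joins them:
    # ES only between European numbers, CS between numbers of one type.
    resolved = list(types)
    for i in range(1, len(types) - 1):
        before, after = types[i - 1], types[i + 1]
        if types[i] == "ES" and before == after == "EN":
            resolved[i] = "EN"
        elif types[i] == "CS" and before == after and before in ("EN", "AN"):
            resolved[i] = before

    # Numbers sit one level above the right-to-left text (level 1).
    # Unjoined separators and spaces act as neutrals and stay at level 1.
    levels = [2 if t in ("EN", "AN") else 1 for t in resolved]

    # Reordering: reverse runs at level >= 2, then the whole line.
    chars = list(text)
    for level in (2, 1):
        i = 0
        while i < len(chars):
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            if j > i:
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


for label, arabic in (("Arabic", True), ("Hebrew", False)):
    for cls in ("ES", "CS"):
        print(label, cls, visual_order("ADD 1/2 CUP", cls, arabic))
```

Under these assumptions the only result that changes between "ES" and "CS" is the Arabic case, which is the behavior the tables record. An unjoined separator stays at the paragraph level, so each number is reordered on its own and the fraction or date appears reversed.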
Without explicit formatting, it is impossible for both dates and fractions to display properly. Although the date change is undesirable, our users would prefer to have fractions correct rather than dates. There seem to be several reasons for this preference:
Although there are some tradeoffs, the authors believe that this proposal would more closely match user expectations for visual order of right-to-left text and expedite the development of software for regions that use such text. This improvement would promote the acceptance of Unicode for an important emerging software market.
Proposed Correction to Mirroring List
Both Unicode 2.0 and ISO 10646 define a normative list of mirrored characters. We believe that four characters have been omitted from these lists. Specifically, the four characters in Table 1 should be added to the lists of characters with the mirroring property.
Code Point | Glyph | Unicode 2.0 Name
0x00AB | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
0x00BB | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
0x2039 | ‹ | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x203A | › | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
Although the use of these characters varies, the mirroring behavior is unambiguous. For example, those printing traditions that use the left-pointing quotation mark to begin a left-to-right quotation use the right-pointing quotation mark to begin a right-to-left quotation and vice versa.
This correction would also reconcile the mirroring behavior of these characters with their cross-referenced characters such as 0x226A MUCH LESS-THAN and 0x300A LEFT DOUBLE ANGLE BRACKET. All of these related characters are listed as mirroring.
The effect of the correction would be to add these four characters to Table 4-7 in the Unicode 2.0 book.
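As a small illustration of what the mirroring property means for these four characters, the sketch below swaps each quotation mark for its counterpart when the character resolves to a right-to-left (odd) level. The table `MIRROR_PAIRS` and the function `apply_mirroring` are hypothetical names introduced here, and the lookup covers only the characters in this proposal, not the full normative list.

```python
# Hypothetical lookup covering only the four characters in this proposal.
MIRROR_PAIRS = {
    "\u00AB": "\u00BB",  # « -> »
    "\u00BB": "\u00AB",  # » -> «
    "\u2039": "\u203A",  # ‹ -> ›
    "\u203A": "\u2039",  # › -> ‹
}


def apply_mirroring(text, levels):
    """Swap mirrored characters that resolve to an odd (right-to-left) level."""
    return "".join(
        MIRROR_PAIRS.get(ch, ch) if level % 2 == 1 else ch
        for ch, level in zip(text, levels)
    )


quoted = "\u00ABABC\u00BB"
# Every character at level 1 (right-to-left): the marks are swapped.
print(apply_mirroring(quoted, [1] * len(quoted)))
# Every character at level 0 (left-to-right): the text is unchanged.
print(apply_mirroring(quoted, [0] * len(quoted)))
```

The levels are passed in directly to keep the example short; a real renderer would take them from the bidirectional algorithm.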
Authors: John I. McConnell
JohnMcCo@microsoft.com,
This memo describes a proposal to change the value of four entries in the Unicode 2.0 character database. The latest version of this database is available from the Unicode Consortium web site at
ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt. Specifically, the proposal changes the bidirectional category of four characters. Table 1 lists these changes.
Table 1 Proposed Changes
Code Value | Glyph | Unicode 2.0 Character Name | Current Bidirectional Category | Proposed Bidirectional Category
U+002E | . | FULL STOP | European Separator | Common Separator
U+2007 | | FIGURE SPACE | European Separator | Common Separator
U+0026 | & | AMPERSAND | Strong Left-to-Right | Other Neutral
U+0040 | @ | COMMERCIAL AT | Strong Left-to-Right | Other Neutral
The effect of these changes is to alter the visual order of text containing these characters and text from right-to-left writing systems such as Arabic and Hebrew. The overall intent of the proposal is to better match such behavior with user expectations and existing practice.
The change to FULL STOP improves the display of decimal numbers. The changes to AMPERSAND and COMMERCIAL AT improve the display of certain email addresses and URLs. The change to FIGURE SPACE is for consistency with other characters and has no significant effect on existing data. All four changes affect only the resolution of weak neutrals (steps P0 through P3). This limits the changes of behavior to cases where these characters are adjacent to numbers.
If the Consortium accepts the proposal, it would also require changing a few entries in Table 3-5 on page 3-17 and Table 4-4 on page 4-11 of the Unicode Standard. Note that there are no changes required to the Unicode bidirectional algorithm itself.
One of the implicit design criteria for the Unicode Bidirectional Algorithm is that most text should not require explicit directional formatting codes. The current bidirectional category assignments reflect the designers’ best attempt to meet that criterion. However, since the publication of Unicode, there have been two developments that alter the basis for those initial assignments.
The first development is that the first Unicode-based software, for example Microsoft Office, has reached the regions that use these writing systems, and customers have had a chance to voice their opinions. In particular, users have now had a chance to see the results of converting existing documents to Unicode. Although the conversion has generally gone smoothly, users have complained about some specific behaviors and the corruption of their data.
The second development is the growth of the PC and the Internet. These phenomena are worldwide and have affected popular culture in many regions. In particular, they have introduced important new uses for some characters that were formerly rare or non-existent in the local writing system. For example, COMMERCIAL AT, while originally a Latin-specific character, has become common worldwide because of its use in email addresses. Right-to-left text within a URL is still relatively rare, and we do not claim that our proposal is a general solution. But our proposal would improve the display of the most common occurrences today, namely right-to-left user names and text queries.
Although it is always a serious matter to change a standard, the Consortium has a small window of opportunity to do so now with minimum detriment. The amount of affected software is still small. There are currently few Java implementations with bidirectional support on the market although several are in late stages of development. There are proposals to extend URLs to UTF-8 but few commercial implementations yet. In a few months it may well be too late.
This section shows the effect of the proposed changes on several important cases. In each test case we follow the same conventions as the Unicode 2.0 book, that is, uppercase letters correspond to strong right-to-left characters whereas lowercase letters correspond to strong left-to-right characters. In addition, we have also included examples using Arabic and Hebrew text. In all the examples except as noted, the embedding level is right-to-left. Results that differ from the current values in Unicode 2.0 are shaded.
Customer feedback has shown that both the COMMA and the FULL STOP are used as decimal points in Arabic. Currently, the COMMA has the bidirectional category COMMON SEPARATOR whereas the FULL STOP has the category EUROPEAN SEPARATOR. The proposal would give both characters the category COMMON SEPARATOR.
Table 2 Decimal Numbers
Logical Order | Current Visual Order | Proposed Visual Order
ADD 0.5 CUPS (Arabic) | SPUC 5.0 DDA | SPUC 0.5 DDA
ADD 0.5 CUPS (Hebrew) | SPUC 0.5 DDA | SPUC 0.5 DDA
Of the proposed changes, only the COMMERCIAL AT sign has a significant effect on the layout of email addresses. Although the use of right-to-left characters is non-standard, there is growing use of Arabic characters in the Middle East for the username portion of the address, especially on intranets. The domain names remain left-to-right.
Table 3 Email Addresses
Logical Order | Current Visual Order | Proposed Visual Order
ALI@unicode.org | @unicode.comILA |
According to RFC 1738 section 2.2, the seven characters listed in Table 4 have special meaning within a URL. These characters are reserved exclusively for use by schemes. There are at least two proposals to extend the URL syntax to the entire Unicode repertoire using UTF-8. Commercial implementations are likely to appear within a year. With the advent of right-to-left characters within URLs, proper display would require changes to the bidirectional category. This proposal would have the effect of making all of these special characters neutrals. This would reduce the need for explicit formatting characters in URLs.
Table 4 URL Reserved Characters
Code Value | Glyph | Unicode 2.0 Character Name | Current Bidirectional Category | Proposed Bidirectional Category
U+0026 | & | AMPERSAND | Strong Left-to-Right | Other Neutral
U+002E | . | FULL STOP | European Separator | Common Separator
U+002F | / | SOLIDUS | European Separator | Common Separator
U+003A | : | COLON | Common Separator | Common Separator
U+003D | = | EQUALS SIGN | Other Neutral | Other Neutral
U+003F | ? | QUESTION MARK | Other Neutral | Other Neutral
U+0040 | @ | COMMERCIAL AT | Strong Left-to-Right | Other Neutral
There are many schemes that use these characters, so it is impossible to list all test cases. However, a typical use would be to separate parameters. For example, a query to a search engine might use the ‘&’ to separate the search parameter from other parameters.
Table 5 URL Scheme
Logical Order | Current Visual Order | Proposed Visual Order
…&query=ALI&… | …&ILA=&query | …&ILA=query&…
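The difference between the two visual orders in Table 5 comes down to whether the ampersand is a strong left-to-right character or a neutral. The sketch below is a simplified model under stated assumptions, not the normative algorithm: the function `visual_order` and its parameter `strong_ltr_marks` are hypothetical, the paragraph is right-to-left, uppercase letters stand for strong right-to-left characters and lowercase for strong left-to-right, and digits and explicit formatting codes are ignored. The ellipses from the table are omitted from the test string.

```python
def visual_order(text, strong_ltr_marks=""):
    """Visual order of `text` in a right-to-left paragraph (simplified).

    Uppercase letters are strong right-to-left, lowercase letters are
    strong left-to-right. Characters listed in `strong_ltr_marks` are
    treated as strong left-to-right; all other characters are neutral.
    """
    types = []
    for ch in text:
        if ch.isupper():
            types.append("R")
        elif ch.islower() or ch in strong_ltr_marks:
            types.append("L")
        else:
            types.append("N")

    # Neutrals take the direction of the surrounding strong text when both
    # sides agree; otherwise they take the paragraph direction (R here).
    # The start and end of the text count as the paragraph direction.
    resolved = list(types)
    for i, t in enumerate(types):
        if t != "N":
            continue
        before = next((x for x in reversed(types[:i]) if x != "N"), "R")
        after = next((x for x in types[i + 1:] if x != "N"), "R")
        resolved[i] = before if before == after else "R"

    # Left-to-right text sits one level above the right-to-left paragraph.
    levels = [2 if t == "L" else 1 for t in resolved]

    # Reordering: reverse runs at level >= 2, then the whole line.
    chars = list(text)
    for level in (2, 1):
        i = 0
        while i < len(chars):
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            if j > i:
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


# Current assignment: '&' and '@' are strong left-to-right.
print(visual_order("&query=ALI&", strong_ltr_marks="&@"))
# Proposed assignment: '&' and '@' are neutral.
print(visual_order("&query=ALI&"))
```

With the ampersand treated as strong left-to-right it attaches to the left-to-right word `query`, so the parameter name and its value separate; as a neutral it takes the paragraph direction and the parameter stays together.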
Microsoft Windows assigns a special role to the ampersand in resources such as menu items and dialogs: the character following the ampersand is the keyboard shortcut for that item. Unfortunately, the strong left-to-right attribute of the ampersand causes such resource files to display improperly when localizers are editing the resources. Although this is platform specific, Windows software accounts for an enormous percentage of localized software in right-to-left scripts, and many third-party products and tools depend on this behavior. Changing the category would remove a considerable obstacle to localization of software for the Middle East.
Table 6 Resource Shortcuts
Logical Order | Current Visual Order | Proposed Visual Order
| &TNIRP | TNIRP&
The authors believe that this proposal would more closely match user expectations for visual order of right-to-left text and expedite the development of software for regions that use such text. This improvement would promote the acceptance of Unicode for an important emerging software market.