Revision of UBA for improved display of URL/IRIs
Contents
Recognizing IRIs in plain text
Proposed extension of UBA for bidi_IRIs
The Unicode Bidirectional Algorithm (UBA), specified in Unicode Standard Annex #9, was designed for handling ordinary text, and predates the rise of the web. Unfortunately, IRI/URLs* are not ordinary text; they are syntactically complex in ways that don’t work well with the UBA. As a result, IRIs that contain right-to-left text (such as Arabic or Hebrew) appear jumbled, to the point where they are uninterpretable, misleading, or ambiguous. In particular, the ambiguous displays could cause security problems.
*Formally speaking, what we are talking about are IRIs, although most end-users know them as URLs. For more information, see idn-and-iri.
For example, consider the IRIs in Table 1. These sample IRIs are shown in “memory” order first; uppercase bold represents right-to-left characters.
Table 1. Sample IRIs, Memory order
IRIs | Comments on fields |
http://ab.cd.com/mn/op | All LTR characters |
http://ab.cd.EF.GH.com/IJ/KL/mn/op | Mixture of RTL and LTR fields |
http://EF.GH/IJ/KL | All RTL characters (except the scheme “http”) |
They would be displayed as in Table 2 by any Unicode-compliant implementation.
Table 2. Sample IRIs, Display order
Environment | Display | Details |
LTR | http://ab.cd.com/mn/op http://ab.cd.HG.FE.com/LK/JI/mn/op http://LK/JI/HG.FE | |
RTL | http://ab.cd.com/mn/op mn/op/LK/JI/com.HG.FE.http://ab.cd |
People have been looking for an extension to the Unicode Bidirectional Algorithm (UBA) that handles IRIs in a more consistent way for bidi users. The general goal of such an extension would be for the “fields” of an IRI to flow in a consistent direction.
Deployment of such an extension requires consistency in usage across different applications. For example, when someone copies the contents of an address bar into an email, we don’t want all the fields in the IRI to switch around. Attaining such consistency would require a general extension to the UBA to indicate how IRIs should be displayed, both in the limited context of an address bar, and in plain text.
Developing such an extension poses two problems:
The first issue can be addressed with the following approach. While in theory, almost any Unicode character can occur in fields in an IRI, in practice many characters have very restricted usage in IRIs. One can take advantage of this pattern by defining a restricted and simplified syntax for IRIs, which captures the majority of actual practice. This syntax can then be used to define display of IRIs for UBA. IRIs that need to use characters outside of this restricted syntax can still be appropriately displayed by the UBA by representing those characters with % escapes.
For the second issue, a common technique can be applied: recognize a list of TLDs in context. For example, microsoft.com and google.de can both be recognized (see http://www.iana.org/domains/root/db/).
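The TLD technique can be sketched as follows. This is a minimal illustration, not part of the proposal: `SAMPLE_TLDS` is a hypothetical four-entry stand-in for the full IANA list, `find_domains` is an invented helper name, and the label pattern is simplified to word characters and hyphens rather than the UTS46Chars used in the BNF.

```python
import re

# Hypothetical sample; a real implementation would load the full IANA root zone list.
SAMPLE_TLDS = {"com", "de", "org", "net"}

# Two or more labels joined by dots (simplified label syntax, an assumption).
CANDIDATE = re.compile(r"[\w-]+(?:\.[\w-]+)+")

def find_domains(text):
    """Return dotted names in text whose final label is a known TLD."""
    found = []
    for match in CANDIDATE.finditer(text):
        name = match.group(0)
        last_label = name.rsplit(".", 1)[1]
        if last_label.lower() in SAMPLE_TLDS:
            found.append(name)
    return found
```

With this sample list, a dotted name such as foo.bar is not recognized, because "bar" is not in the set, while microsoft.com and google.de are.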
Here is a proposed BNF for recognizing IRIs in plain text. This BNF uses a Perl-style syntax:
bidi_IRI := ((scheme “://” domain) | domainWithTLD)
(“/” path)?
(“?” query)?
(“#” fragment)?
domain := label (IDNSep label)* IDNSep?
domainWithTLD := label (IDNSep label)* IDNSep TLD IDNSep?
label := UTS46Chars +
IDNSep := [\u002E \uFF0E \u3002\uFF61] // see http://unicode.org/reports/tr46/#Notation
TLD := <list on http://www.iana.org/domains/root/db/>
path := (char - “?” - “#”)*
query := (char - “#”)*
fragment := char*
char := percentEncodedUTF8
| [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:] inclusionChar - exclusionChar]
inclusionChar :=
U+0021 ( ! ) EXCLAMATION MARK
U+0022 ( " ) QUOTATION MARK
U+0023 ( # ) NUMBER SIGN
U+0025 ( % ) PERCENT SIGN
U+0026 ( & ) AMPERSAND
U+0027 ( ' ) APOSTROPHE
U+002A ( * ) ASTERISK
U+002C ( , ) COMMA
U+002E ( . ) FULL STOP
U+002F ( / ) SOLIDUS
U+003A ( : ) COLON
U+003B ( ; ) SEMICOLON
U+003F ( ? ) QUESTION MARK
U+0040 ( @ ) COMMERCIAL AT
U+005C ( \ ) REVERSE SOLIDUS
U+00A1 ( ¡ ) INVERTED EXCLAMATION MARK
U+00B7 ( · ) MIDDLE DOT
U+00BF ( ¿ ) INVERTED QUESTION MARK
exclusionChar :=
U+003C ( < ) LESS-THAN SIGN
U+003E ( > ) GREATER-THAN SIGN
There is one extra condition:
When parsing a bidi_IRI, an inclusionChar (such as “!”) is treated specially. If it is followed by a char, then it is included in the bidi_IRI. If not, it is excluded. For example, “abc.com/foo.bar” would be parsed completely as a bidi_IRI, but “abc.com/foo. other text” would stop before the period.
It is possible to capture this extra condition in the BNF, but it would make the formulation far less readable.
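Outside the BNF, the extra condition is straightforward to apply procedurally. The sketch below is one possible reading, under stated assumptions: it scans a maximal run of `char` and then drops every trailing inclusionChar, and it treats the whole candidate as a run of `char` rather than parsing the domain, path, query, and fragment separately. The names `is_char` and `scan_chars` are hypothetical. Percent escapes need no special case here, since "%" is an inclusionChar and hex digits are letters or numbers.

```python
import unicodedata

# inclusionChar and exclusionChar, as listed in the BNF.
INCLUSION = set("!\"#%&'*,./:;?@\\\u00A1\u00B7\u00BF")
EXCLUSION = set("<>")

def is_char(ch):
    """True if ch belongs to the BNF's `char` class."""
    if ch in EXCLUSION:
        return False
    if ch in INCLUSION:
        return True
    category = unicodedata.category(ch)
    # L, N, M, S are major classes; Pd, Pc, Cf are specific categories.
    return category[0] in "LNMS" or category in ("Pd", "Pc", "Cf")

def scan_chars(text, start=0):
    """Return the run of chars beginning at start, minus trailing inclusionChars."""
    end = start
    while end < len(text) and is_char(text[end]):
        end += 1
    # Extra condition: an inclusionChar not followed by a char is excluded.
    while end > start and text[end - 1] in INCLUSION:
        end -= 1
    return text[start:end]
```

Note that "<" and ">" have general category Sm, so without the explicit exclusion they would be accepted as symbols.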
To compare this simplified UBA syntax for a bidi_IRI with the full standard IRI syntax, see: http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html
In summary, encountering any of the following classes of characters would cause parsing of a bidi_IRI in plain text to continue:
Encountering any of the following classes of characters would cause parsing of a bidi_IRI in plain text to terminate:
Many of these characters can be included in an IRI, but they would need to be % encoded for the specialized bidi display to kick in.
For ASCII and Latin1, the list of terminating punctuation consists of:
U+003C < LESS-THAN SIGN
U+003E > GREATER-THAN SIGN
U+0028 ( LEFT PARENTHESIS
U+0029 ) RIGHT PARENTHESIS
U+005B [ LEFT SQUARE BRACKET
U+005D ] RIGHT SQUARE BRACKET
U+007B { LEFT CURLY BRACKET
U+007D } RIGHT CURLY BRACKET
U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
The proposed UBA extension would display any recognized bidi_IRI in the following way. The goal of this display is to ensure that the fields of a bidi_IRI occur in a predictable, uniform order, regardless of whether each field has RTL or LTR characters.
Given a bidi_IRI:
A separator is defined as any instance of the quoted strings in the bidi_IRI BNF:
A field is defined as any text between separators, or at the front or end.
Each bidi_IRI is displayed with fields from left to right. Thus the examples shown above in Table 2 will instead appear with a predictable order of their fields. The order is the same whether the IRI is in an RTL or an LTR environment (paragraph).
Table 3. Constant order
Environment | Display |
LTR, RTL | http://ab.cd.com/mn/op http://ab.cd.FE.HG.com/JI/LK/mn/op |
Note that just the ordering of the fields is changed. The ordering of the characters within each field is unaffected; the rules for bidirectional display specified in the UBA still apply. For example, the field with memory order EF still displays from right to left, as FE.
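The constant-order display can be imitated with a toy model that follows the convention of Table 1, where uppercase stands for right-to-left characters. The sketch makes two assumptions: the separator set is the quoted strings of the BNF plus "." between labels (inferred from Table 3), and a field is reversed as a whole when it is all uppercase, as a stand-in for running the UBA inside the field. The name `display_constant_order` is hypothetical.

```python
import re

# Separators: "://", "/", "?", "#", plus "." between labels (an assumption).
SEPARATOR = re.compile(r"(://|[/?#.])")

def display_constant_order(iri):
    """Keep fields in memory order; reverse only the characters of RTL fields."""
    parts = SEPARATOR.split(iri)  # the capture group keeps the separators
    shown = []
    for part in parts:
        if part.isupper():  # uppercase stands in for an RTL field
            part = part[::-1]
        shown.append(part)
    return "".join(shown)
```

Applied to the second sample IRI of Table 1, this reproduces the display in Table 3: the fields stay in place and only EF, GH, IJ, and KL are reversed internally.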
An implementation of the UBA extension can accomplish the display of Table 3 by behaving as if:
If the field ordering of IRIs were consistently “big-endian”, it would be useful to have their display ordering depend on the direction of the paragraph. However, IRIs are not consistently big-endian; the most important part, the domain, has its fields organized in little-endian order. For example, http://www12.sap.com/uk/about would be http://com.sap.www12/uk/about if it were in big-endian order.
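The big-endian rewriting in this example amounts to reversing the labels of the domain while leaving the rest of the IRI alone. A minimal sketch using Python's standard `urllib.parse` follows; `to_big_endian` is a hypothetical name, and ports and userinfo are ignored for simplicity.

```python
from urllib.parse import urlsplit

def to_big_endian(iri):
    """Reverse the order of the domain labels; leave scheme and path untouched."""
    parts = urlsplit(iri)
    labels = parts.netloc.split(".")
    reordered = ".".join(reversed(labels))
    return parts._replace(netloc=reordered).geturl()
```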
Because the ordering of fields in an IRI is already inconsistent, this proposal is to have a consistent ordering always, no matter what the bidi environment is (RTL vs LTR).
Alternatively, the ordering of fields could be subject to the environment (whether the current embedding level is RTL or LTR). Thus the examples shown above in Table 2 would instead appear as in Table 4:
Table 4. Environment order
Environment | Display |
LTR | http://ab.cd.com/mn/op http://ab.cd.FE.HG.com/JI/LK/mn/op |
RTL | op/mn/com.cd.ab//:http |
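The environment-dependent alternative can be modeled in the same toy fashion, with uppercase standing for right-to-left characters. The sketch assumes that in an RTL paragraph the sequence of fields and separators is reversed and that each separator string is itself mirrored (so "://" appears as "//:"), which is what the RTL row of Table 4 suggests. The separator set, including "." between labels, and the name `display_environment_order` are likewise assumptions.

```python
import re

# Separators: "://", "/", "?", "#", plus "." between labels (an assumption).
SEPARATOR = re.compile(r"(://|[/?#.])")

def display_environment_order(iri, rtl_paragraph):
    """Order fields left to right in LTR paragraphs, right to left in RTL ones."""
    parts = SEPARATOR.split(iri)
    # Characters inside an RTL (uppercase) field always run right to left.
    shown = [p[::-1] if p.isupper() else p for p in parts]
    if rtl_paragraph:
        # Reverse the field order and mirror each separator string.
        shown = [p[::-1] if SEPARATOR.fullmatch(p) else p for p in reversed(shown)]
    return "".join(shown)
```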
A third option would be to have the ordering depend not on the environment, but on whether the IRI contains any RTL characters. Thus the examples shown above in Table 2 would instead appear as in Table 5:
Table 5. Content order
Environment | Display |
LTR, RTL | http://ab.cd.com/mn/op |
The Unicode Technical Committee would appreciate feedback on the pros and cons of these various options.