Revision of UBA for improved display of URL/IRIs

Contents

Introduction

Recognizing IRIs in plain text

Open issues

Termination and Continuation

Proposed extension of UBA for bidi_IRIs

Open Issues

Introduction

The Unicode Bidirectional Algorithm (UBA), specified in Unicode Standard Annex #9, was designed for handling ordinary text, and predated the rise of the web. Unfortunately, IRI/URLs* are not ordinary text; they are syntactically complex in ways that don’t work well with the UBA. That causes IRIs that contain right-to-left text (such as Arabic or Hebrew) to appear jumbled, to the point where the IRIs are either uninterpretable, misleading, or ambiguous. In particular the ambiguous displays could cause security problems.

*Formally speaking, what we are talking about are IRIs, although most end-users know them as URLs. For more information, see idn-and-iri.

For example, consider the IRIs in Table 1. These sample IRIs are shown in “memory” order first; uppercase bold represents right-to-left characters.

Table 1. Sample IRIs, Memory order

IRIs	Comments on fields
http://ab.cd.com/mn/op	All LTR characters
http://ab.cd.EF.GH.com/IJ/KL/mn/op	Mixture of RTL and LTR fields
http://EF.GH/IJ/KL	All RTL characters (except the scheme “http”)

They would be displayed as in Table 2 by any Unicode-compliant implementation.

Table 2. Sample IRIs, Display order

Environment	Display	Details
LTR	http://ab.cd.com/mn/op	http://goo.gl/yvNFq
LTR	http://ab.cd.HG.FE.com/LK/JI/mn/op	http://goo.gl/yvNFq
LTR	http://LK/JI/HG.FE	http://goo.gl/yvNFq
RTL	http://ab.cd.com/mn/op	http://goo.gl/DuELB
RTL	mn/op/LK/JI/com.HG.FE.http://ab.cd	http://goo.gl/DuELB
RTL	LK/JI/HG.FE//:http	http://goo.gl/DuELB

People have been looking for an extension to the Unicode Bidirectional Algorithm (UBA) that handles IRIs in a more consistent way for bidi users. The general goal of such an extension would be for the “fields” of an IRI to flow in a consistent direction.

Deployment of such an extension requires consistency in usage across different applications. For example, when someone copies the contents of an address bar into an email, we don’t want all the fields in the IRI to switch around. Attaining such consistency would require a general extension to the UBA to indicate how IRIs should be displayed, both in the limited context of an address bar, and in plain text.

There are challenges for developing such an extension.

There will be a long migration period as implementations are updated from the current version of UBA to a revised version. During that period, people will be using a mixture of old and new applications. It is crucial to minimize the negative effects for interoperability of different displays of IRIs during this transition period.
It is necessary to have a simple, comprehensible way to recognize IRIs in plain text to support correct display in text, and to avoid jumbled, ambiguous IRIs.

Recognizing IRIs in plain text

There are two problems:

A formal reading of the IRI spec allows almost anything in fields, so it is hard to test for termination.
The scheme for an IRI (such as “http://”) is commonly omitted when IRIs are represented in text. For example, the string “adobe.com” should be recognized as being an IRI when it occurs in the body of an email message.

The first issue can be addressed with the following approach. While in theory, almost any Unicode character can occur in fields in an IRI, in practice many characters have very restricted usage in IRIs. One can take advantage of this pattern by defining a restricted and simplified syntax for IRIs, which captures the majority of actual practice. This syntax can then be used to define display of IRIs for UBA. IRIs that need to use characters outside of this restricted syntax can still be appropriately displayed by the UBA by representing those characters with % escapes.
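As a rough illustration of the % escape fallback, the Python sketch below percent-encodes (as UTF-8) only those characters that would otherwise end recognition of the IRI. The function name and the TERMINATORS set are hypothetical; the set is a small sample for illustration, not the full list of terminating characters.

```python
from urllib.parse import quote

# Assumption: a small illustrative subset of characters that the simplified
# syntax would treat as terminators (brackets, angle quotes, space).
TERMINATORS = set("<>()[]{}\u00AB\u00BB ")


def escape_terminators(field: str) -> str:
    """Percent-encode only the characters that would end IRI recognition."""
    return "".join(
        quote(ch, safe="") if ch in TERMINATORS else ch
        for ch in field
    )


if __name__ == "__main__":
    print(escape_terminators("a(b)"))
    print(escape_terminators("x \u00ABy\u00BB"))
```

All other characters, including RTL letters, are left untouched so that they remain readable in the displayed IRI.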

For the second issue, a common technique can be applied. The technique is to recognize a list of TLDs in context. For example, microsoft.com and google.de can both be recognized. (See http://www.iana.org/domains/root/db/).
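A minimal Python sketch of the TLD technique follows. It assumes a tiny stand-in set (SAMPLE_TLDS) in place of the real IANA list, and it approximates a label as any run of characters other than whitespace, the IDN separators, and “/”, “?”, “#”. Both the names and the label approximation are illustrative assumptions, not part of the proposal.

```python
import re

# Assumption: a tiny stand-in for the IANA root zone database.
SAMPLE_TLDS = {"com", "de", "org", "net"}

# The four separator characters that the proposal lists for IDNSep.
IDN_SEP = "\u002E\uFF0E\u3002\uFF61"

# Two or more "labels" joined by IDN separators (a permissive approximation).
CANDIDATE = re.compile(
    rf"[^\s{IDN_SEP}/?#]+(?:[{IDN_SEP}][^\s{IDN_SEP}/?#]+)+"
)


def find_schemeless_domains(text):
    """Return dotted names in `text` whose last label is a known TLD."""
    hits = []
    for match in CANDIDATE.finditer(text):
        last_label = re.split(f"[{IDN_SEP}]", match.group())[-1]
        if last_label.lower() in SAMPLE_TLDS:
            hits.append(match.group())
    return hits


if __name__ == "__main__":
    print(find_schemeless_domains("see microsoft.com and google.de today"))
```

A sentence-final period is not swallowed, because a separator only counts when another label follows it.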

Here is a proposed BNF for recognizing IRIs in plain text. This BNF uses a Perl-style syntax:

bidi_IRI := ((scheme “://” domain) | domainWithTLD)
            (“/” path)?
            (“?” query)?
            (“#” fragment)?

domain := label (IDNSep label)* IDNSep?

domainWithTLD := label (IDNSep label)* IDNSep TLD IDNSep?

label := UTS46Chars+

IDNSep := [\u002E \uFF0E \u3002 \uFF61] // see http://unicode.org/reports/tr46/#Notation

TLD := <list on http://www.iana.org/domains/root/db/>

path := (char - “?” - “#”)*

query := (char - “#”)*

fragment := char*

char := percentEncodedUTF8
      | [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:] inclusionChar - exclusionChar]

inclusionChar :=

U+0021 ( ! ) EXCLAMATION MARK

U+0022 ( " ) QUOTATION MARK

U+0023 ( # ) NUMBER SIGN

U+0025 ( % ) PERCENT SIGN

U+0026 ( & ) AMPERSAND

U+0027 ( ' ) APOSTROPHE

U+002A ( * ) ASTERISK

U+002C ( , ) COMMA

U+002E ( . ) FULL STOP

U+002F ( / ) SOLIDUS

U+003A ( : ) COLON

U+003B ( ; ) SEMICOLON

U+003F ( ? ) QUESTION MARK

U+0040 ( @ ) COMMERCIAL AT

U+005C ( \ ) REVERSE SOLIDUS

U+00A1 ( ¡ ) INVERTED EXCLAMATION MARK

U+00B7 ( · ) MIDDLE DOT

U+00BF ( ¿ ) INVERTED QUESTION MARK

exclusionChar :=

U+003C ( < ) LESS-THAN SIGN

U+003E ( > ) GREATER-THAN SIGN

There is one extra condition:

When parsing a bidi_IRI, an inclusionChar (such as “!”) is treated specially. If it is followed by a char, then it is included in the bidi_IRI. If not, it is excluded. For example, “abc.com/foo.bar” would be parsed completely as a bidi_IRI, but “abc.com/foo. other text” would stop before the period.

It is possible to capture this extra condition in the BNF, but it would make the formulation far less readable.
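Outside the BNF, the extra condition can be applied as a post-processing step. The Python sketch below assumes that a candidate string has already been scanned up to the first terminating character, and then drops any trailing run of inclusionChars. Treating a whole trailing run this way (rather than only the last character) is an interpretation, and the function name is hypothetical.

```python
# The inclusionChar list from the proposal (ASCII and Latin-1 punctuation).
INCLUSION_CHARS = set("!\"#%&'*,./:;?@\\\u00A1\u00B7\u00BF")


def trim_trailing_inclusion_chars(candidate: str) -> str:
    """Drop inclusionChars at the end, since no char follows them."""
    end = len(candidate)
    while end > 0 and candidate[end - 1] in INCLUSION_CHARS:
        end -= 1
    return candidate[:end]


if __name__ == "__main__":
    # Take the text up to the first space as the scanned candidate.
    scanned = "abc.com/foo. other text".split(" ", 1)[0]
    print(trim_trailing_inclusion_chars(scanned))
    print(trim_trailing_inclusion_chars("abc.com/foo.bar"))
```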

Open issues

This proposed BNF needs to be fleshed out with some other IRI features such as userinfo, port, and IP.
We’d like any other feedback as to characters (or categories of characters) to include or exclude.
In particular, the inclusionChars are from ASCII/Latin1. Should any of the ASCII/Latin1 exceptions be changed to terminating characters instead?

To compare this simplified UBA syntax for a bidi_IRI with the full standard IRI syntax, see: http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html

Termination and Continuation

In summary, encountering any of the following classes of characters would cause parsing of a bidi_IRI in plain text to continue:

Letters, Marks, Numbers
Dash and connector punctuation
Symbols

Encountering any of the following classes of characters would cause parsing of a bidi_IRI in plain text to terminate:

Unassigned, surrogates, private-use, control codes
Whitespace
Opening, closing or most ‘other’ punctuation, plus special cases < and >.

Many of these characters can be included in an IRI, but they would need to be % encoded for the specialized bidi display to kick in.

For ASCII and Latin1, the list of terminating punctuation consists of:

U+003C < LESS-THAN SIGN

U+003E > GREATER-THAN SIGN

U+0028 ( LEFT PARENTHESIS

U+0029 ) RIGHT PARENTHESIS

U+005B [ LEFT SQUARE BRACKET

U+005D ] RIGHT SQUARE BRACKET

U+007B { LEFT CURLY BRACKET

U+007D } RIGHT CURLY BRACKET

U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK

U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
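The continue/terminate decision for a single character can be sketched in Python with the standard unicodedata module, which reports the General_Category of a code point. The function name is hypothetical, and the result depends on the Unicode version shipped with the Python runtime.

```python
import unicodedata

# inclusionChar and exclusionChar as listed in the proposal.
INCLUSION_CHARS = set("!\"#%&'*,./:;?@\\\u00A1\u00B7\u00BF")
EXCLUSION_CHARS = set("<>")


def continues_bidi_iri(ch: str) -> bool:
    """True if `ch` lets parsing of a bidi_IRI continue, False if it ends it."""
    if ch in EXCLUSION_CHARS:      # < and > are symbols, but special-cased
        return False
    if ch in INCLUSION_CHARS:      # 'other' punctuation that is allowed
        return True
    category = unicodedata.category(ch)
    # Letters, Numbers, Marks, Symbols, plus dash/connector punctuation
    # and format characters, as in the `char` production.
    return category[0] in "LNMS" or category in ("Pd", "Pc", "Cf")


if __name__ == "__main__":
    for ch in "a-_=(< \u00AB\u05D0":
        print(repr(ch), unicodedata.category(ch), continues_bidi_iri(ch))
```

This check covers individual characters only; the rule about trailing inclusionChars still has to be applied to the scanned run as a whole.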

Proposed extension of UBA for bidi_IRIs

The proposed UBA extension would display any recognized bidi_IRI in the following way. The goal of this display is to ensure that the fields of bidi_IRI occur in a predictable, uniform order, regardless of whether each field has RTL or LTR characters.

Given a bidi_IRI:

A separator is defined as any instance of the quoted strings in the bidi_IRI BNF:

right after a scheme: “://”
in a domain: IDNSep
right after a domain: “/” , “?”, “#”
in a path: “/”
right after a path: “?”, “#”
in a query: “=” or “&”
right after a query: “#”

A field is defined as any text between separators, or at the front or end.
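The separator and field definitions above can be sketched as a small context-sensitive splitter in Python. This is an illustration under simplifying assumptions: the scheme is matched with a conventional ASCII pattern, userinfo and port are ignored (the proposal lists them as open issues), and the function name and (kind, text) token shape are hypothetical.

```python
import re

IDN_SEPS = "\u002E\uFF0E\u3002\uFF61"
SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*)://")


def split_fields(iri):
    """Return (kind, text) pairs in memory order; kind is 'field' or 'sep'."""
    tokens = []

    def emit(text, seps):
        # Split on any single character in `seps`, keeping the separators.
        for piece in re.split(f"([{re.escape(seps)}])", text):
            if piece:
                is_sep = len(piece) == 1 and piece in seps
                tokens.append(("sep" if is_sep else "field", piece))

    rest = iri
    match = SCHEME.match(iri)
    if match:
        tokens += [("field", match.group(1)), ("sep", "://")]
        rest = iri[match.end():]

    # Peel off the fragment, then the query, then the path.
    rest, hash_sep, fragment = rest.partition("#")
    rest, query_sep, query = rest.partition("?")
    domain, path_sep, path = rest.partition("/")

    emit(domain, IDN_SEPS)          # in a domain: IDNSep
    if path_sep:
        tokens.append(("sep", "/"))
        emit(path, "/")             # in a path: "/"
    if query_sep:
        tokens.append(("sep", "?"))
        emit(query, "=&")           # in a query: "=" or "&"
    if hash_sep:
        tokens.append(("sep", "#"))
        if fragment:
            tokens.append(("field", fragment))
    return tokens


if __name__ == "__main__":
    for kind, text in split_fields("http://ab.cd.com/mn/op?x=1&y=2#top"):
        print(kind, text)
```

Note that a full stop is a separator only inside the domain, so “foo.bar” in a path stays a single field.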

Each bidi_IRI is displayed with fields from left to right. Thus the examples shown above in Table 2 will instead appear with a predictable order of their fields. The order is the same whether the IRI is in an RTL or an LTR environment (paragraph).

Table 3. Constant order

Environment	Display
LTR, RTL	http://ab.cd.com/mn/op
LTR, RTL	http://ab.cd.FE.HG.com/JI/LK/mn/op
LTR, RTL	http://FE.HG/JI/LK

Note that just the ordering of the fields is changed. The ordering of the characters within each field is unaffected; the rules for bidirectional display specified in the UBA still apply. For example, the field with memory order EF still displays from right to left, as FE.

An implementation of the UBA extension can accomplish the display of Table 3 by behaving as if:

the entire bidi_IRI is embedded in an <LRE>...<PDF> sequence, and
each field in that bidi_IRI is surrounded by LRMs.
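The “as if” behavior can be made concrete by actually inserting the formatting characters into a string, as in the Python sketch below. It assumes the IRI has already been split into fields and separators, given as (text, is_field) pairs. The function name and token shape are hypothetical, and a real implementation would be expected to adjust its bidi processing rather than modify the stored text.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def wrap_for_display(tokens):
    """Embed the whole IRI in LRE...PDF and surround each field with LRMs.

    `tokens` is a sequence of (text, is_field) pairs in memory order.
    """
    parts = [LRE]
    for text, is_field in tokens:
        parts.append(LRM + text + LRM if is_field else text)
    parts.append(PDF)
    return "".join(parts)


if __name__ == "__main__":
    tokens = [("http", True), ("://", False), ("\u05D0\u05D1", True)]
    # Print code points so the invisible characters can be inspected.
    print(" ".join(f"U+{ord(c):04X}" for c in wrap_for_display(tokens)))
```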

Open Issues

If the field ordering of IRIs were consistently “big-endian”, it would be useful to have their display ordering depend on the direction of the paragraph. However, IRIs are not consistently big-endian; the most important part, the domain, has its fields organized in little-endian order. For example, http://www12.sap.com/uk/about would be http://com.sap.www12/uk/about if it were in big-endian order.

Because the ordering of fields in an IRI is already inconsistent, this proposal is to have a consistent ordering always, no matter what the bidi environment is (RTL vs LTR).

Alternatively, the ordering of fields could be subject to the environment (whether the current embedding level is RTL or LTR). Thus the examples shown above in Table 2 will instead appear as in Table 4:

Table 4. Environment order

Environment	Display
LTR	http://ab.cd.com/mn/op
LTR	http://ab.cd.FE.HG.com/JI/LK/mn/op
LTR	http://FE.HG/JI/LK
RTL	op/mn/com.cd.ab//:http
RTL	op/mn/LK/JI/com.HG.FE.cd.ab//:http
RTL	LK/JI/HG.FE//:http

A third option would be to have the ordering depend not on the environment, but on whether the IRI contains any RTL characters. Thus the examples shown above in Table 2 would instead appear as in Table 5:

Table 5. Content order

Environment	Display
LTR, RTL	http://ab.cd.com/mn/op
LTR, RTL	op/mn/LK/JI/com.HG.FE.cd.ab//:http
LTR, RTL	LK/JI/HG.FE//:http
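To compare the three options, the Python sketch below simulates the displays of Tables 3, 4, and 5 using the document's convention that uppercase letters stand for RTL characters. It is a toy model, not an implementation of the UBA: the splitter is a simple regular expression that is adequate only for the sample IRIs, and the function names and mode strings are hypothetical.

```python
import re

# Simplified separators; sufficient for the sample IRIs only.
SEPARATORS = re.compile(r"(://|[./?#=&])")


def is_rtl(field: str) -> bool:
    """Uppercase stands in for RTL characters, as in the sample tables."""
    return field.isupper()


def display(iri: str, mode: str = "constant", env: str = "LTR") -> str:
    tokens = [t for t in SEPARATORS.split(iri) if t]
    if mode == "constant":        # Table 3: fields always run left to right
        rtl_order = False
    elif mode == "environment":   # Table 4: field order follows the paragraph
        rtl_order = env == "RTL"
    elif mode == "content":       # Table 5: RTL order if any field is RTL
        rtl_order = any(is_rtl(t) for t in tokens)
    else:
        raise ValueError(f"unknown mode: {mode}")

    shown = []
    for tok in tokens:
        if SEPARATORS.fullmatch(tok):
            # Separators follow the overall direction of the field order.
            shown.append(tok[::-1] if rtl_order else tok)
        else:
            # Characters inside an RTL field always display right to left.
            shown.append(tok[::-1] if is_rtl(tok) else tok)
    if rtl_order:
        shown.reverse()
    return "".join(shown)


if __name__ == "__main__":
    sample = "http://ab.cd.EF.GH.com/IJ/KL/mn/op"
    print(display(sample, "constant"))
    print(display(sample, "environment", env="RTL"))
    print(display(sample, "content"))
```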

The Unicode Technical Committee would appreciate feedback on the pros and cons of these various options.