From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org] On Behalf Of Matitiahu Allouche
Sent: Wednesday, February 04, 2009 9:22 AM
To: Lisa Moore
Cc: bidi@unicode.org; unicore@unicode.org; x3l2@unicode.org
Subject: Re: UTC agenda - updated IDNA criterion for right-to-left scripts
Here are the comments that I submitted to the authors of the document
(which can be accessed at
http://www.ietf.org/internet-drafts/draft-ietf-idnabis-bidi-03.txt).
This document is on the UTC agenda for this
week's meeting. It is L2/09-046 in the L2 document register.
<start of comments>
My attention was recently drawn to the subject document (version 03), and
I have some comments. Some of them are very minor (typos, editorial) and
reflect my pedantic mind, but I thought that I might as well help improve
the form of the document. Other comments touch more on the essence, and I
would appreciate their being considered seriously.
1) In section
2, first paragraph, "satisifes" should be "satisfies".
2) Section 2,
rule 1 mentions the "Character Grouping requirement" for the first
time in the document. Either there should be a forward reference to
section 3 where it will be explained, or (better, in my opinion), the content
of the current section 3 should precede the content of the current section 2.
3) In the
sentence "ET is excluded because the string L ET does not satisfy the Character
Grouping requirement.", "L" seems to represent a label, but can
easily be confused with the L Bidi property (all the more since it is adjacent
to ET which surely represents a character with the ET Bidi property).
4) In the
sentence "CS is excluded because the string L CS does not satisfy the Character
Grouping requirement.", "L" seems to represent a label, but can
easily be confused with the L Bidi property (all the more since it is adjacent
to CS which surely represents a character with the CS Bidi property).
5) I see no reason why CS is excluded while ES is allowed. Both can be
the source of the same kind of violation of the Character Grouping
requirement. ES characters are excluded from the first and last positions
by rules 2 and 3. With the same restrictions (exclusion from the first and
last positions), CS and ET characters can be allowed and will not violate
the Character Grouping requirement any more than ES characters.
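To make the three classes discussed in comments 3 to 5 concrete, here is a small Python sketch that looks up the Bidi class of a few common ASCII separators with the standard `unicodedata` module. The helper name `bidi_classes` and the choice of sample characters are illustrative assumptions, not taken from the draft.

```python
import unicodedata


def bidi_classes(chars):
    """Map each character to its Unicode Bidi class (hypothetical helper)."""
    return {ch: unicodedata.bidirectional(ch) for ch in chars}


# Sample separators and terminators, chosen only for illustration.
for ch, cls in bidi_classes("-+.,:/#$%").items():
    print(repr(ch), cls)
```

With the character database shipped in current Python versions, the hyphen and plus sign are reported as ES, the full stop, comma, colon and solidus as CS, and the number, dollar and percent signs as ET.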
6) In section
1.1, there appears the following statement: "This specification
is not intended to place any requirements on domain names that do not contain
right-to-left characters."
Also the title
of section 2 is "A replacement for the RFC 3454 BIDI rule" which implies
that the text only deals with "Bidi" labels.
If that means
that the specification applies only to labels which contain at least one
character with Bidi property R, AL or AN, and we combine that with rule 4
"If an R, AL or AN is present, no L may be present.", then an L
character can never be part of a Bidi label, and the L should be removed from
the list of allowed Bidi properties in rule 1.
7) In [UAX9],
rule X9 says that BN characters must be removed from the displayed text.
Any such invisible character violates the Label Uniqueness requirement.
BN characters must not be allowed by rule 1.
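The point of comment 7 can be sketched in a few lines of Python. The helper `strip_bn` is a hypothetical name; it only models the removal of BN characters that the comment attributes to rule X9, and the two sample labels (Hebrew alef and bet, with and without U+200D ZERO WIDTH JOINER) are chosen purely for illustration.

```python
import unicodedata


def strip_bn(text):
    """Drop every character whose Bidi class is BN (hypothetical helper)."""
    return "".join(ch for ch in text if unicodedata.bidirectional(ch) != "BN")


plain = "\u05d0\u05d1"           # alef, bet
with_bn = "\u05d0\u200d\u05d1"   # alef, ZERO WIDTH JOINER (class BN), bet

# The two labels differ in network order...
print(plain != with_bn)
# ...but become identical once BN characters are removed.
print(strip_bn(with_bn) == plain)
```

Two labels that differ only in such characters are distinct on the wire yet indistinguishable after the removal, which is the Label Uniqueness problem described above.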
8) From rules 1, 2, 4, 6 and 7, plus comments 6 and 7 above, it follows
that the first character of a Bidi label can only be of type R or AL.
Such a statement can advantageously replace rules 2, 6 and 7.
9) Rule 5
includes no justification. While a mixture of AN and EN characters in the
same label seems odd and not required in real life situations, it is not clear
what requirement would be violated by such a combination.
10) The rules allow AN or EN digits to appear in the last position of a
label (contrary to RFC 3454). Let us consider the following examples
(where lower case letters represent L characters and upper case letters
represent R or AL characters):
a. network order       = "ABC123.456xyz"
   display order (LTR) = "123.456CBAxyz"
   display order (RTL) = "123.456xyzCBA"
b. network order       = "ABC.456-xyz"
   display order (LTR) = "456.CBA-xyz"
   display order (RTL) = "xyz-456.CBA"
c. network order       = "ABC123.456.xyz"
   display order (LTR) = "123.456CBA.xyz"
   display order (RTL) = "xyz.123.456CBA"
d. network order       = "ABC.456.xyz"
   display order (LTR) = "456.CBA.xyz"
   display order (RTL) = "xyz.456.CBA"
Examples a, b
and c show very ugly violations of the Character Grouping requirement.
Since the document does not place requirements on non-Bidi labels, any
non-Bidi label starting with digits following a Bidi label will cause a
Character Grouping violation. If Bidi labels are restricted from ending
with digits (optionally followed by NSMs), then non-Bidi labels which contain
only digits (example d) following a Bidi label will not cause a Character
Grouping violation.
Whether this
modest benefit justifies imposing such a restriction is subject to discussion.
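The display orders in examples a to d can be reproduced with a deliberately simplified Python sketch of the bidirectional algorithm. It is not a conforming [UAX9] implementation: it assumes a single paragraph with no explicit embeddings, uses the toy alphabet of the examples (lower case = L, upper case = R, ASCII digits = EN, "." = CS, "-" = ES), and applies only rules W4, W6, W7, N1, N2, I1, I2 and L2. The names `toy_class` and `display_order` are hypothetical.

```python
def toy_class(ch):
    """Bidi class under the toy convention of the examples (an assumption)."""
    if ch.isdigit():
        return "EN"
    if ch.isupper():
        return "R"
    if ch.islower():
        return "L"
    return {".": "CS", "-": "ES"}.get(ch, "ON")


def display_order(text, rtl=False):
    """Return the visual order of `text` in an LTR or RTL paragraph."""
    base_dir = "R" if rtl else "L"
    types = [toy_class(ch) for ch in text]
    n = len(types)

    # W4: a single separator between two European numbers becomes EN.
    for i in range(1, n - 1):
        if types[i] in ("ES", "CS") and types[i - 1] == types[i + 1] == "EN":
            types[i] = "EN"
    # W6: any remaining separator becomes a neutral.
    types = ["ON" if t in ("ES", "CS") else t for t in types]
    # W7: EN becomes L when the nearest preceding strong type is L.
    last_strong = base_dir
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # N1/N2: a run of neutrals takes the direction of its neighbours when
    # they agree (EN counts as R), otherwise the paragraph direction.
    def side(t):
        return "R" if t in ("R", "EN") else "L"

    i = 0
    while i < n:
        if types[i] != "ON":
            i += 1
            continue
        j = i
        while j < n and types[j] == "ON":
            j += 1
        before = side(types[i - 1]) if i > 0 else base_dir
        after = side(types[j]) if j < n else base_dir
        types[i:j] = [before if before == after else base_dir] * (j - i)
        i = j

    # I1/I2: resolve embedding levels from the resolved types.
    table = {"R": 1, "L": 2, "EN": 2} if rtl else {"L": 0, "R": 1, "EN": 2}
    levels = [table[t] for t in types]

    # L2: from the highest level down to level 1, reverse every maximal
    # run of characters at that level or higher.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


for name in ("ABC123.456xyz", "ABC.456-xyz", "ABC123.456.xyz", "ABC.456.xyz"):
    print(name, display_order(name), display_order(name, rtl=True))
```

Under these assumptions the sketch yields the same LTR and RTL display orders as listed in examples a to d. A real renderer must of course implement the full algorithm, including the AL, AN, ET and NSM classes omitted here.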
11) Towards the
end of section 2, there appears the following sentence: "In a domain name
consisting of only labels that pass the test, the requirements of Section 3 are
satisfied."
This is not true for domain names like those in the examples above,
unless non-Bidi labels are excluded, which is a very severe constraint.
12) The next
sentence says: "In a domain name consisting of only LDH-labels and
labels that pass the test, the requirements of Section 3 are satisfied as long
as a label that starts with an ASCII digit does not come after a right-to-left
label that ends in a digit."
This is not
true. See example b above.
13) In section
3, there appears the sentence: "the label "123-456" will have a
different display order in an RTL context than in a LTR context."
This is not true, IMHO. If the last letter before the label is not an
Arabic Letter, it will be displayed as "123-456" in both LTR and RTL
contexts. If it is an Arabic Letter, it will be displayed as "456-123".
14) In section
3, there appears the sentence: "The Label Uniqueness property should hold
true between LTR paragraphs and RTL paragraphs. This was shown to be
unsound."
In fact, in all
cases where Character Grouping and Label Uniqueness are satisfied for each
paragraph direction separately, there will be Label Uniqueness between LTR and
RTL paragraphs.
15) In section
3, since an "unproblematic label" can be a label which satisfies the
requirements, the clause "any label S1 and S2 that is either a label
satisfying the requirements or an unproblematic label" can be
shortened to "any label S1 and S2 that is an unproblematic label".
16) In the
formal statement of the Label Uniqueness requirement, there is no provision (or
exclusion) for the case where L and L' are identical.
17) In summary
I suggest that the rules in section 2 should be reformulated as below.
1. Only characters with the BIDI properties R, AL, AN, EN, ES,
   CS, ET, ON and NSM are allowed in RTL labels.
2. The first position must be a character with Bidi property R or AL.
3. The last position must be a character with Bidi property R or AL,
   followed by zero or more NSM.
3 variant. The last position must be a character with Bidi property R,
   AL, EN or AN, followed by zero or more NSM.
4 (debatable). If an EN is present, no AN may be present, and vice
   versa.
It can be seen that this formulation is quite close to that in RFC 3454,
while solving all the problems that this document aims to solve.
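The reformulated rules are simple enough to express as a short check. The following Python sketch is only an illustration of the proposal in comment 17, not text from the draft or from RFC 3454; the function name `check_rtl_label` and its two keyword parameters (which select the "3 variant" and toggle the debatable rule 4) are hypothetical.

```python
import unicodedata

# Rule 1: the Bidi classes allowed in an RTL label.
ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "NSM"}


def check_rtl_label(label, allow_final_digit=False, forbid_mixed_digits=True):
    """Return True if `label` satisfies the proposed rules (a sketch)."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False
    # Rule 1: only the allowed classes may appear.
    if any(c not in ALLOWED for c in classes):
        return False
    # Rule 2: the first character must be R or AL.
    if classes[0] not in ("R", "AL"):
        return False
    # Rule 3 (or its variant): skip trailing NSM, then test the last class.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    final_ok = {"R", "AL"} | ({"EN", "AN"} if allow_final_digit else set())
    if end == 0 or classes[end - 1] not in final_ok:
        return False
    # Rule 4 (debatable): EN and AN may not be mixed.
    if forbid_mixed_digits and "EN" in classes and "AN" in classes:
        return False
    return True


print(check_rtl_label("\u05d0\u05d1\u05d2"))                           # Hebrew letters only
print(check_rtl_label("\u05d0\u05d1\u05d21"))                          # ends in an ASCII digit
print(check_rtl_label("\u05d0\u05d1\u05d21", allow_final_digit=True))  # same label, rule 3 variant
```

Keeping the variant and the debatable rule as switches makes it easy to compare how many existing labels each choice would reject.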
<end of comments>
Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel