L2/05-200R

Comments on Public Review Issues (May 5, 2005 - August 9, 2005)

The sections below contain comments received on the open Public Review Issues as of August 9, 2005, since the previous cumulative document was issued prior to UTC #103 (May 2005).

Closed Issue: 64 Draft UTR #36: Security Considerations for the Implementation of Unicode and Related Technology

Date/Time: Mon Jun 27 16:17:44 CDT 2005
Contact: Douglas Davidson
Subject: Comments on draft UTR#36 section 2.10.3

I have read with interest the current draft of UTR#36, and I have some comments on section 2.10.3 (user agent recommendations).

For point B, keep in mind that many user agents have at least one mode of display ("view source" or the like) in which markup is expected to be displayed verbatim, and that this is often the mechanism that sophisticated users will use to examine the details of links. It is probably not in keeping with the nature of this mechanism to specially prepare or otherwise modify the representation of a URL when displayed in this way. Perhaps the recommendation should be that a prepared, highlighted version of the URL should be made readily available in a single well-understood location, not that it should always be used?

For point C, a set of preferences based on restriction levels probably is not practicably usable for any but the most technically competent users--such as those who are likely to be reading this report. An alternative mechanism, along the lines of the "don't ask me again" checkbox in your example alert, is probably more practicable. (However, your example checkbox does not clearly state whether "don't ask me again" applies to the given URL, the given domain, the entire restriction level, or maybe every other instance of this particular alert.)

For point D, the term "in-script confusable" is used, but appendix B does not use this term; appendix B describes "single-script", "mixed-script", and "whole-script" confusables. It is not clear what an "in-script confusable" is supposed to be.

For point D.2, you should keep in mind that DNS resolution is potentially an extremely expensive operation, possibly requiring more than a minute to return in the worst case. In the presence of malice, the worst case should be expected to occur, since the opponent can often arrange for it to occur. Furthermore, it is certainly possible in normal use for a given name to resolve to multiple IP addresses for perfectly innocent reasons--load balancing, for example.

Considering this, and considering the number of possible confusables that might exist for a given identifier, verifying that they all point to the same IP address seems unworkable, at least from an a priori viewpoint. I would suggest not recommending this until you see at least one working implementation.
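
The scale of the concern can be estimated with a rough sketch. The confusables table below is a tiny hypothetical sample, not the UTR #36 data, and the 60-second figure is simply the worst case mentioned above; `variant_count` and `worst_case_seconds` are illustrative names.

```python
from math import prod

# Hypothetical, deliberately tiny confusables table: each Latin letter maps to
# look-alike characters that could be substituted for it.
CONFUSABLES = {
    "a": ["\u0430"],            # CYRILLIC SMALL LETTER A
    "c": ["\u0441"],            # CYRILLIC SMALL LETTER ES
    "e": ["\u0435"],            # CYRILLIC SMALL LETTER IE
    "o": ["\u043e", "\u03bf"],  # CYRILLIC SMALL LETTER O, GREEK SMALL LETTER OMICRON
    "p": ["\u0440"],            # CYRILLIC SMALL LETTER ER
    "x": ["\u0445"],            # CYRILLIC SMALL LETTER HA
    "y": ["\u0443"],            # CYRILLIC SMALL LETTER U
}


def variant_count(label: str) -> int:
    """Number of distinct spellings of label, counting the original."""
    return prod(1 + len(CONFUSABLES.get(ch, [])) for ch in label)


def worst_case_seconds(label: str, seconds_per_lookup: float = 60.0) -> float:
    """Time to resolve every variant other than the original, one at a time."""
    return (variant_count(label) - 1) * seconds_per_lookup
```

With this sample table the five-letter label "caxap" already has 32 spellings, so 31 additional lookups; at one minute each that is 31 minutes for a single name.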

Date/Time: Mon Jun 27 16:41:03 CDT 2005
Contact: Peter Kirk
Subject: Review of draft TR36

Reviewing the list of proposed restricted characters:

U+04C0 CYRILLIC LETTER PALOCHKA is definitely required in IDNs and so should not be restricted, because (although it is caseless) it is an integral part of the alphabet and orthography of many languages of southern Russia.

Similarly U+02BC MODIFIER LETTER APOSTROPHE and possibly U+02EE MODIFIER LETTER DOUBLE APOSTROPHE are used as integral parts of certain Cyrillic alphabets in the former Soviet republics, and sometimes are used in replacement Latin alphabets. The former is especially important in (current Latin script) Uzbek, for it is used in the local form of the name of the country (Oʼzbekiston). These are generally caseless, but sometimes written with apparent case variants mostly in character height. In these languages these characters are not punctuation - although in practice they are sometimes replaced by U+0027 or other punctuation apostrophe variants.

The Syriac script should also not be restricted, because it is not just a "liturgical script" but the living script of significant communities in the Middle East and in the USA etc.

U+05BE HEBREW PUNCTUATION MAQAF, U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM should not be restricted. MAQAF is the Hebrew equivalent of a hyphen and so like hyphen should be available for IDNs. GERESH and GERSHAYIM, although in some senses punctuation, are used in the normal spelling of foreign loan words and words originating as acronyms. Restricting these from use in Hebrew IDNs would be an unwarranted restriction on using the orthographically correct form of Hebrew names on the Internet.

U+0269 LATIN SMALL LETTER IOTA should not be restricted just because it is obsoleted by IPA, because the presence of U+0196 LATIN CAPITAL LETTER IOTA (although marked ~IDNA) is an indication that this letter is used for writing of real languages and not just for obsolete IPA.
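
For readers who want to check the characters discussed in this comment, the following sketch prints their names and general categories using Python's standard unicodedata module. The `describe` helper is an illustrative name, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points mentioned in the comment above.
CODE_POINTS = [0x04C0, 0x02BC, 0x02EE, 0x05BE, 0x05F3, 0x05F4, 0x0269, 0x0196]


def describe(cp: int) -> tuple:
    """Return (U+XXXX label, general category, character name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(*describe(cp), sep="  ")
```

A restriction rule keyed only to general category would treat the Hebrew marks as punctuation, which is the outcome the comment argues against.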

Date: 2005-06-28 02:56:40 -0700
Contact: Dominikus Scherkl
Subject: RE: Notification re UTR #36, Security Issues

Hi.

In Appendix B of the UTR#36 draft, the example for the "MA confusables table" looks wrong to me, because capital Greek ny simply isn't confusable with small Latin v. Shouldn't it be "Capital Greek Ny" vs. "Capital Latin N"?

Best regards,
Dominikus Scherkl

Date/Time: Tue Jun 28 05:43:59 CDT 2005
Contact: Peter Kirk
Subject: Comments on text of UTR #36

Section 2.8.1 Case-Folded Format: Another situation where standard case folding may be undesirable is with Turkish and Azerbaijani I and dotted I. The standard case folding here will destroy the important distinction between these two characters. This problem at least deserves a mention in the text.
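
The effect can be reproduced with a minimal sketch. Python's `str.casefold` applies the default (non-Turkic) folding; `turkic_fold` is a hypothetical helper that first applies the two Turkic-specific mappings (the "T" entries in CaseFolding.txt) before folding.

```python
# Turkic-specific mappings, applied before the default folding:
# U+0049 I -> U+0131 dotless i, and U+0130 dotted capital I -> U+0069 i.
TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})


def default_fold(text: str) -> str:
    """Default Unicode case folding, as implemented by str.casefold."""
    return text.casefold()


def turkic_fold(text: str) -> str:
    """Case folding that keeps dotted and dotless i distinct."""
    return text.translate(TURKIC).casefold()


if __name__ == "__main__":
    dotless, dotted = "ILIK", "\u0130L\u0130K"
    print(default_fold(dotless) == "ilik")                # True: folded onto dotted i
    print(turkic_fold(dotless) == turkic_fold(dotted))    # False: distinction kept
```

Under the default folding, the uppercase of a word spelled with dotless i folds to the same string as a word spelled with dotted i; the Turkic variant keeps the two apart.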

Section 2.10.4 Registry Recommendations: I can see an issue here with a rogue "registrant". For example, the same "registrant", an unreliable agency, registers both caxap.com (Latin) and сахар.com (Cyrillic), but reassigns only one of these to its genuine client, and the other to a spoofer. Or an opportunist might acquire both versions to preempt the genuine company CAXAP and then sell the Latin script one to the genuine company, but keep hold of the Cyrillic version to use for spoofing. So perhaps the recommendations should be tighter, that the registry should treat the two IDNs as equivalent and always mapped to the same IP address.
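
One way a registry could treat such names as equivalent is to key registrations on a "skeleton" in which look-alike characters are mapped to a single representative. The sketch below is only an illustration of that idea: the mapping table is a small hypothetical sample, and the `Registry` class and its method names are assumptions, not part of any real registry system.

```python
# Hypothetical sample mapping from Cyrillic letters to Latin look-alikes.
# A real registry would need a complete, reviewed confusables table.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
})


def skeleton(label: str) -> str:
    """Fold case, then map each look-alike to its Latin representative."""
    return label.casefold().translate(LOOKALIKES)


class Registry:
    """Toy registry in which all labels sharing a skeleton belong to one owner."""

    def __init__(self) -> None:
        self._owner_by_skeleton = {}

    def register(self, label: str, owner: str) -> bool:
        """Accept the label unless its skeleton is already held by someone else."""
        current = self._owner_by_skeleton.setdefault(skeleton(label), owner)
        return current == owner
```

In this sketch, once "caxap" is registered to one owner, the Cyrillic look-alike can only be registered by that same owner, so the two names cannot end up in different hands.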

Appendix D: There is no mention of the significant issue that certain Latin characters are used as part of Cyrillic script in some languages. This applies to Latin Q and W in Cyrillic Kurdish, and possibly some other mixtures. The mixed script detection algorithm should be adapted such that otherwise Cyrillic identifiers including Q or W, but no other Latin characters, are considered valid Cyrillic script. Note that there is no visual spoofing problem with these characters.
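
The suggested adaptation might look like the following sketch. It is not the algorithm from the draft: script membership is approximated from the first word of each character name (Python's standard library has no script property), and `script_of`, `scripts`, and `is_mixed_script` are illustrative names.

```python
import unicodedata

# Latin letters that the comment says are used within Cyrillic Kurdish text.
CYRILLIC_EXTRAS = frozenset("QWqw")


def script_of(ch: str) -> str:
    """Rough script guess; non-letters are treated as script-neutral."""
    if not ch.isalpha():
        return "Common"
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    if first_word in ("LATIN", "CYRILLIC", "GREEK"):
        return first_word.capitalize()
    return "Other"


def scripts(label: str, allow_extras: bool = True) -> set:
    """Set of scripts used in label, with the optional Q/W exception."""
    found = {script_of(ch) for ch in label} - {"Common"}
    if allow_extras and found == {"Latin", "Cyrillic"}:
        rest = {script_of(ch) for ch in label if ch not in CYRILLIC_EXTRAS}
        rest -= {"Common"}
        if rest == {"Cyrillic"}:
            # Only Q/W were Latin, so the label counts as Cyrillic.
            return rest
    return found


def is_mixed_script(label: str, allow_extras: bool = True) -> bool:
    return len(scripts(label, allow_extras)) > 1
```

A label made only of Q and W is still reported as Latin, since the exception applies only when the remaining letters are all Cyrillic.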

Appendix E: What are "Rumsfeld problems"? Rumsfeld has many problems, especially in Iraq (or do you mean another Rumsfeld?), so this note needs some explanation and disambiguation.

Date/Time: Tue Jun 28 15:38:07 CDT 2005
Contact: Miikka-Markus Alhonen
Subject: Feedback on the draft of UTR #36 - Syriac

I noticed that in the 2005-06-28 version of http://unicode.org/draft/reports/tr36/data/draft-restrictions.txt, the Syriac script is listed in whole as disallowed with the justification "liturgical script". While it is true that the Syriac script is used in writing the liturgical Syriac language, it is also used in writing numerous present-day, living Syriac dialects. I have, for instance, now right in front of me a copy of the quarterly journal of the Assyrian American National Federation "Assyrian Star ܟܘܟܒܐ ܐܬܘܪܝܬ", which was published in 2004. The publication contains several pages of articles written in the Syriac script as well as a vocalised song, even though most of the journal is in English. Although the Syriac-speaking community is much smaller than, say, the Arabic one, it is still well alive and active even on the Internet. Very likely, most of the Syriac Internet users are in diaspora, i.e. residing in Sweden, other European countries or the USA, where technological possibilities are available to all.

For this reason I suggest that at least the basic letters 0710..072C be made unrestricted (these do contain a few Garshuni letters, the contemporaneous use of which I'm not entirely sure about). The use of vowel marks in Syriac is much more common than their Hebrew counterparts (which are restricted in this draft), although they are in no way obligatory. On the other hand, the Arabic vowels seem to be allowed, so I can not give you definite advice. It might be better to ask a true expert, namely one of the authors of the Syriac Unicode proposal, George A. Kiraz (AFAIK, his E-mail address is gak@bethmardutho.org). At least the vowel marks are not "liturgical", since they are used in contemporaneous writings, too. On the other hand, the Eastern marks (dots) are probably hard to distinguish in small font sizes, and the Western marks (Greek letter-like), are encoded separately above and below the letter, which might not be an obvious distinction for a Syriac speaker. From this technical point of view it might be better to disallow them for now. If you have any further questions, please feel free to contact me.

With kind regards,
Miikka-Markus Alhonen
MA in General Linguistics, studies including Syriac among other Semitic languages

Date/Time: Tue Jun 28 17:18:58 CDT 2005
Contact: Debbie Anderson (forwarding email)
Subject: Security -- feedback on UTR 36

(Forwarded)
Dear Deborah Anderson,

I have received your e-mail "Feedback needed on letters / symbols in Internationalized Domain Names" via Jeff Good. I had a look at the indicated website and saw clicks on the list of symbols to be excluded. I am quite worried about this fact, because these are not technical symbols, but used by Khoisan speech communities in practical orthographies which are also relevant for the internet. I would be grateful if you could communicate this to the people concerned.

Thanks and best wishes,

Tom Gueldemann
gueldema@rz.uni-leipzig.de

Date/Time: Fri Jul 1 06:40:35 CDT 2005
Contact: David Rowe
Subject: feedback on UTR #36 (rev 1.4) draft

See the comments that follow this extract of the draft document.

----

# Characters restricted in domain names
# $Revision: 1.4 $
# $Date: 2005/06/25 01:38:40 $
#
# This file contains a draft list of characters for use in
# UTR #36: Unicode Security Considerations

# - Characters listed as ~IDNA are excluded at this point in domain names,
# in many cases because the international domain name specification does not contain
# characters beyond Unicode 3.2. At this point in time, feedback on those characters
# is not relevant.

0269 ; obsoleted by IPA in 1989 # LATIN SMALL LETTER IOTA

0196 ; ~IDNA # LATIN CAPITAL LETTER IOTA

----

The LATIN SMALL LETTER IOTA is used in the orthographies of Kabiye and Tem (in the country of Togo), and in Foodo (Benin) and possibly in other languages.

At this point I do not have sample words to offer, although the language name "Kabiye" is spelled with an iota rather than an "i" when written in Kabiye.

Although comments on LATIN CAPITAL LETTER IOTA are not relevant at this time, I would like to point out that this upper case form is also used in these languages.

Thanks,
David Rowe
SIL Togo-Benin

Date/Time: Fri Jul 1 12:59:24 CDT 2005
Contact: Michael Everson
Subject: TR 36

Much in the draft TR 36 is very good, in terms of explanation of the problem and so on. But I STRONGLY urge caution in the publication of permitted and unpermitted characters. There is not consensus between UTC and IETF and ICANN on what the shape of IDN should be. I am NOT saying that it will take an eternity to achieve such consensus, but I AM saying that it isn't there yet. In a fortnight in Luxembourg ICANN is having a meeting where a large number of players in this arena will be meeting. Cary Karp from .museum, Michel Suignard, John Klensin, possibly James Seng, and I will be there. I urge the UTC not to publish a definitive UTR on this topic until consensus is achieved.

A specific fault in UTR 36 is that it is just a list of characters. For IDN to work, language-specific lists need to be coordinated with such a list of characters. This suggests that proper linguistic expertise may not have been applied in the drafting of the tables.

For instance, such lists exist for European languages. Such lists do not exist for many African languages.

A specific fault in http://www.unicode.org/draft/reports/tr36/data/review.txt is that it uses unexplained notations. What is "output"? What is "input-lenient"? Why are these terms used? What is "XID+"?

http://www.unicode.org/draft/reports/tr36/data/review.txt also STILL does not load characters in Safari.

Please, UTC, do not rush this. More haste less speed. The parties concerned with this matter include players other than the companies that make up the UTC. Without broader consensus, the UTR may not be accepted. But I agree that it is a good place to make the specification.

69 Proposed Update UAX #24, Script Names

Date/Time: Sun Jun 12 05:21:26 CDT 2005
Contact: charles@agenoria.fsnet.co.uk
Subject: Proposed Update UAX #24, Script Names

Is the iconic script indicator selected to represent the Latin script, "A", appropriate? Since this character shape also represents the first letter of the Greek and Cyrillic alphabets, users of those scripts may not immediately associate it with the Latin script.

Would one of the several upper case character shapes not shared with Greek and/or Cyrillic be better? "L", also representing the initial letter of "Latin", might perhaps be suitable, although that character shape does also crop up in Cherokee.

Date/Time: Fri Jul 29 15:58:37 CDT 2005
Contact: Daniel Yacob
Subject: Iconic Script Indicators for Ethiopic

Greetings,

I was just looking through TR24 (http://www.unicode.org/reports/tr24/tr24-8.html) and wanted to suggest an alternative for the Ethiopic script icon. I suggest that U+130D is preferable to U+12A0 (current).

The name of the script in both Ethiopia and Eritrea is "Ge'ez", spelt with 3 Ethiopic letters, the first of which is U+130D and has been used as an icon in software for some time. So, amongst this audience (where "ethiopic" as a term is virtually unknown), there is an association between the letter and the script as a whole.

U+12A0 lacks a similar association (it is not the first letter of the syllabary).

70 Proposed Draft UTS #37, Registration of Ideographic Variation Sequences

No feedback received this period.

71 Questions on Malayalam Digits

Date/Time: Tue Aug 2 23:04:50 CDT 2005
Contact: K.G.Sulochana
Subject: Question on Malayalam Digits

Information and evidence on the Malayalam Digit Zero, symbols for 10, 100, 1000 and fractions are available at http://www.malayalamresourcecentre.org/Mrc/symbols/malsymbols.html

The information is collected from some old publications and palm leaf manuscripts.

Note: The web page is posted as L2/05-164 in the document register.

72 Stability of the Bidi Mirrored Property

Date/Time: Tue Jun 21 06:28:30 CDT 2005
Contact: Matitiahu Allouche
Subject: Issue #72 Stability of the Bidi Mirrored Property

I am a user of Hebrew, in reading, writing and speaking, and also a programmer quite acquainted with the particularities of the Hebrew language in computer applications.

All the characters under discussion except 007E Tilde and 00AC Not Sign are not likely to have been used in any significant measure within a Bidi context, so I see no impediment to changing their mirroring property if appropriate.

Personally, I have never seen a need for a mirrored Tilde.

I see very restricted usefulness to a mirrored Not Sign.

Consequently, I favor option b) "Make the Bidi Mirrored property immutable after changing some of the values to make it more consistent".

Date/Time: Wed Jun 22 11:08:17 CDT 2005
Contact: Kent Karlsson
Subject: PRI 72 (mirroring)

(Since this is obviously based on a comment of mine...)

Either:

b) Make the Bidi Mirrored property immutable after changing some of the values to make it more consistent.

or better

c) Change some values to make the Bidi Mirrored property more consistent, and also continue to allow future changes.

I prefer c, since I would like to make all Sm (except for < and >, and negated < and >, since <> are "mis"used for brackets, and negated ones are decomposable to </> and a combining stroke) non-mirroring. I find it quite odd that arrows are not mirrored, but most other math symbols are. Either mirror them all (but you have excluded that) or mirror almost none of them. B.t.w., COMBINING LONG SOLIDUS OVERLAY is still not mirrored, even though most negated math operators are mirrored. If negated math operators are still to be mirrored, then also COMBINING LONG SOLIDUS OVERLAY should be mirrored (for NFC/NFD consistency), with COMBINING REVERSE SOLIDUS OVERLAY as the "closest character mirror"; this makes a problem for negated arrows, though. Conditional mirroring of the stroke overlay ([not]after arrow, or [not] after non-arrow Sm) seems problematic.

The easy part (from my point of view): Characters that come in open/close pairs (or almost do that) should be consistently mirrored.

Date/Time: Wed Jun 22 11:31:36 CDT 2005
Contact: Kent Karlsson
Subject: PRI 72 (mirroring), addendum

Regarding conditional mirroring of combining solidus (which would be problematic):

Note that the test cannot be on whether the base character is mirroring or not, since symmetric math symbols are (currently) not marked as mirroring (presumably due to their symmetry). However, they could be made mirroring "despite" their symmetry.

Note also that the "compatibility file" BidiMirroring.txt does not list many a "best fit", even though a "best fit" does exist, e.g.
        # 2260; NOT EQUAL TO
even though 2260 is a best fit single char there, and <EQUAL TO, COMBINING REVERSE SOLIDUS OVERLAY> would be an even closer fit (if not exact).
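
The NFC/NFD consistency concern raised in these comments can be inspected with Python's standard unicodedata module. This is only a sketch: `report` is an illustrative helper, and the property values it shows depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def report(ch: str) -> dict:
    """Compare the Bidi_Mirrored flag of a character with that of its NFD parts."""
    decomposed = unicodedata.normalize("NFD", ch)
    return {
        "mirrored": bool(unicodedata.mirrored(ch)),
        "nfd": [f"U+{ord(c):04X}" for c in decomposed],
        "nfd_mirrored": [bool(unicodedata.mirrored(c)) for c in decomposed],
    }


if __name__ == "__main__":
    # U+2260 NOT EQUAL TO decomposes to U+003D plus U+0338.
    print(report("\u2260"))
```

If the precomposed character is flagged as mirrored while neither part of its decomposition is, the two canonically equivalent forms would be treated differently by a mirroring implementation that consults only this property.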

Date/Time: Tue Aug 9 23:07:50 CDT 2005
Name: Behdad Esfahbod
Subject: #72: Stability of the Bidi Mirrored Property

I'm all with option (b), although (c) is fine too, for the reasons that follow.

For changing small brackets, as mentioned, they are not expected to be found in RTL text, so the risk is minimal; let's go for consistency. Ditto for quotation marks.

Ornate Parentheses are a bit different: they are always used in RTL text, but I've seen no text using them currently, so again, I'm with changing the property for the sake of consistency.

Tilde and not-sign are not mirrored in Persian, basically because they are symbols only used in math and Latin, imported into Persian, but considering their shape, I don't think a mirrored glyph would make much confusion either. The case of tilde is a bit more complicated though, since it looks like the ARABIC MADDA ABOVE. So, I vote for not mirroring the tildes, but am fine with either way about not-sign, just make them consistent.

73 Representative Glyphs for Arabic Characters U+06DF, U+06E0, and U+06E1

Date/Time: Tue, 21 Jun 2005 10:34:05 -0700
Contact: Mete Kural

Hello Asmus,

I also confirm Tom's observations and proposed changes to the reference glyphs used in the Arabic codepage. The glyphs Tom is suggesting are from the 1924 Cairo printing of the Qur'an, which has been the model for subsequent Qur'an printings in the 20th century and up to today.

Kind regards, Mete

74 Change to Default Localization for NaN in CLDR

(Feedback goes to the CLDR-TC.)