Notes on IDN Meeting

M. Davis

Here are my notes on the recent informal IDN meeting. I also list some open questions.

Notes on Consensus
Documents
Table Rules (draft restructuring)

Notes on Consensus

1. Protocol.

In the protocol, there will be precisely three categories:

Allowed (probably to be renamed as protocol-valid)
Never (probably to be renamed as disallowed)
Unassigned

These categories must be checked by both the Registry and the Resolver.

We will also distinguish a set of contextual constraints, as per the following table. These will be described in detail in either the Protocol document or the Bidi document. Here are the constraints, and where they must be checked.

Constraint	Registry	Resolver
BIDI restrictions (see idna-bidi)	MUST	SHOULD
isNFC	MUST	MUST*
Forbid initial Combining Mark	MUST	SHOULD*
Join Controls in limited contexts*	MUST	MUST

Notes:

Some of the SHOULDs may change to MUST, in particular the one for Forbid initial Combining Mark.
The isNFC is a MUST for Resolvers if toNFC is not done in resolving.
There may be some additional contextual constraints
Some people disagree that we excluded other Cf characters in the meeting.
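
As an illustration only, the two simplest constraints in the table (isNFC and Forbid initial Combining Mark) could be checked along the following lines. This is a minimal sketch, not text agreed at the meeting: the function names are hypothetical, "Combining Mark" is assumed to mean any General_Category M* character, and the BIDI and Join Control checks are omitted because their rules are defined elsewhere or not yet settled.

```python
import unicodedata


def is_nfc(label):
    """Return True if the label is already in Normalization Form C."""
    return unicodedata.normalize("NFC", label) == label


def has_initial_combining_mark(label):
    """Return True if the first code point is a mark (assumed: Mn, Mc, or Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


def registry_check(label):
    """Return the names of the violated constraints; an empty list means pass.

    Only the two constraints sketched here are covered. BIDI and Join Control
    checks would have to be added from their own specifications.
    """
    problems = []
    if not is_nfc(label):
        problems.append("isNFC")
    if has_initial_combining_mark(label):
        problems.append("initial combining mark")
    return problems
```

A Resolver could run the same checks, treating a failure of a SHOULD-level constraint as a warning rather than a rejection.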

For each successive version of Unicode, code points will move from Unassigned into either Allowed or Never. The choice between the latter two is made on the basis of the Tables rules.

Characters may move between Allowed and Never or have additional contextual requirements added, but only in the case of disasters.
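
The movement rules above can be sketched as a small check that compares a code point's category in two successive versions. The names below are hypothetical, and treating a move back to Unassigned as invalid is my assumption (it follows from Unicode never unassigning a code point), not something stated at the meeting.

```python
ALLOWED, NEVER, UNASSIGNED = "Allowed", "Never", "Unassigned"


def transition_status(old, new):
    """Classify a category change between two successive Unicode versions."""
    if old == new:
        return "unchanged"
    if old == UNASSIGNED:
        # Newly assigned code points move into Allowed or Never.
        return "normal"
    if new == UNASSIGNED:
        # Assumption: an assigned code point never becomes unassigned again.
        return "invalid"
    # Allowed <-> Never: permitted only "in the case of disasters".
    return "exceptional"
```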

The rules for Allowed use Unicode properties plus a small list of exceptions. The rules are based on those in the Tables document 04 for {ALWAYS, MAYBE YES, MAYBE NO, CONTEXTUAL} (see http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, 04).

2. Registry Advice.

Registries need to have a source of information as to which characters are appropriate for which languages / environments. This is decoupled from the Protocol, and is not a gating item for the Protocol's release, but should be worked on actively in parallel. It may not be an RFC; it could perhaps be hosted by an organization such as UNESCO (or IANA, ICANN, IETF?), but probably with active participation by other organizations that can supply information: UNGEGN, Unicode, and so on.

3. Preprocessing RFC

We did not take up the question of a Preprocessing RFC except insofar as to decide that it was also not a gating item for the Protocol's release.

4. Working Group

We agreed to recommend the formation of a working group, with Vint as the chair. He can call on others to share the load wherever necessary. This WG is expected to be set up quickly, to be of short duration, and probably not to hold face-to-face meetings.

Actions:

Patrik, John, Harald, Cary to work on the changes to their respective documents
Mark and Erik to verify the BIDI rules.
Lisa Dusseault to work on chartering the working group
Others to assist as called upon.

Misc Notes and Questions

The names of Allowed/Never/Unassigned may change if we find better names.
As discussed, with a successive version of Unicode there may be a handful of characters that change relevant properties. There are two reasonable choices that preserve absolute backwards compatibility, and we didn't clearly decide between them.

Revise the Tables document if that happens. Any such occurrence should be rare, and the lead time for a Unicode version is long, allowing for time to produce a revision.
Add a new, formal Unicode property with the exceptions (corresponding to 2.2.3. Category I - Backward compatibility). It would be empty for Unicode 5.1. This property would be formally derived, and include all and only those few exceptional characters that are needed to ensure that the Table rules support backwards compatibility to previous versions.

There may be some small changes to Allowed for the exceptional cases (e.g., review of HEBREW PUNCTUATION GERSHAYIM).
On the working group charter: in my opinion, the possibility of a Preprocessing RFC should be part of the charter, so that the working group can decide whether or not to produce it without requiring rechartering.
We did not settle on the precise formulation of the contextual constraint on ZWJ/ZWNJ.

(My recommendation would be to base it on UAX #31, Section 2.3 Layout and Format Control Characters. This is currently in the equivalent of Last Call, and will be final in March.)

Documents

(not yet reflecting the above consensus at the time of this writing)

http://tools.ietf.org/html/draft-klensin-idnabis-issues
http://tools.ietf.org/html/draft-klensin-idnabis-protocol
http://tools.ietf.org/html/draft-faltstrom-idnabis-tables
http://tools.ietf.org/html/draft-alvestrand-idna-bidi

Table Rules (draft restructuring)

Here are draft table rules after a possible simplification of http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, based on the above consensus. These rules do not represent a consensus on the tables -- they are just my interpretation of how the consensus could be reflected in the tables.

It uses Unicode Regex notation, where [:property=value:] is the set of characters having the specified value for the specified property. However, that notation is not necessary for any final document -- it is only used here for simplicity in relating to Unicode properties. (Actually, Unicode regex also allows Perl syntax, such as \p{Cn}, if preferred.) Note: the order of boolean set operations is important.

The Categories follow the draft Tables 04 document.

IDN=Allowed is defined as

Unicode Regex	Description	Tables 04
`[[:L:][:Mn:][:Mc:][:Nd:]]`	// restrict to only letters, marks, numbers	Category A
`- [:NFKC_QC=N:]`	// minus characters unstable under NFKC	Category B
`- [:^isCaseFolded:]`	// minus characters unstable under case folding	Category C
`- [:di:]`	// minus default-ignorables	Category D
`- [:IDN_Exceptions=Disallowed:]`	// minus exceptional exclusions (currently empty)	New (empty)
`+ [:IDN_Exceptions=Allowed:]`	// plus exceptional inclusions (see below)	Category H*
`+ [:Join_Control:]`	// plus join controls (with contextual constraints)	Category J*
`+ [a-z0-9\-]`	// ASCII LDH (only the '-' is actually significant)	Category G
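
To show why the order of the set operations matters, here is a rough Python sketch that applies the rows of the table in sequence to a single code point. It is an approximation under stated assumptions, not a reference implementation: `is_allowed` and its parameters are hypothetical names, [:NFKC_QC=N:] is approximated by "changes under NFKC", [:^isCaseFolded:] by "changes under str.casefold()", and [:di:] by a small, deliberately incomplete sample set, since Python's unicodedata module does not expose Default_Ignorable_Code_Point. Results also depend on the Unicode version bundled with the Python build.

```python
import unicodedata

LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
JOIN_CONTROLS = {0x200C, 0x200D}
# Assumption: an incomplete stand-in for [:di:], for illustration only.
DEFAULT_IGNORABLE_SAMPLE = {0x00AD, 0x034F, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def is_allowed(cp, exceptions_allowed=frozenset(), exceptions_disallowed=frozenset()):
    """Apply the table rows top to bottom to one code point."""
    ch = chr(cp)
    cat = unicodedata.category(ch)
    # Category A: letters, nonspacing/spacing marks, decimal digits.
    ok = cat.startswith("L") or cat in {"Mn", "Mc", "Nd"}
    # Category B (approximation): drop characters that change under NFKC.
    ok = ok and unicodedata.normalize("NFKC", ch) == ch
    # Category C (approximation): drop characters that change under case folding.
    ok = ok and ch.casefold() == ch
    # Category D (sample only): drop default-ignorable code points.
    ok = ok and cp not in DEFAULT_IGNORABLE_SAMPLE
    # Exceptional exclusions, then exceptional inclusions.
    ok = ok and cp not in exceptions_disallowed
    ok = ok or cp in exceptions_allowed
    # Join controls are added back after the default-ignorables were removed.
    ok = ok or cp in JOIN_CONTROLS
    # ASCII LDH last; only '-' is not already covered by the rows above.
    ok = ok or ch in LDH
    return ok
```

Because the join controls are themselves default-ignorable, adding them before the `- [:di:]` step would remove them again, which is the sense in which the order is significant. For example, `is_allowed(0x00B7, exceptions_allowed={0x00B7})` admits MIDDLE DOT only because the exception set is applied after the subtractions.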

Category J is changed from {Cf} to just Join_Controls.

U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER

Category H is still under debate. Tables 04 has the following contents.

U+00B7 ( · ) MIDDLE DOT
U+05F3 ( ‎׳‎ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ‎״‎ ) HEBREW PUNCTUATION GERSHAYIM
U+3005 ( 々 ) IDEOGRAPHIC ITERATION MARK
U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO
U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK
U+30FB ( ・ ) KATAKANA MIDDLE DOT

My opinion:

U+00B7 ( · ) MIDDLE DOT is needed for Catalan orthography
U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO is needed to complete CJK numbers
U+3005 ( 々 ) IDEOGRAPHIC ITERATION MARK is needed because personal and geographical names are written using this character, and they would be misspelled if they were written with repeated characters.
The other 4 listed exceptions are not required for IDN, since they are all optional. As far as the geresh and gershayim are concerned, they may be useful for abbreviations in Hebrew, but according to the information I've gotten, they are no more required than ":", which is used in abbreviations like "c:a" in Swedish. The vertical iteration mark is just a presentation form for vertical writing. And the katakana middle dot is not part of words -- it's a word separator, like many others that we disallow. For background information, see the Wikipedia articles Geresh, Gershayim, Iteration_mark, and Interpunct.

The Unicode utilities can be used to view the above, for example:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u00B7\u05F3\u05F4\u3005\u3007\u303B\u30FB]

Unassigned is defined as

Unicode Regex	Description	Tables 04
`[:Cn:]`	// unassigned code points	Category K

Disallowed is defined as

Unicode Regex	Description
`[\u0000-\U0010FFFF]`	// All Unicode code points
`- Unassigned`	// minus Unassigned
`- Allowed`	// minus Allowed
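
Taken together, the three definitions partition the code space. A minimal sketch of that partition follows, with the Allowed test passed in as a predicate so the block stands alone. `classify` is a hypothetical name, and which code points count as unassigned depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def classify(cp, is_allowed):
    """Return 'Unassigned', 'Allowed', or 'Disallowed' for one code point.

    is_allowed is any predicate implementing the IDN=Allowed rules.
    """
    if unicodedata.category(chr(cp)) == "Cn":
        return "Unassigned"
    if is_allowed(cp):
        return "Allowed"
    # Everything that is neither Unassigned nor Allowed.
    return "Disallowed"


# Toy predicate for demonstration only; it is not the real Allowed set.
def toy_allowed(cp):
    return chr(cp) in "abc-"
```

With the toy predicate, `classify(0x61, toy_allowed)` gives 'Allowed' and `classify(0x41, toy_allowed)` gives 'Disallowed'.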

L2/08-099