Notes on IDN Meeting
M. Davis
Here are my notes on the recent informal IDN meeting. I also list some open questions.
Notes on Consensus
1. Protocol.
In the protocol, there will be precisely three categories:
- Allowed (probably to be renamed as protocol-valid)
- Never (probably to be renamed as disallowed)
- Unassigned
These categories must be checked by both the Registry and the Resolver.
We will also distinguish a set of contextual constraints, as per the following table. These will be described in detail in either the Protocol document or the Bidi document. Here are the constraints, and where they must be checked.
Constraint | Registry | Resolver
BIDI restrictions (see idna-bidi) | MUST | SHOULD
isNFC | MUST | MUST*
Forbid initial Combining Mark | MUST | SHOULD*
Join Controls in limited contexts* | MUST | MUST
Notes:
- Some of the SHOULDs may change to MUST, in particular Forbid initial Combining Mark.
- The isNFC is a MUST for Resolvers if toNFC is not done in resolving.
- There may be some additional contextual constraints
- Some people disagree that we excluded other Cf characters in the meeting.
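Two of the constraints in the table (isNFC and Forbid initial Combining Mark) are simple enough to sketch in code. The following is a minimal illustration only, assuming Python's standard unicodedata module; the function name check_label is hypothetical, and the BIDI and Join Control constraints are deliberately left out because their precise rules are defined elsewhere or not yet settled.

```python
import unicodedata

def check_label(label):
    """Return the names of violated constraints for a label (empty list if none).

    Only covers isNFC and the initial-combining-mark rule; BIDI and
    Join Control checks are intentionally not sketched here.
    """
    problems = []
    # isNFC: the label must already be in Normalization Form C.
    if not unicodedata.is_normalized("NFC", label):
        problems.append("not NFC")
    # Forbid initial Combining Mark: general categories Mn, Mc, Me.
    if label and unicodedata.category(label[0]).startswith("M"):
        problems.append("initial combining mark")
    return problems
```

For example, check_label("e\u0301") reports "not NFC", because the sequence composes to a single precomposed character under NFC.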
For each successive version of Unicode, code points will move from Unassigned into either Allowed or Never. The choice between the latter is on the basis of the Tables rules.
Characters may move between Allowed and Never or have additional contextual requirements added, but only in the case of disasters.
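The movement rules above can be sketched as a comparison between the category assignments for two Unicode versions. This is only an illustration: the function name and the dictionary-based representation are assumptions made for the example, not something from the Tables document. It flags every change that is not a move out of Unassigned, since those are the changes that should happen only in the case of disasters.

```python
ALLOWED, NEVER, UNASSIGNED = "Allowed", "Never", "Unassigned"

def unstable_transitions(old, new):
    """Return {code point: (old category, new category)} for every change
    other than Unassigned -> Allowed or Unassigned -> Never.

    `old` and `new` map code points to category names for two successive
    Unicode versions (hypothetical data structure for this sketch).
    """
    flagged = {}
    for cp, old_cat in old.items():
        new_cat = new.get(cp, old_cat)
        # Leaving Unassigned is the normal, expected kind of movement.
        if new_cat != old_cat and old_cat != UNASSIGNED:
            flagged[cp] = (old_cat, new_cat)
    return flagged
```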
The rules for Allowed use Unicode properties plus a small list of exceptions. The rules are based on those in the Tables document 04 for {ALWAYS, MAYBE YES, MAYBE NO, CONTEXTUAL} (see http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, 04).
2. Registry Advice.
Registries need to have a source of information as to which characters are appropriate for which languages / environments. This is decoupled from the Protocol, and is not a gating item for the Protocol's release, but should be worked on actively in parallel. It may not be an RFC; it could instead be hosted by an organization such as UNESCO (or IANA, ICANN, IETF?), probably with active participation by other organizations that can supply information: UNGEGN, Unicode, and so on.
3. Preprocessing RFC
We did not take up the question of a Preprocessing RFC except insofar as to decide that it was also not a gating item for the Protocol's release.
4. Working Group
We agreed to recommend the formation of a working group, with Vint as the chair. He can call on others to share the load wherever necessary. This WG is expected to be quick to set up and of short duration, and will probably not hold face-to-face meetings.
Actions:
- Patrik, John, Harald, Cary to work on the changes to their respective documents.
- Mark and Erik to verify the BIDI rules.
- Lisa Dusseault to work on chartering the working group.
- Others to assist as called upon.
Misc Notes and Questions
- The names of Allowed/Never/Unassigned may change if we find better names.
- As discussed, with a successive version of Unicode there may be a handful of characters that change relevant properties. There are two reasonable choices that preserve absolute backwards compatibility, and we didn't clearly decide between them.
  - Revise the Tables document if that happens. Any such occurrence should be rare, and the lead time for a Unicode version is long, allowing for time to produce a revision.
  - Add a new, formal Unicode property with the exceptions (corresponding to 2.2.3. Category I - Backward compatibility). It would be empty for Unicode 5.1. This property would be formally derived, and include all and only those few exceptional characters that are needed to ensure that the Table rules support backwards compatibility to previous versions.
- There may be some small changes to Allowed for the exceptional cases (e.g. review of HEBREW PUNCTUATION GERSHAYIM).
- On the working group charter: in my opinion, the possibility of a Preprocessing RFC should be part of the charter, so that the working group can decide whether or not to produce it without requiring rechartering.
- We did not settle on the precise formulation of the contextual constraint on ZWJ/ZWNJ.
Documents
(not yet reflecting the above consensus at the time of this writing)
http://tools.ietf.org/html/draft-klensin-idnabis-issues
http://tools.ietf.org/html/draft-klensin-idnabis-protocol
http://tools.ietf.org/html/draft-faltstrom-idnabis-tables
http://tools.ietf.org/html/draft-alvestrand-idna-bidi
Table Rules (draft restructuring)
Here are draft table rules after a possible simplification of http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, based on the above consensus. It does not represent a consensus on the tables -- it is just my interpretation of how the consensus could be reflected in the tables.
It uses Unicode Regex notation, where [:property=value:] is the set of characters having the specified value for the specified property. However, that notation is not necessary for any final document -- it is only used here for simplicity in relating to Unicode properties. (Actually, Unicode regex also allows Perl syntax, such as \p{Cn}, if preferred.) Note: the order of boolean set operations is important.
The Categories follow the draft Tables 04 document.
IDN=Allowed is defined as
Unicode Regex | Description | Tables 04
[[:L:][:Mn:][:Mc:][:Nd:]] | // restrict to only letters, marks, numbers | Category A
- [:NFKC_QC=N:] | // minus characters unstable under NFKC | Category B
- [:^isCaseFolded:] | // minus characters unstable under case folding | Category C
- [:di:] | // minus default-ignorables | Category D
- [:IDN_Exceptions=Disallowed:] | // minus exceptional exclusions (currently empty) | New (empty)
+ [:IDN_Exceptions=Allowed:] | // plus exceptional inclusions (see below) | Category H*
+ [:Join_Control:] | // plus join controls (with contextual constraints) | Category J*
+ [a-z0-9\-] | // ASCII LDH (only the '-' is actually significant) | Category G
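The table above can be approximated with Python's standard unicodedata module. The sketch below is only an approximation under stated assumptions: NFKC_QC=N is approximated by "the character changes under NFKC", isCaseFolded by "the character is unchanged by str.casefold()", and [:di:] by a hand-written and possibly incomplete list, since unicodedata does not expose Default_Ignorable_Code_Point. The exceptional inclusions use the Tables 04 contents, which are still under debate. All names are hypothetical, and results depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# Possibly incomplete stand-in for [:di:] (assumption, see lead-in).
DEFAULT_IGNORABLE = (
    {0x00AD, 0x034F, 0x061C, 0x115F, 0x1160, 0x17B4, 0x17B5,
     0x3164, 0xFEFF, 0xFFA0}
    | set(range(0x180B, 0x1810))
    | set(range(0x200B, 0x2010))
    | set(range(0x202A, 0x202F))
    | set(range(0x2060, 0x2070))
    | set(range(0xFE00, 0xFE10))
    | set(range(0xE0000, 0xE1000))
)
EXCEPTIONS_DISALLOWED = set()  # currently empty
# Tables 04 contents; the final list is under debate.
EXCEPTIONS_ALLOWED = {0x00B7, 0x05F3, 0x05F4, 0x3005, 0x3007, 0x303B, 0x30FB}
JOIN_CONTROLS = {0x200C, 0x200D}
ASCII_LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
BASE_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd"}

def is_allowed(ch):
    """Apply the set operations in table order: subtractions, then additions."""
    cp = ord(ch)
    base = (
        unicodedata.category(ch) in BASE_CATEGORIES      # letters, marks, digits
        and unicodedata.normalize("NFKC", ch) == ch      # stable under NFKC
        and ch.casefold() == ch                          # stable under case folding
        and cp not in DEFAULT_IGNORABLE                  # minus default-ignorables
        and cp not in EXCEPTIONS_DISALLOWED              # minus exceptional exclusions
    )
    # Additions come last, so join controls survive the [:di:] subtraction.
    return base or cp in EXCEPTIONS_ALLOWED or cp in JOIN_CONTROLS or ch in ASCII_LDH
```

Because the additions are applied after the subtractions, U+200D is allowed even though it is default-ignorable, which is why the order of the set operations matters.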
Category J is changed from {Cf} to just Join_Controls.
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
Category H is still under debate. Tables 04 has the following contents.
U+00B7 ( · ) MIDDLE DOT
U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
U+3005 ( 々 ) IDEOGRAPHIC ITERATION MARK
U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO
U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK
U+30FB ( ・ ) KATAKANA MIDDLE DOT
My opinion:
- U+00B7 ( · ) MIDDLE DOT is needed for Catalan orthography.
- U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO is needed to complete CJK numbers.
- U+3005 ( 々 ) IDEOGRAPHIC ITERATION MARK is needed because personal and geographical names are written using this character, and they would be misspelled if they were written with repeated characters.
- The other 4 listed exceptions are not required for IDN, since they are all optional. As far as the geresh and gershayim are concerned, they may be useful for abbreviations in Hebrew, but according to the information I've gotten, they are not required any more than we need ":", which is in abbreviations like "c:a" in Swedish. The vertical iteration mark is just a presentation form for vertical writing. And the katakana middle dot is not part of words -- it's a word separator, like many others that we disallow.
For background information, see the Wikipedia articles Geresh, Gershayim, Iteration_mark, and Interpunct.
The Unicode utilities can be used to view the above sets.
Unassigned is defined as
Unicode Regex | Description | Tables 04
[:Cn:] | // unassigned code points | Category K
Disallowed is defined as
Unicode Regex | Description
[\u0000-\U0010FFFF] | // All Unicode code points
- Unassigned | // minus Unassigned
- Allowed | // minus Allowed
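Taken together, the three definitions partition all code points. A minimal sketch of that partition follows, assuming Python's unicodedata module for the [:Cn:] test and taking the Allowed test as a caller-supplied predicate. The names classify and toy_allowed are hypothetical, and toy_allowed is a deliberately simplified stand-in, not the real Allowed rules.

```python
import unicodedata

def classify(ch, is_allowed):
    """Place a character in exactly one of the three categories.

    `is_allowed` is any predicate implementing the Allowed rules;
    Disallowed is everything that is neither Unassigned nor Allowed.
    """
    if unicodedata.category(ch) == "Cn":  # [:Cn:] per the bundled Unicode version
        return "Unassigned"
    if is_allowed(ch):
        return "Allowed"
    return "Disallowed"

def toy_allowed(ch):
    """Toy predicate for demonstration only: ASCII LDH."""
    return ch.isascii() and (ch.islower() or ch.isdigit() or ch == "-")

for sample in ("a", "!", "\uffff"):
    print(f"U+{ord(sample):04X}", classify(sample, toy_allowed))
```

Which code points count as Unassigned depends on the Unicode version that the Python build ships with, mirroring how the categories shift between Unicode versions.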