LGR for unspecified language | Selected-RefLGR-but-not-recommended-IdentifierType |
---|---|
This document is mechanically formatted from the above XML file for the LGR. It provides additional summary data and explanatory text. The XML file remains the sole normative specification of the LGR.
Date | 2024-01-09 |
---|---|
LGR Version | 16.0.0 |
Unicode Version | 16.0.0 |
Description
Partially updates
L2/19-329R
Alphabetic Characters not recommended in UTS#39 but part of the DNS Root Zone or Second-Level Reference LGR
This document has been submitted as a UTC document. For convenience in documenting the character list it is presented using an LGR template format. A few minor details of the boilerplate in that template may not be applicable in this context and should be disregarded.
This collection comprises 21 characters from ICANN's Root Zone LGR [RZ-LGR] (plus one character only found in the Second-Level Reference LGR [RefLGR]) that are neither Recommended nor for optional Inclusion in UTS#39, as well as their uppercase equivalents, for a total of 38 characters.
Recommendation
These 38 characters should be given Identifier_Type Recommended in light of the extensive research documenting their use by languages with a robust enough infrastructure and usage to qualify for inclusion in the DNS Root Zone LGR as well as in Reference LGRs for the Second Level.
Rationale
The DNS Root Zone is among the most restricted public namespaces, with a mandate to combine security considerations with universal access. It makes little sense for Unicode's default identifiers to differ in the margins, particularly when it comes to letters within a particular script. The number of characters missing from Identifier_Type Recommended is a vanishingly small percentage of the total, and at this writing there is a much larger set of characters that are not part of the [RZ-LGR] yet have Identifier_Type Recommended based on much less thorough analysis.
While any cutoff of the set of characters to be recommended for general identifiers is by necessity somewhat arbitrary in the details, the vetting of characters for the [RZ-LGR] was a particularly thorough process (as described below), and the rationale for each and every included character is documented in the published files. Given that, declining to align the Identifier_Type of any of the characters presented here would seem to require compelling reasons.
Background
ICANN has published a series of Second-Level Reference Label Generation Rules [RefLGR] for a variety of scripts and languages. Their combined repertoire is a superset of the combined repertoire for the Root Zone Label Generation Rules [RZ-LGR], mainly adding characters ineligible for the DNS Root Zone, such as HYPHEN-MINUS and various digit sets, but also some characters not, or not yet, supported in the Root Zone.
The starting point for the Root Zone repertoire is the combined set of IDNA2008 PVALID characters that are letters from any of the Recommended scripts, except Bopomofo, which the IDN community regards as special use (education only). From this set were further subtracted any characters deemed not in general use, such as “historic”, “obsolete”, “phonetic”, “religious use” or otherwise of uncommon use. This analysis relied on available documentation published or archived by the Unicode Consortium but was created independently of the Identifier_Type classification. The resulting Maximal Starting Repertoire [MSR] was publicly reviewed and published.
For each script in the Root Zone, an independent team of local experts reviewed the [MSR] and selected a subset suitable for domain names in the Root Zone for that script; see [RZ-LGR]. This review was based on whether any character could be documented as being used for a language in sufficiently widespread everyday common use. For the Han script, established registry practice was used to make the cutoff.
The [RefLGR] was derived from the [RZ-LGR] by relaxing some of the restrictions specific to the Root Zone. Besides ASCII and those native digits that are in common use, this relaxation included allowing some intra-word punctuation and joiners, as well as one character found in additional character sequences supported for Sinhala. In addition, the [RefLGR] supports the Thaana script, while support for that script is still in progress for the [RZ-LGR]. The Tibetan script is not yet supported, as work on reviewing the repertoire has not commenced. The [RefLGR] also supports two Limited_Use scripts; these are ignored in this analysis until such time as UTC receives and approves a proposal to change their status.
The subset of characters found in this document includes one character, U+0DA6 ඦ, that is found in the [RefLGR] as part of an enumerated character sequence, but not in the [RZ-LGR].
The Root Zone and Reference LGRs, like the MSR, are specified in a format defined in RFC7940, which uses NFC and is limited to lowercase per IDNA2008. Nevertheless, the research on the usage of any cased character carries through to the case pair. Therefore, the uppercase equivalents of all [RefLGR] characters listed here are included in the recommendation above. A few characters are only used as the base character in a combining sequence in the context of a given language.
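The case-pair reasoning above can be checked mechanically. The following is a minimal sketch, not part of the LGR itself: it takes the 22 non-uppercase code points from the repertoire table in this document and derives their uppercase equivalents with Python's built-in case mapping. The names `NON_UPPERCASE` and `uppercase_equivalents` are hypothetical, and the result depends on the Unicode data shipped with the Python interpreter in use.

```python
# Non-uppercase code points listed in the repertoire table of this document:
# 16 Latin lowercase letters plus 6 caseless Arabic, Sinhala and Khmer characters.
NON_UPPERCASE = [
    0x0192, 0x0199, 0x01B4, 0x01DD, 0x024D, 0x0253, 0x0254, 0x0256,
    0x0257, 0x025B, 0x0263, 0x0268, 0x0269, 0x0272, 0x0289, 0x0292,
    0x068E, 0x0DA6, 0x17CB, 0x17CC, 0x17CD, 0x17D0,
]


def uppercase_equivalents(code_points):
    """Return single-code-point uppercase mappings that differ from the input."""
    result = set()
    for cp in code_points:
        upper = chr(cp).upper()
        # Caseless characters map to themselves and are skipped.
        if len(upper) == 1 and ord(upper) != cp:
            result.add(ord(upper))
    return result


uppercase = uppercase_equivalents(NON_UPPERCASE)
repertoire = set(NON_UPPERCASE) | uppercase

# Counts of non-uppercase, uppercase and combined code points.
print(len(NON_UPPERCASE), len(uppercase), len(repertoire))
print(" ".join(f"{cp:04X}" for cp in sorted(uppercase)))
```

Under these assumptions the sketch yields 22 non-uppercase code points, 16 uppercase equivalents and 38 in total, matching the counts stated in the Description.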
Note: All characters have tags matching their Identifier_Type values, except that uppercase equivalents are tagged as Uppercase. For all lowercase characters, a reference is given to a source document, from where references can be followed up to a discussion of selection criteria. Some characters may have additional references citing attestation of their use. A comment identifies the language(s) that prompted inclusion into the [RefLGR] as well as their [EGIDS] level, and occasionally, some other information.
Discussion and Review
A comparison of the characters to lists of exemplar characters in CLDR and SIL found a match for all but three of them. The comments and/or reference information have been updated to reflect additional reasons for their inclusion.
- U+068E ڎ ARABIC LETTER DUL - this is found in a Malaysian character standard for Jawi [JW]
- U+0DA6 ඦ SINHALA LETTER SANYAKA JAYANNA - the character is attested in modern use, although in a limited number of sequences.
- U+17CC ៌ KHMER SIGN ROBAT - this character is used with loan words. These words are commonly used, especially ព័ត៌មាន (vartamāna, information), etc. For additional background and examples, see Section 5.4.7 “Robat Sign” in [Proposal-Khmer].
Contributors
This excerpt was prepared by Asmus Freytag, based on published data found in [RefLGR] and reference information from [MSR]. For details on the process and contributors to those projects, see [RefLGR-Overview], in particular, Section 1, “Overview” and Section 6, “Contributors”. Mark Davis, Michel Suignard and Roozbeh Pournader have contributed feedback.
Repertoire
Repertoire Summary
Number of elements in repertoire | 38 |
---|---|
Longest code point sequence | 1 |
Repertoire by Code Point
The following table lists the repertoire by code point (or code point sequence). The data in the Script and Name columns are extracted from the Unicode character database. Where a comment in the original LGR is equal to the character name, it has been suppressed.
See also the legend provided below the table.
Code Point | Glyph | Script | Name | Ref | Tags | Comment |
---|---|---|---|---|---|---|
U+0181 | Ɓ | Latin | LATIN CAPITAL LETTER B WITH HOOK | [CLDR] | Uppercase | |
U+0186 | Ɔ | Latin | LATIN CAPITAL LETTER OPEN O | [CLDR] | Uppercase | |
U+0189 | Ɖ | Latin | LATIN CAPITAL LETTER AFRICAN D | [CLDR] | Uppercase | |
U+018A | Ɗ | Latin | LATIN CAPITAL LETTER D WITH HOOK | [CLDR] | Uppercase | |
U+018E | Ǝ | Latin | LATIN CAPITAL LETTER REVERSED E | [CLDR] | Uppercase | |
U+0190 | Ɛ | Latin | LATIN CAPITAL LETTER OPEN E | [CLDR] | Uppercase | |
U+0191 | Ƒ | Latin | LATIN CAPITAL LETTER F WITH HOOK | [CLDR] | Uppercase | |
U+0192 | ƒ | Latin | LATIN SMALL LETTER F WITH HOOK | [118], [CLDR], [EW] | Uncommon_Use | Ewe (3) |
U+0194 | Ɣ | Latin | LATIN CAPITAL LETTER GAMMA | [CLDR] | Uppercase | |
U+0196 | Ɩ | Latin | LATIN CAPITAL LETTER IOTA | [CLDR] | Uppercase | |
U+0197 | Ɨ | Latin | LATIN CAPITAL LETTER I WITH STROKE | [CLDR] | Uppercase | |
U+0198 | Ƙ | Latin | LATIN CAPITAL LETTER K WITH HOOK | [CLDR] | Uppercase | |
U+0199 | ƙ | Latin | LATIN SMALL LETTER K WITH HOOK | [118], [CLDR] | Uncommon_Use | Hausa (2) |
U+019D | Ɲ | Latin | LATIN CAPITAL LETTER N WITH LEFT HOOK | [CLDR] | Uppercase | |
U+01B3 | Ƴ | Latin | LATIN CAPITAL LETTER Y WITH HOOK | [CLDR] | Uppercase | |
U+01B4 | ƴ | Latin | LATIN SMALL LETTER Y WITH HOOK | [118], [CLDR] | Uncommon_Use | Dagaare - Burkina Faso (4), Fula (3) |
U+01B7 | Ʒ | Latin | LATIN CAPITAL LETTER EZH | [CLDR] | Uppercase | |
U+01DD | ǝ | Latin | LATIN SMALL LETTER TURNED E | [118], [CLDR], [KA] | Uncommon_Use | Kanuri (3) |
U+0244 | Ʉ | Latin | LATIN CAPITAL LETTER U BAR | [CLDR] | Uppercase | |
U+024C | Ɍ | Latin | LATIN CAPITAL LETTER R WITH STROKE | [CLDR] | Uppercase | |
U+024D | ɍ | Latin | LATIN SMALL LETTER R WITH STROKE | [118], [CLDR], [KA] | Uncommon_Use | Kanuri (3) |
U+0253 | ɓ | Latin | LATIN SMALL LETTER B WITH HOOK | [118], [CLDR] | Uncommon_Use | Hausa (2), Dagaare - Burkina Faso (4), Pulaar (3) |
U+0254 | ɔ | Latin | LATIN SMALL LETTER OPEN O | [118], [CLDR], [DB] | Uncommon_Use | Dagaare - Burkina Faso (4), Dagbani (Dagomba) (4), Lingala (2), Akan (3), Ewondo (3), Fon (3), Nuer (4), Ga (4), Duala (3), Ewe (3) |
U+0256 | ɖ | Latin | LATIN SMALL LETTER D WITH TAIL | [118], [CLDR], [EW] | Uncommon_Use | Fon (3), Ewe (3) |
U+0257 | ɗ | Latin | LATIN SMALL LETTER D WITH HOOK | [118], [CLDR] | Uncommon_Use | Hausa (2), Fula (3) |
U+025B | ɛ | Latin | LATIN SMALL LETTER OPEN E | [118], [CLDR], [DB], [EW] | Uncommon_Use | Dagaare - Burkina Faso (4), Lingala (2), Akan (3), Ewondo (3), Dagbani (Dagomba) (4), Fon (3), Mossi (3), Ga (4), Ewe (3), Duala (3), Bambara (4), Nuer (4) |
U+0263 | ɣ | Latin | LATIN SMALL LETTER GAMMA | [118], [CLDR], [DB], [EW] | Uncommon_Use | Dagbani (Dagomba) (4), Nuer (4), Dinka (4), Ewe (3) |
U+0268 | ɨ | Latin | LATIN SMALL LETTER I WITH STROKE | [118], [CLDR], [DB] | Uncommon_Use | Cubeo (3), Dagbani (Dagomba) (4), Hixkaryána (4), Maasai (5) |
U+0269 | ɩ | Latin | LATIN SMALL LETTER IOTA | [118], [CLDR] | Uncommon_Use | Dagaare - Burkina Faso (4), Mossi (3) |
U+0272 | ɲ | Latin | LATIN SMALL LETTER N WITH LEFT HOOK | [118], [CLDR] | Uncommon_Use | Susu (4), Zarma (4), Bambara (4) |
U+0289 | ʉ | Latin | LATIN SMALL LETTER U BAR | [118], [CLDR] | Uncommon_Use | Cubeo (3), Maasai (5) |
U+0292 | ʒ | Latin | LATIN SMALL LETTER EZH | [118], [CLDR], [DB], [FI] | Uncommon_Use | Skolt Sami (2), Dagbani (Dagomba) (4) |
U+068E | ڎ | Arabic | ARABIC LETTER DUL | [101], [JW] | Obsolete | Malay (1) |
U+0DA6 | ඦ | Sinhala | SINHALA LETTER SANYAKA JAYANNA | [122], [CLDR] | Technical, Uncommon_Use | Sinhala (1), see note in RZ-LGR proposal for Sinhala |
U+17CB | ់ | Khmer | KHMER SIGN BANTOC | [115], [CLDR], [KM] | Technical | Khmer, shortens certain vowels |
U+17CC | ៌ | Khmer | KHMER SIGN ROBAT | [115], [KM] | Technical | Khmer, used for loan words like ព័ត៌មាន (vartamāna, information) |
U+17CD | ៍ | Khmer | KHMER SIGN TOANDAKHIAT | [115], [CLDR], [KM] | Technical | Khmer, makes final consonant silent |
U+17D0 | ័ | Khmer | KHMER SIGN SAMYOK SANNYA | [115], [CLDR], [KM] | Technical | Khmer, indicates syllable contains a particular short vowel |
- Code Point: A code point or code point sequence.
- Glyph: The shape displayed depends on the fonts available to your browser.
- Script: Shows the script property value from the Unicode Character Database. Combining marks may have the value Inherited and code points used with more than one script may have the value Common.
- Name: Shows the character or sequence name from the Unicode Character Database.
- Ref: Links to the references associated with the code point or sequence, if any.
- Tags: LGR-defined tag values. Any tags matching the Unicode script property are suppressed in this view.
- Comment: The comment as given in the XML file. However, if the comment for this row consists only of the code point or sequence name, it is suppressed in this view. By convention, comments starting with “=” denote an alias. If present, the symbol ⍟ marks a default item shared among a set of LGRs.
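As an illustration of how the columns described above relate to the underlying XML, the following minimal sketch reads `char` elements in the RFC 7940 format using Python's standard library. The `SAMPLE` fragment is hypothetical: it was written for this example from two rows of the table above and is not copied from the normative XML file, and the reference identifiers in it are assumptions. The function name `read_rows` is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# XML namespace defined for LGR documents by RFC 7940.
LGR_NS = {"lgr": "urn:ietf:params:xml:ns:lgr-1.0"}

# Hypothetical fragment for illustration only; not the actual LGR file.
SAMPLE = """<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <data>
    <char cp="0192" tag="Uncommon_Use" ref="118 CLDR EW" comment="Ewe (3)"/>
    <char cp="0DA6" tag="Technical Uncommon_Use" ref="122 CLDR" comment="Sinhala (1)"/>
  </data>
</lgr>
"""


def read_rows(xml_text):
    """Return one dict per <char> element with the fields shown in the table."""
    root = ET.fromstring(xml_text)
    rows = []
    for char in root.findall("lgr:data/lgr:char", LGR_NS):
        # The cp attribute holds one or more space-separated hex code points.
        cps = [int(h, 16) for h in char.get("cp", "").split()]
        rows.append({
            "code_points": cps,
            "glyph": "".join(chr(cp) for cp in cps),
            "tags": char.get("tag", "").split(),
            "refs": char.get("ref", "").split(),
            "comment": char.get("comment", ""),
        })
    return rows


rows = read_rows(SAMPLE)
for row in rows:
    cps = " ".join(f"U+{cp:04X}" for cp in row["code_points"])
    print(cps, row["glyph"], row["tags"], row["refs"], row["comment"])
```

The Script and Name columns are not stored in the XML; as noted above, they are taken from the Unicode Character Database, so this sketch does not attempt to reproduce them.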
Variants
This LGR does not specify any variants.
Classes, Rules and Actions
Character Classes
Implicit (except script) | 4 |
---|---|
The following table lists all named and implicit classes with their definition and a list of their members intersected with the current repertoire (for larger classes, this list is elided).
Name | Definition | Count | Members or Ranges | Ref | Comment |
---|---|---|---|---|---|
implicit | Tag=Obsolete | 1 | {068E} | | The character tagged as Obsolete |
implicit | Tag=Technical | 5 | {0DA6 17CB-17CD 17D0} | | Any character tagged as Technical |
implicit | Tag=Uncommon_Use | 17 | {0192 0199 01B4 01DD 024D 0253-0254 0256-0257 025B 0263 0268-0269 0272 0289 0292 0DA6} | | Any character tagged as Uncommon_Use |
implicit | Tag=Uppercase | 16 | {0181 0186 0189-018A 018E 0190-0191 0194 0196-0198 019D 01B3 01B7 0244 024C} | | Any character tagged as Uppercase |
- Members or Ranges: Lists the members of the class as code points (xxx) or as ranges of code points (xxx-yyy). Any class too numerous to list in full is elided with "...".
- Tag=ttt: A named or implicit class defined by all code points that share the given tag value (ttt).
- Implicit: An anonymous class implicitly defined based on tag value and for which there is no named equivalent.
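The "Members or Ranges" notation used in the class table can be reproduced with a short routine. This is a sketch of one way to collapse consecutive code points into ranges; the function names `format_members` and `_fmt` are hypothetical, and the elision of very large classes with "..." is not handled.

```python
def _fmt(start, end):
    """Format a single code point (xxx) or an inclusive range (xxx-yyy)."""
    return f"{start:04X}" if start == end else f"{start:04X}-{end:04X}"


def format_members(code_points):
    """Collapse code points into the notation used in the class table."""
    parts = []
    run_start = prev = None
    for cp in sorted(set(code_points)):
        if prev is not None and cp == prev + 1:
            # Extend the current run of consecutive code points.
            prev = cp
            continue
        if run_start is not None:
            parts.append(_fmt(run_start, prev))
        run_start = prev = cp
    if run_start is not None:
        parts.append(_fmt(run_start, prev))
    return "{" + " ".join(parts) + "}"


# Members of the implicit class Tag=Technical, taken from the repertoire table.
technical = [0x0DA6, 0x17CB, 0x17CC, 0x17CD, 0x17D0]
print(len(technical), format_members(technical))
```

For the Technical class this gives a count of 5 and the member list `{0DA6 17CB-17CD 17D0}`, in agreement with the table row.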
Whole label evaluation and context rules
The LGR does not define any rules.
Actions
The LGR does not define any actions.
Table of References
The following lists the references cited for specific code points, variants, classes, rules or actions in this LGR.
[EGIDS] | Lewis and Simons, “EGIDS: Expanded Graded Intergenerational Disruption Scale”, documented in [SIL-Ethnologue] and summarized here: https://en.wikipedia.org/wiki/Expanded_Graded_Intergenerational_Disruption_Scale_(EGIDS) |
[MSR] | ICANN, “Maximal Starting Repertoire”, https://www.icann.org/resources/pages/msr-2015-06-21-en |
[Proposal-Khmer] | “Proposal for Khmer Script Root Zone LGR”, https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf |
[RefLGR] | ICANN, “Second-Level Reference Label Generation Rules”, https://www.icann.org/resources/pages/second-level-lgr-2015-06-21-en |
[RefLGR-Overview] | ICANN, “Reference Label Generation Rules (LGR) for the Second Level — Overview and Summary”, https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-overview-summary-25oct24-en.pdf |
[RZ-LGR] | ICANN, “Root Zone Label Generation Rules”, https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en |
[SIL-Ethnologue] | David M. Eberhard, Gary F. Simons & Charles D. Fennig (eds.). 2021. Ethnologue: Languages of the World, Twenty-fourth edition. Dallas, Texas: SIL International. Online version available as https://www.ethnologue.com |
[101] | Second Level Reference Label Generation Rules for the Arabic Script (und-Arab), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.html |
[115] | Second Level Reference Label Generation Rules for the Khmer Script (und-Khmr), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.html |
[118] | Second Level Reference Label Generation Rules for the Latin Script (und-Latn), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.html |
[122] | Second Level Reference Label Generation Rules for the Sinhala Script (und-Sinh), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.html |
[CLDR] | Common Locale Data Repository, https://cldr.unicode.org/ Attested in at least one set of exemplar characters for a modern language |
[DB] | Omniglot, Compiled by Wolfram Siegel, DAGBANI https://www.omniglot.com/charts/dagbani.pdf Accessed on 4 September 2018 |
[EW] | Omniglot, Ewe (Eʋegbe) https://www.omniglot.com/writing/ewe.htm |
[FI] | TRAFICOM, Finnish Transport and Communication Agency, Native Language Characters in domain names (fi-domain), https://www.traficom.fi/en/communications/fi-domains/native-language-characters-domain-names |
[JW] | Malay, Information technology - Jawi Coded Character Set for Information Interchange MS 2443:2012, Department of Standards, Malaysia. https://www.jsm.gov.my |
[KA] | Wikipedia, Kanuri Language, https://en.wikipedia.org/wiki/Kanuri_language and Omniglot, Kanuri (Kànùrí), https://www.omniglot.com/writing/kanuri.htm |
[KM] | Section 5.4 in Khmer Generation Panel, “Proposal for Khmer Script Root Zone LGR”, 15 August 2016, https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf |