LGR for unspecified language Selected-RefLGR-but-not-recommended-IdentifierType

This document is mechanically formatted from the above XML file for the LGR. It provides additional summary data and explanatory text. The XML file remains the sole normative specification of the LGR.

Date 2024-01-09
LGR Version 16.0.0
Unicode Version 16.0.0

Description

L2/25-032
Partially updates
L2/19-329R

Alphabetic Characters not recommended in UTS#39 but part of the DNS Root Zone or Second-Level Reference LGR

This document has been submitted as a UTC document. For convenience in documenting the character list it is presented using an LGR template format. A few minor details of the boilerplate in that template may not be applicable in this context and should be disregarded.

This collection comprises 21 characters from ICANN's Root Zone LGR [RZ-LGR] (plus one character only found in the Second-Level Reference LGR [RefLGR]) that are neither Recommended nor designated for optional Inclusion in UTS#39, as well as their uppercase equivalents, for a total of 38 characters.

Recommendation

These 38 characters should be given Identifier_Type Recommended in light of the extensive research documenting their use by languages with a robust enough infrastructure and usage to qualify for inclusion in the DNS Root Zone LGR as well as in Reference LGRs for the Second Level.

Rationale

The DNS Root Zone is among the most restricted public namespaces, with a mandate to combine security considerations with universal access. It makes little sense for Unicode's default identifiers to differ in the margins, particularly when it comes to letters within a particular script. The number of characters missing from Identifier_Type Recommended is a vanishingly small percentage of the total, and at this writing there is a much larger set of characters that are not part of the [RZ-LGR] yet have Identifier_Type Recommended based on much less thorough analysis.

While any cutoff of the set of characters to be recommended for general identifiers is by necessity somewhat arbitrary in the details, the vetting of characters for the [RZ-LGR] was a particularly thorough process (as described below), and the rationale for each and every included character is documented in the published files. Given that, it would seem to take compelling reasons to reject aligning the Identifier_Type of any of the characters presented here.

Background

ICANN has published a series of Second-Level Reference Label Generation Rules [RefLGR] for a variety of scripts and languages. Their combined repertoire is a superset of the combined repertoire for the Root Zone Label Generation Rules [RZ-LGR]. The additions are mainly characters ineligible for the DNS Root Zone, such as HYPHEN-MINUS and various digit sets, but also include some characters not, or not yet, supported in the Root Zone.

The starting point for the Root Zone repertoire is the combined set of IDNA2008 PVALID characters that are letters from any of the Recommended scripts, except Bopomofo, which the IDN community regards as special use (education only). From this set were further subtracted any characters deemed not in general use, such as “historic”, “obsolete”, “phonetic”, “religious use” or otherwise of uncommon use. This analysis relied on available documentation published or archived by the Unicode Consortium but was created independent of the Identifier_Type classification. The resulting Maximal Starting Repertoire [MSR] was publicly reviewed and published.

For each script in the Root Zone, an independent team of local experts reviewed the [MSR] and selected a subset suitable for domain names in the Root Zone for that script; see [RZ-LGR]. This review was based on whether any character could be documented as being used for a language in sufficiently widespread everyday common use. For the Han script, established registry practice was used to make the cutoff.

The [RefLGR] was derived from the [RZ-LGR] by relaxing some of the restrictions specific to the Root Zone. Besides admitting ASCII and those native digits that are in common use, this included allowing some intra-word punctuation and joiners, as well as one character found in additional character sequences supported for Sinhala. In addition, the [RefLGR] supports the Thaana script, while support for it is still in progress for the [RZ-LGR]. The Tibetan script is not yet supported, as work on reviewing the repertoire has not commenced. The [RefLGR] also supports two Limited_Use scripts; these are ignored in this analysis until such time as UTC receives and approves a proposal to change their status.

The subset of characters found in this document includes one character, U+0DA6, that is found in the [RefLGR] as part of an enumerated character sequence, but not in the [RZ-LGR].

The Root Zone and Reference LGRs, like the MSR, are specified in a format defined in RFC7940, which uses NFC and is limited to lowercase per IDNA2008. Nevertheless, the research on the usage of any cased character carries through to the case pair. Therefore, the uppercase equivalents of all [RefLGR] characters listed here are included in the recommendation above. A few characters are only used as the base character in a combining sequence in the context of a given language.
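The case-pair step described above can be checked mechanically. The following is a minimal sketch, assuming Python's built-in Unicode case mapping (`str.upper`) and `unicodedata.normalize` are acceptable stand-ins for the Unicode Character Database; the names `LOWERCASE`, `UPPERCASE`, `uppercase_equivalents` and `is_nfc_stable` are illustrative and not part of any LGR tooling. The code points are transcribed from the repertoire table in this document.

```python
import unicodedata

# Lowercase Latin code points tagged Uncommon_Use in the repertoire table.
LOWERCASE = [
    0x0192, 0x0199, 0x01B4, 0x01DD, 0x024D, 0x0253, 0x0254, 0x0256,
    0x0257, 0x025B, 0x0263, 0x0268, 0x0269, 0x0272, 0x0289, 0x0292,
]

# Code points tagged Uppercase in the repertoire table.
UPPERCASE = [
    0x0181, 0x0186, 0x0189, 0x018A, 0x018E, 0x0190, 0x0191, 0x0194,
    0x0196, 0x0197, 0x0198, 0x019D, 0x01B3, 0x01B7, 0x0244, 0x024C,
]


def uppercase_equivalents(code_points):
    """Return the set of single-character uppercase mappings of the input."""
    result = set()
    for cp in code_points:
        upper = chr(cp).upper()
        # Skip uncased characters and multi-character expansions.
        if len(upper) == 1 and ord(upper) != cp:
            result.add(ord(upper))
    return result


def is_nfc_stable(code_points):
    """True if every code point is unchanged by NFC normalization."""
    return all(
        unicodedata.normalize("NFC", chr(cp)) == chr(cp) for cp in code_points
    )


derived = uppercase_equivalents(LOWERCASE)
print(sorted(f"{cp:04X}" for cp in derived))
print("matches Uppercase rows:", derived == set(UPPERCASE))
print("NFC-stable:", is_nfc_stable(LOWERCASE + UPPERCASE))
```

The uncased Arabic, Sinhala and Khmer characters are left out of `LOWERCASE` because they have no case pair and so contribute no Uppercase rows.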

Note: All characters have tags matching their Identifier_Type values, except that uppercase equivalents are tagged as Uppercase. For all lowercase characters, a reference is given to a source document, from which references can be followed to a discussion of selection criteria. Some characters may have additional references citing attestation of their use. A comment identifies the language(s) that prompted inclusion into the [RefLGR] as well as their [EGIDS] level and, occasionally, some other information.

Discussion and Review

A comparison of the characters to lists of exemplar characters in CLDR and SIL found a match for all but three of them. The comments and/or reference information have been updated to reflect additional reasons for their inclusion.

  • U+068E ڎ ARABIC LETTER DUL - this is found in a Malaysian character standard for Jawi [JW].
  • U+0DA6 ඦ SINHALA LETTER SANYAKA JAYANNA - the character is attested in modern use, although in a limited number of sequences.
  • U+17CC  ៌  KHMER SIGN ROBAT - this character is used with loan words. Such words are in common use, for example ព័ត៌មាន (vartamāna, information). For additional background and examples, see Section 5.4.7 “Robat Sign” in [Proposal-Khmer].

Contributors

This excerpt was prepared by Asmus Freytag, based on published data found in [RefLGR] and reference information from [MSR]. For details on the process and contributors to those projects, see [RefLGR-Overview], in particular, Section 1, “Overview” and Section 6, “Contributors”. Mark Davis, Michel Suignard and Roozbeh Pournader have contributed feedback.

Repertoire

Repertoire Summary

Number of elements in repertoire 38
Longest code point sequence 1

Repertoire by Code Point

The following table lists the repertoire by code point (or code point sequence). The data in the Script and Name column are extracted from the Unicode character database. Where a comment in the original LGR is equal to the character name, it has been suppressed.

See also the legend provided below the table.

Code Point Glyph Script Name Ref Tags Comment
U+0181 Ɓ Latin LATIN CAPITAL LETTER B WITH HOOK [CLDR] Uppercase  
U+0186 Ɔ Latin LATIN CAPITAL LETTER OPEN O [CLDR] Uppercase  
U+0189 Ɖ Latin LATIN CAPITAL LETTER AFRICAN D [CLDR] Uppercase  
U+018A Ɗ Latin LATIN CAPITAL LETTER D WITH HOOK [CLDR] Uppercase  
U+018E Ǝ Latin LATIN CAPITAL LETTER REVERSED E [CLDR] Uppercase  
U+0190 Ɛ Latin LATIN CAPITAL LETTER OPEN E [CLDR] Uppercase  
U+0191 Ƒ Latin LATIN CAPITAL LETTER F WITH HOOK [CLDR] Uppercase  
U+0192 ƒ Latin LATIN SMALL LETTER F WITH HOOK [118], [CLDR], [EW] Uncommon_Use Ewe (3)
U+0194 Ɣ Latin LATIN CAPITAL LETTER GAMMA [CLDR] Uppercase  
U+0196 Ɩ Latin LATIN CAPITAL LETTER IOTA [CLDR] Uppercase  
U+0197 Ɨ Latin LATIN CAPITAL LETTER I WITH STROKE [CLDR] Uppercase  
U+0198 Ƙ Latin LATIN CAPITAL LETTER K WITH HOOK [CLDR] Uppercase  
U+0199 ƙ Latin LATIN SMALL LETTER K WITH HOOK [118], [CLDR] Uncommon_Use Hausa (2)
U+019D Ɲ Latin LATIN CAPITAL LETTER N WITH LEFT HOOK [CLDR] Uppercase  
U+01B3 Ƴ Latin LATIN CAPITAL LETTER Y WITH HOOK [CLDR] Uppercase  
U+01B4 ƴ Latin LATIN SMALL LETTER Y WITH HOOK [118], [CLDR] Uncommon_Use Dagaare - Burkina Faso (4), Fula (3)
U+01B7 Ʒ Latin LATIN CAPITAL LETTER EZH [CLDR] Uppercase  
U+01DD ǝ Latin LATIN SMALL LETTER TURNED E [118], [CLDR], [KA] Uncommon_Use Kanuri (3)
U+0244 Ʉ Latin LATIN CAPITAL LETTER U BAR [CLDR] Uppercase  
U+024C Ɍ Latin LATIN CAPITAL LETTER R WITH STROKE [CLDR] Uppercase  
U+024D ɍ Latin LATIN SMALL LETTER R WITH STROKE [118], [CLDR], [KA] Uncommon_Use Kanuri (3)
U+0253 ɓ Latin LATIN SMALL LETTER B WITH HOOK [118], [CLDR] Uncommon_Use Hausa (2), Dagaare - Burkina Faso (4), Pulaar (3)
U+0254 ɔ Latin LATIN SMALL LETTER OPEN O [118], [CLDR], [DB] Uncommon_Use Dagaare - Burkina Faso (4), Dagbani (Dagomba) (4), Lingala (2), Akan (3), Ewondo (3), Fon (3), Nuer (4), Ga (4), Duala (3), Ewe (3)
U+0256 ɖ Latin LATIN SMALL LETTER D WITH TAIL [118], [CLDR], [EW] Uncommon_Use Fon (3), Ewe (3)
U+0257 ɗ Latin LATIN SMALL LETTER D WITH HOOK [118], [CLDR] Uncommon_Use Hausa (2), Fula (3)
U+025B ɛ Latin LATIN SMALL LETTER OPEN E [118], [CLDR], [DB], [EW] Uncommon_Use Dagaare - Burkina Faso (4), Lingala (2), Akan (3), Ewondo (3), Dagbani (Dagomba) (4), Fon (3), Mossi (3), Ga (4), Ewe (3), Duala (3), Bambara (4), Nuer (4)
U+0263 ɣ Latin LATIN SMALL LETTER GAMMA [118], [CLDR], [DB], [EW] Uncommon_Use Dagbani (Dagomba) (4), Nuer (4), Dinka (4), Ewe (3)
U+0268 ɨ Latin LATIN SMALL LETTER I WITH STROKE [118], [CLDR], [DB] Uncommon_Use Cubeo (3), Dagbani (Dagomba) (4), Hixkaryána (4), Maasai (5)
U+0269 ɩ Latin LATIN SMALL LETTER IOTA [118], [CLDR] Uncommon_Use Dagaare - Burkina Faso (4), Mossi (3)
U+0272 ɲ Latin LATIN SMALL LETTER N WITH LEFT HOOK [118], [CLDR] Uncommon_Use Susu (4), Zarma (4), Bambara (4)
U+0289 ʉ Latin LATIN SMALL LETTER U BAR [118], [CLDR] Uncommon_Use Cubeo (3), Maasai (5)
U+0292 ʒ Latin LATIN SMALL LETTER EZH [118], [CLDR], [DB], [FI] Uncommon_Use Skolt Sami (2), Dagbani (Dagomba) (4)
U+068E ڎ Arabic ARABIC LETTER DUL [101], [JW] Obsolete Malay (1)
U+0DA6 ඦ Sinhala SINHALA LETTER SANYAKA JAYANNA [122], [CLDR] Technical, Uncommon_Use Sinhala (1), see note in RZ-LGR proposal for Sinhala
U+17CB  ់ Khmer KHMER SIGN BANTOC [115], [CLDR], [KM] Technical Khmer, shortens certain vowels
U+17CC  ៌ Khmer KHMER SIGN ROBAT [115], [KM] Technical Khmer, used for loan words like ព័ត៌មាន (vartamāna, information)
U+17CD  ៍ Khmer KHMER SIGN TOANDAKHIAT [115], [CLDR], [KM] Technical Khmer, makes final consonant silent
U+17D0  ័ Khmer KHMER SIGN SAMYOK SANNYA [115], [CLDR], [KM] Technical Khmer, indicates syllable contains a particular short vowel

Legend

Code Point
A code point or code point sequence.
Glyph
The shape displayed depends on the fonts available to your browser.
Script
Shows the script property value from the Unicode Character Database. Combining marks may have the value Inherited and code points used with more than one script may have the value Common.
Name
Shows the character or sequence name from the Unicode Character Database.
Ref
Links to the references associated with the code point or sequence, if any.
Tags
LGR-defined tag values. Any tags matching the Unicode script property are suppressed in this view.
Comment
The comment as given in the XML file. However, if the comment for this row consists only of the code point or sequence name, it is suppressed in this view. By convention, comments starting with “=” denote an alias. If present, the symbol ⍟ marks a default item shared among a set of LGRs.
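To make the relationship between the table columns and the underlying XML concrete, here is a minimal sketch of reading `char` elements in the RFC 7940 format with Python's standard library. The fragment is hypothetical: it was written for illustration from rows of the table above and is not copied from the normative XML file. The function name `parse_rows` and the dictionary keys are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by RFC 7940 for LGR documents.
LGR_NS = "urn:ietf:params:xml:ns:lgr-1.0"

# Hypothetical fragment for illustration only; not the normative XML file.
FRAGMENT = f"""
<lgr xmlns="{LGR_NS}">
  <data>
    <char cp="0192" tag="Uncommon_Use" ref="118 CLDR EW" comment="Ewe (3)"/>
    <char cp="0DA6" tag="Technical Uncommon_Use" ref="122 CLDR" comment="Sinhala (1)"/>
    <char cp="0181" tag="Uppercase" ref="CLDR"/>
  </data>
</lgr>
"""


def parse_rows(xml_text):
    """Turn each <char> element into a dict mirroring the table columns."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for char in root.iter(f"{{{LGR_NS}}}char"):
        # cp holds one or more space-separated hexadecimal code points.
        code_points = [int(h, 16) for h in char.get("cp").split()]
        rows.append({
            "code_points": code_points,
            "glyph": "".join(chr(cp) for cp in code_points),
            "refs": char.get("ref", "").split(),
            "tags": char.get("tag", "").split(),
            "comment": char.get("comment", ""),
        })
    return rows


for row in parse_rows(FRAGMENT):
    code = " ".join(f"U+{cp:04X}" for cp in row["code_points"])
    print(code, row["glyph"], row["refs"], row["tags"], row["comment"])
```

The Script and Name columns are not stored in the XML; as noted above, they are extracted from the Unicode Character Database when the table is generated.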

Variants

This LGR does not specify any variants.

Classes, Rules and Actions

Character Classes

Implicit (except script) 4

The following table lists all named and implicit classes with their definition and a list of their members intersected with the current repertoire (for larger classes, this list is elided).

Name Definition Count Members or Ranges Ref Comment
implicit Tag=Obsolete 1 {068E}   The character tagged as Obsolete
implicit Tag=Technical 5 {0DA6 17CB-17CD 17D0}   Any character tagged as Technical
implicit Tag=Uncommon_Use 17 {0192 0199 01B4 01DD 024D 0253-0254 0256-0257 025B 0263 0268-0269 0272 0289 0292 0DA6}   Any character tagged as Uncommon_Use
implicit Tag=Uppercase 16 {0181 0186 0189-018A 018E 0190-0191 0194 0196-0198 019D 01B3 01B7 0244 024C}   Any character tagged as Uppercase
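The implicit classes in the table above follow directly from the tags in the repertoire. The sketch below, with illustrative helper names (`rows`, `implicit_classes`, `format_members`), groups code points by tag and renders each class using the member and range notation of this table. The tag assignments are transcribed from the repertoire table; note that U+0DA6 carries two tags and therefore appears in two classes.

```python
from collections import defaultdict


def rows(hex_list, *tags):
    """Build (code point, tags) pairs for a run of rows sharing the same tags."""
    return [(int(h, 16), tags) for h in hex_list.split()]


# Tag assignments transcribed from the repertoire table of this document.
REPERTOIRE = (
    rows("0181 0186 0189 018A 018E 0190 0191 0194 "
         "0196 0197 0198 019D 01B3 01B7 0244 024C", "Uppercase")
    + rows("0192 0199 01B4 01DD 024D 0253 0254 0256 "
           "0257 025B 0263 0268 0269 0272 0289 0292", "Uncommon_Use")
    + rows("068E", "Obsolete")
    + rows("0DA6", "Technical", "Uncommon_Use")
    + rows("17CB 17CC 17CD 17D0", "Technical")
)


def implicit_classes(repertoire):
    """Map each tag to the set of code points that carry it."""
    classes = defaultdict(set)
    for cp, tags in repertoire:
        for tag in tags:
            classes[tag].add(cp)
    return classes


def _render(start, end):
    return f"{start:04X}" if start == end else f"{start:04X}-{end:04X}"


def format_members(code_points):
    """Render members as single code points or ranges of consecutive ones."""
    parts = []
    run_start = prev = None
    for cp in sorted(code_points):
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if run_start is not None:
            parts.append(_render(run_start, prev))
        run_start = prev = cp
    if run_start is not None:
        parts.append(_render(run_start, prev))
    return "{" + " ".join(parts) + "}"


classes = implicit_classes(REPERTOIRE)
for tag in sorted(classes):
    print(f"Tag={tag}", len(classes[tag]), format_members(classes[tag]))
```

Because U+0DA6 is counted in both Technical and Uncommon_Use, the class counts sum to 39 even though the repertoire has 38 elements.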

Legend

Members or Ranges
Lists the members of the class as code points (xxx) or as ranges of code points (xxx-yyy). Any class too numerous to list in full is elided with "...".
Tag=ttt
A named or implicit class defined by all code points that share the given tag value (ttt).
Implicit
An anonymous class implicitly defined based on tag value and for which there is no named equivalent.

Whole label evaluation and context rules

The LGR does not define any rules.

Actions

The LGR does not define any actions.

Table of References

The following lists the references cited for specific code points, variants, classes, rules or actions in this LGR.

[EGIDS] Lewis and Simons, “EGIDS: Expanded Graded Intergenerational Disruption Scale,” documented in [SIL-Ethnologue] and summarized here:
https://en.wikipedia.org/wiki/Expanded_Graded_Intergenerational_Disruption_Scale_(EGIDS)
[MSR] ICANN, “Maximal Starting Repertoire”,
https://www.icann.org/resources/pages/msr-2015-06-21-en
[Proposal-Khmer] “Proposal for Khmer Script Root Zone LGR”,
https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
[RefLGR] ICANN, “Second-Level Reference Label Generation Rules”,
https://www.icann.org/resources/pages/second-level-lgr-2015-06-21-en
[RefLGR-Overview] ICANN, “Reference Label Generation Rules (LGR) for the Second Level — Overview and Summary”,
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-overview-summary-25oct24-en.pdf
[RZ-LGR] ICANN, “Root Zone Label Generation Rules”,
https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en
[SIL-Ethnologue] David M. Eberhard, Gary F. Simons & Charles D. Fennig (eds.). 2021. Ethnologue: Languages of the World, Twenty-fourth edition. Dallas, Texas: SIL International. Online version available at
https://www.ethnologue.com
[101] Second Level Reference Label Generation Rules for the Arabic Script (und-Arab), 25 October 2024 (XML)
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.xml
non-normative HTML presentation:
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.html
[115] Second Level Reference Label Generation Rules for the Khmer Script (und-Khmr), 25 October 2024 (XML)
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.xml
non-normative HTML presentation:
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.html
[118] Second Level Reference Label Generation Rules for the Latin Script (und-Latn), 25 October 2024 (XML)
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.xml
non-normative HTML presentation:
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.html
[122] Second Level Reference Label Generation Rules for the Sinhala Script (und-Sinh), 25 October 2024 (XML)
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.xml
non-normative HTML presentation:
https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.html
[CLDR] Common Locale Data Repository,
https://cldr.unicode.org/
Attested in at least one set of exemplar characters for a modern language
[DB] Omniglot, Compiled by Wolfram Siegel, DAGBANI
https://www.omniglot.com/charts/dagbani.pdf
Accessed on 4 September 2018
[EW] Omniglot, Ewe (Eʋegbe)
https://www.omniglot.com/writing/ewe.htm
[FI] TRAFICOM, Finnish Transport and Communication Agency, Native Language Characters in domain names (fi-domain),
https://www.traficom.fi/en/communications/fi-domains/native-language-characters-domain-names
[JW] Malay, Information technology - Jawi Coded Character Set for Information Interchange MS 2443:2012, Department of Standards, Malaysia.
https://www.jsm.gov.my
[KA] Wikipedia, Kanuri Language,
https://en.wikipedia.org/wiki/Kanuri_language
and Omniglot, Kanuri (Kànùrí),
https://www.omniglot.com/writing/kanuri.htm
[KM] Section 5.4 in Khmer Generation Panel, “Proposal for Khmer Script Root Zone LGR”, 15 August 2016,
https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf