LGR for unspecified language	Selected-RefLGR-but-not-recommended-IdentifierType

This document is mechanically formatted from the above XML file for the LGR. It provides additional summary data and explanatory text. The XML file remains the sole normative specification of the LGR.

Date	2024-01-09
LGR Version	16.0.0
Unicode Version	16.0.0

Description

L2/25-032 (partially updates L2/19-329R)

Alphabetic Characters not recommended in UTS#39 but part of the DNS Root Zone or Second-Level Reference LGR

This document has been submitted as a UTC document. For convenience in documenting the character list it is presented using an LGR template format. A few minor details of the boilerplate in that template may not be applicable in this context and should be disregarded.

This collection comprises 21 characters from ICANN's Root Zone LGR [RZ-LGR] (plus one character found only in the Second-Level Reference LGR [RefLGR]) that are neither Recommended nor designated for optional Inclusion in UTS#39, together with the 16 uppercase equivalents of the cased characters among them, for a total of 38 characters.

Recommendation

These 38 characters should be given Identifier_Type Recommended in light of the extensive research documenting their use by languages with a robust enough infrastructure and usage to qualify for inclusion in the DNS Root Zone LGR as well as in Reference LGRs for the Second Level.

Rationale

The DNS Root Zone is among the most restricted public namespaces, with a mandate to combine security considerations with universal access. It makes little sense for Unicode's default identifiers to differ in the margins, particularly when it comes to letters within a particular script. The number of characters missing from Identifier_Type Recommended is a vanishingly small percentage of the total, and at this writing there is a much larger set of characters that are not part of the [RZ-LGR] yet have Identifier_Type Recommended based on much less thorough analysis.

While any cutoff of the set of characters to be recommended for general identifiers is by necessity somewhat arbitrary in the details, the vetting of characters for the [RZ-LGR] was a particularly thorough process (as described below), and the rationale for each and every included character is documented in the published files. Given that, declining to align the Identifier_Type of any of the characters presented here would seem to call for compelling reasons.

Background

ICANN has published a series of Second-Level Reference Label Generation Rules [RefLGR] for a variety of scripts and languages. Their combined repertoire is a superset of the combined repertoire for the Root Zone Label Generation Rules [RZ-LGR]. It mainly adds characters ineligible for the DNS Root Zone, such as HYPHEN-MINUS and various digit sets, but also some characters not, or not yet, supported in the Root Zone.

The starting point for the Root Zone repertoire is the combined set of IDNA2008 PVALID characters that are letters from any of the Recommended scripts, except Bopomofo, which the IDN community regards as special use (education only). From this set were further subtracted any characters deemed not in general use, such as “historic”, “obsolete”, “phonetic”, “religious use” or otherwise of uncommon use. This analysis relied on available documentation published or archived by the Unicode Consortium but was carried out independently of the Identifier_Type classification. The resulting Maximal Starting Repertoire [MSR] was publicly reviewed and published.

For each script in the Root Zone, an independent team of local experts reviewed the [MSR] and selected a subset suitable for domain names in the Root Zone for that script; see [RZ-LGR]. This review was based on whether any character could be documented as being used for a language in sufficiently widespread everyday common use. For the Han script, established registry practice was used to make the cutoff.

The [RefLGR] was derived from the [RZ-LGR] by relaxing some of the restrictions specific to the Root Zone. Besides ASCII and those native digits that are in common use, this included allowing some intra-word punctuation and joiners, as well as one character found in additional character sequences supported for Sinhala. In addition, the [RefLGR] supports the Thaana script, for which support in the [RZ-LGR] is still in progress. The Tibetan script is not yet supported, as work on reviewing the repertoire has not commenced. The [RefLGR] also supports two Limited_Use scripts; these are ignored in this analysis until such time as UTC receives and approves a proposal to change their status.

The subset of characters found in this document includes one character, U+0DA6 ඦ, that is found in the [RefLGR] as part of an enumerated character sequence but not in the [RZ-LGR].

The Root Zone and Reference LGRs, like the MSR, are specified in a format defined in RFC7940, which uses NFC and is limited to lowercase per IDNA2008. Nevertheless, the research on the usage of any cased character carries through to the case pair. Therefore, the uppercase equivalents of all [RefLGR] characters listed here are included in the recommendation above. A few characters are only used as the base character in a combining sequence in the context of a given language.
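The derivation of the uppercase equivalents can be checked mechanically. The following is a minimal sketch, assuming Python's built-in case mapping (which follows the Unicode Character Database version bundled with the interpreter); the names LOWERCASE, CASELESS and uppercase_equivalents are illustrative and not part of the LGR.

```python
import unicodedata

# Cased lowercase code points listed in the repertoire table.
LOWERCASE = [
    0x0192, 0x0199, 0x01B4, 0x01DD, 0x024D, 0x0253, 0x0254, 0x0256,
    0x0257, 0x025B, 0x0263, 0x0268, 0x0269, 0x0272, 0x0289, 0x0292,
]
# Code points without case (Arabic, Sinhala, Khmer) from the same table.
CASELESS = [0x068E, 0x0DA6, 0x17CB, 0x17CC, 0x17CD, 0x17D0]


def uppercase_equivalents(code_points):
    """Return the set of single-code-point uppercase mappings that differ
    from the input character; caseless characters contribute nothing."""
    result = set()
    for cp in code_points:
        upper = chr(cp).upper()
        if len(upper) == 1 and ord(upper) != cp:
            result.add(ord(upper))
    return result


uppercase = uppercase_equivalents(LOWERCASE + CASELESS)
repertoire = set(LOWERCASE) | set(CASELESS) | uppercase

# Every code point should be unchanged by NFC normalization.
all_nfc = all(
    unicodedata.normalize("NFC", chr(cp)) == chr(cp) for cp in repertoire
)

print(len(uppercase), len(repertoire), all_nfc)
```

Under these assumptions the 22 lowercase or caseless characters yield 16 uppercase equivalents, which matches the total of 38 stated in the Description.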

Note: All characters have tags matching their Identifier_Type values, except that uppercase equivalents are tagged as Uppercase. For all lowercase characters, a reference is given to a source document, from which references can be followed to a discussion of selection criteria. Some characters may have additional references citing attestation of their use. A comment identifies the language(s) that prompted inclusion into the [RefLGR] as well as their [EGIDS] level, and occasionally, some other information.

Discussion and Review

A comparison of the characters to lists of exemplar characters in CLDR and SIL found a match for all but three of them. The comments and/or reference information have been updated to reflect additional reasons for their inclusion.

U+068E ڎ ARABIC LETTER DUL - this is found in a Malaysian character standard for Jawi [JW]
U+0DA6 ඦ SINHALA LETTER SANYAKA JAYANNA - the character is attested in modern use, although in a limited number of sequences.
U+17CC ៌ KHMER SIGN ROBAT - this character is used in loan words, some of which are in common use, notably ព័ត៌មាន (vartamāna, “information”). For additional background and examples, see Section 5.4.7 “Robat Sign” in [Proposal-Khmer].

Contributors

This excerpt was prepared by Asmus Freytag, based on published data found in [RefLGR] and reference information from [MSR]. For details on the process and contributors to those projects, see [RefLGR-Overview], in particular, Section 1, “Overview” and Section 6, “Contributors”. Mark Davis, Michel Suignard and Roozbeh Pournader have contributed feedback.

Repertoire

Repertoire Summary

Number of elements in repertoire	38
Longest code point sequence	1

Repertoire by Code Point

The following table lists the repertoire by code point (or code point sequence). The data in the Script and Name column are extracted from the Unicode character database. Where a comment in the original LGR is equal to the character name, it has been suppressed.

See also the legend provided below the table.

Code Point	Glyph	Script	Name	Ref	Tags	Comment
U+0181	Ɓ	Latin	LATIN CAPITAL LETTER B WITH HOOK	[CLDR]	Uppercase
U+0186	Ɔ	Latin	LATIN CAPITAL LETTER OPEN O	[CLDR]	Uppercase
U+0189	Ɖ	Latin	LATIN CAPITAL LETTER AFRICAN D	[CLDR]	Uppercase
U+018A	Ɗ	Latin	LATIN CAPITAL LETTER D WITH HOOK	[CLDR]	Uppercase
U+018E	Ǝ	Latin	LATIN CAPITAL LETTER REVERSED E	[CLDR]	Uppercase
U+0190	Ɛ	Latin	LATIN CAPITAL LETTER OPEN E	[CLDR]	Uppercase
U+0191	Ƒ	Latin	LATIN CAPITAL LETTER F WITH HOOK	[CLDR]	Uppercase
U+0192	ƒ	Latin	LATIN SMALL LETTER F WITH HOOK	[118], [CLDR], [EW]	Uncommon_Use	Ewe (3)
U+0194	Ɣ	Latin	LATIN CAPITAL LETTER GAMMA	[CLDR]	Uppercase
U+0196	Ɩ	Latin	LATIN CAPITAL LETTER IOTA	[CLDR]	Uppercase
U+0197	Ɨ	Latin	LATIN CAPITAL LETTER I WITH STROKE	[CLDR]	Uppercase
U+0198	Ƙ	Latin	LATIN CAPITAL LETTER K WITH HOOK	[CLDR]	Uppercase
U+0199	ƙ	Latin	LATIN SMALL LETTER K WITH HOOK	[118], [CLDR]	Uncommon_Use	Hausa (2)
U+019D	Ɲ	Latin	LATIN CAPITAL LETTER N WITH LEFT HOOK	[CLDR]	Uppercase
U+01B3	Ƴ	Latin	LATIN CAPITAL LETTER Y WITH HOOK	[CLDR]	Uppercase
U+01B4	ƴ	Latin	LATIN SMALL LETTER Y WITH HOOK	[118], [CLDR]	Uncommon_Use	Dagaare - Burkina Faso (4), Fula (3)
U+01B7	Ʒ	Latin	LATIN CAPITAL LETTER EZH	[CLDR]	Uppercase
U+01DD	ǝ	Latin	LATIN SMALL LETTER TURNED E	[118], [CLDR], [KA]	Uncommon_Use	Kanuri (3)
U+0244	Ʉ	Latin	LATIN CAPITAL LETTER U BAR	[CLDR]	Uppercase
U+024C	Ɍ	Latin	LATIN CAPITAL LETTER R WITH STROKE	[CLDR]	Uppercase
U+024D	ɍ	Latin	LATIN SMALL LETTER R WITH STROKE	[118], [CLDR], [KA]	Uncommon_Use	Kanuri (3)
U+0253	ɓ	Latin	LATIN SMALL LETTER B WITH HOOK	[118], [CLDR]	Uncommon_Use	Hausa (2), Dagaare - Burkina Faso (4), Pulaar (3)
U+0254	ɔ	Latin	LATIN SMALL LETTER OPEN O	[118], [CLDR], [DB]	Uncommon_Use	Dagaare - Burkina Faso (4), Dagbani (Dagomba) (4), Lingala (2), Akan (3), Ewondo (3), Fon (3), Nuer (4), Ga (4), Duala (3), Ewe (3)
U+0256	ɖ	Latin	LATIN SMALL LETTER D WITH TAIL	[118], [CLDR], [EW]	Uncommon_Use	Fon (3), Ewe (3)
U+0257	ɗ	Latin	LATIN SMALL LETTER D WITH HOOK	[118], [CLDR]	Uncommon_Use	Hausa (2), Fula (3)
U+025B	ɛ	Latin	LATIN SMALL LETTER OPEN E	[118], [CLDR], [DB], [EW]	Uncommon_Use	Dagaare - Burkina Faso (4), Lingala (2), Akan (3), Ewondo (3), Dagbani (Dagomba) (4), Fon (3), Mossi (3), Ga (4), Ewe (3), Duala (3), Bambara (4), Nuer (4)
U+0263	ɣ	Latin	LATIN SMALL LETTER GAMMA	[118], [CLDR], [DB], [EW]	Uncommon_Use	Dagbani (Dagomba) (4), Nuer (4), Dinka (4), Ewe (3)
U+0268	ɨ	Latin	LATIN SMALL LETTER I WITH STROKE	[118], [CLDR], [DB]	Uncommon_Use	Cubeo (3), Dagbani (Dagomba) (4), Hixkaryána (4), Maasai (5)
U+0269	ɩ	Latin	LATIN SMALL LETTER IOTA	[118], [CLDR]	Uncommon_Use	Dagaare - Burkina Faso (4), Mossi (3)
U+0272	ɲ	Latin	LATIN SMALL LETTER N WITH LEFT HOOK	[118], [CLDR]	Uncommon_Use	Susu (4), Zarma (4), Bambara (4)
U+0289	ʉ	Latin	LATIN SMALL LETTER U BAR	[118], [CLDR]	Uncommon_Use	Cubeo (3), Maasai (5)
U+0292	ʒ	Latin	LATIN SMALL LETTER EZH	[118], [CLDR], [DB], [FI]	Uncommon_Use	Skolt Sami (2), Dagbani (Dagomba) (4)
U+068E	ڎ	Arabic	ARABIC LETTER DUL	[101], [JW]	Obsolete	Malay (1)
U+0DA6	ඦ	Sinhala	SINHALA LETTER SANYAKA JAYANNA	[122], [CLDR]	Technical, Uncommon_Use	Sinhala (1), see note in RZ-LGR proposal for Sinhala
U+17CB	់	Khmer	KHMER SIGN BANTOC	[115], [CLDR], [KM]	Technical	Khmer, shortens certain vowels
U+17CC	៌	Khmer	KHMER SIGN ROBAT	[115], [KM]	Technical	Khmer, used for loan words like ព័ត៌មាន (vartamāna, information)
U+17CD	៍	Khmer	KHMER SIGN TOANDAKHIAT	[115], [CLDR], [KM]	Technical	Khmer, makes final consonant silent
U+17D0	័	Khmer	KHMER SIGN SAMYOK SANNYA	[115], [CLDR], [KM]	Technical	Khmer, indicates syllable contains a particular short vowel

Legend

Code Point: A code point or code point sequence.
Glyph: The shape displayed depends on the fonts available to your browser.
Script: Shows the script property value from the Unicode Character Database. Combining marks may have the value Inherited and code points used with more than one script may have the value Common.
Name: Shows the character or sequence name from the Unicode Character Database.
Ref: Links to the references associated with the code point or sequence, if any.
Tags: LGR-defined tag values. Any tags matching the Unicode script property are suppressed in this view.
Comment: The comment as given in the XML file. However, if the comment for this row consists only of the code point or sequence name, it is suppressed in this view. By convention, comments starting with “=” denote an alias. If present, the symbol ⍟ marks a default item shared among a set of LGRs.
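For the rows that cite languages, the Comment column appears to follow a "Language (EGIDS level)" pattern with comma-separated entries. The sketch below extracts those pairs; the pattern is an assumption read off the table above, not a rule of the LGR format, and the names ENTRY and parse_languages are illustrative.

```python
import re

# Assumed pattern: a language name followed by its EGIDS level in
# parentheses, with entries separated by commas.
ENTRY = re.compile(r"\s*(.+?)\s*\((\d+)\)(?:,|$)")


def parse_languages(comment):
    """Return (language, EGIDS level) pairs found in a Comment cell.

    Free-text comments without a parenthesised level yield an empty list.
    """
    return [(name, int(level)) for name, level in ENTRY.findall(comment)]


print(parse_languages("Hausa (2), Dagaare - Burkina Faso (4), Pulaar (3)"))
print(parse_languages("Dagbani (Dagomba) (4), Nuer (4)"))
print(parse_languages("Khmer, shortens certain vowels"))
```

Language names that themselves contain parentheses, such as "Dagbani (Dagomba)", are kept whole because only a parenthesised number is treated as a level.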

Variants

This LGR does not specify any variants.

Classes, Rules and Actions

Character Classes

Implicit (except script)	4

The following table lists all named and implicit classes with their definition and a list of their members intersected with the current repertoire (for larger classes, this list is elided).

Name	Definition	Count	Members or Ranges	Comment
implicit	Tag=Obsolete	1	{068E}	The character tagged as Obsolete
implicit	Tag=Technical	5	{0DA6 17CB-17CD 17D0}	Any character tagged as Technical
implicit	Tag=Uncommon_Use	17	{0192 0199 01B4 01DD 024D 0253-0254 0256-0257 025B 0263 0268-0269 0272 0289 0292 0DA6}	Any character tagged as Uncommon_Use
implicit	Tag=Uppercase	16	{0181 0186 0189-018A 018E 0190-0191 0194 0196-0198 019D 01B3 01B7 0244 024C}	Any character tagged as Uppercase

Legend

Members or Ranges: Lists the members of the class as code points (xxx) or as ranges of code points (xxx-yyy). Any class too numerous to list in full is elided with "...".
Tag=ttt: A named or implicit class defined by all code points that share the given tag value (ttt).
Implicit: An anonymous class implicitly defined based on tag value and for which there is no named equivalent.
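The "Members or Ranges" notation can be expanded and checked against the Count column. The following is a minimal sketch, assuming hexadecimal code points separated by spaces with inclusive ranges written as xxx-yyy; the names expand_members and CLASSES are illustrative.

```python
def expand_members(spec):
    """Expand a 'Members or Ranges' cell such as '{0DA6 17CB-17CD 17D0}'
    into a list of integer code points (ranges are inclusive)."""
    code_points = []
    for item in spec.strip("{} ").split():
        first, _, last = item.partition("-")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.extend(range(start, end + 1))
    return code_points


# Tag name -> (members as printed in the class table, stated count).
CLASSES = {
    "Obsolete": ("{068E}", 1),
    "Technical": ("{0DA6 17CB-17CD 17D0}", 5),
    "Uncommon_Use": (
        "{0192 0199 01B4 01DD 024D 0253-0254 0256-0257 025B 0263 "
        "0268-0269 0272 0289 0292 0DA6}",
        17,
    ),
    "Uppercase": (
        "{0181 0186 0189-018A 018E 0190-0191 0194 0196-0198 019D 01B3 "
        "01B7 0244 024C}",
        16,
    ),
}

counts_match = all(
    len(expand_members(members)) == count
    for members, count in CLASSES.values()
)
union = set()
for members, _ in CLASSES.values():
    union.update(expand_members(members))

print(counts_match, len(union))
```

The four counts sum to 39, but U+0DA6 carries both the Technical and Uncommon_Use tags, so the union has 38 members, in agreement with the repertoire summary.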

Whole label evaluation and context rules

The LGR does not define any rules.

Actions

The LGR does not define any actions.

Table of References

The following lists the references cited for specific code points, variants, classes, rules or actions in this LGR.

[EGIDS]	Lewis and Simons, “EGIDS: Expanded Graded Intergenerational Disruption Scale,” documented in [SIL-Ethnologue] and summarized here: https://en.wikipedia.org/wiki/Expanded_Graded_Intergenerational_Disruption_Scale_(EGIDS)
[MSR]	ICANN, “Maximal Starting Repertoire”, https://www.icann.org/resources/pages/msr-2015-06-21-en
[Proposal-Khmer]	“Proposal for Khmer Script Root Zone LGR”, https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
[RefLGR]	ICANN, “Second-Level Reference Label Generation Rules”, https://www.icann.org/resources/pages/second-level-lgr-2015-06-21-en
[RefLGR-Overview]	ICANN, “Reference Label Generation Rules (LGR) for the Second Level — Overview and Summary”, https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-overview-summary-25oct24-en.pdf
[RZ-LGR]	ICANN, “Root Zone Label Generation Rules”, https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en
[SIL-Ethnologue]	David M. Eberhard, Gary F. Simons & Charles D. Fennig (eds.). 2021. Ethnologue: Languages of the World, Twenty-fourth edition. Dallas, Texas: SIL International. Online version available at https://www.ethnologue.com
[101]	Second Level Reference Label Generation Rules for the Arabic Script (und-Arab), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-arabic-script-25oct24-en.html
[115]	Second Level Reference Label Generation Rules for the Khmer Script (und-Khmr), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-khmer-script-25oct24-en.html
[118]	Second Level Reference Label Generation Rules for the Latin Script (und-Latn), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-latin-full-variant-script-25oct24-en.html
[122]	Second Level Reference Label Generation Rules for the Sinhala Script (und-Sinh), 25 October 2024 (XML) https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.xml non-normative HTML presentation: https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-25oct24-en.html
[CLDR]	Common Locale Data Repository, https://cldr.unicode.org/ Attested in at least one set of exemplar characters for a modern language
[DB]	Omniglot, Compiled by Wolfram Siegel, DAGBANI https://www.omniglot.com/charts/dagbani.pdf Accessed on 4 September 2018
[EW]	Omniglot, Ewe (Eʋegbe) https://www.omniglot.com/writing/ewe.htm
[FI]	TRAFICOM, Finnish Transport and Communication Agency, Native Language Characters in domain names (fi-domain), https://www.traficom.fi/en/communications/fi-domains/native-language-characters-domain-names
[JW]	Malay, Information technology - Jawi Coded Character Set for Information Interchange MS 2443:2012, Department of Standards, Malaysia. https://www.jsm.gov.my
[KA]	Wikipedia, Kanuri Language, https://en.wikipedia.org/wiki/Kanuri_language and Omniglot, Kanuri (Kànùrí), https://www.omniglot.com/writing/kanuri.htm
[KM]	Section 5.4 in Khmer Generation Panel, “Proposal for Khmer Script Root Zone LGR”, 15 August 2016, https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf