PRI #299 Background: Representing Additional Types of Flags

Last updated: 2015-06-29

A. Background

The Unicode Standard already provides a mechanism to represent flags using pairs of REGIONAL INDICATOR symbols U+1F1E6..U+1F1FF, which were added in Unicode 6.0. The mechanism is documented in the current text of the standard and covered in Annex B of UTR #51.

On several systems, pairs of REGIONAL INDICATOR symbols are used to represent more than 200 flags as emoji. These pairs correspond to unicode_region_subtag two-letter codes, which can represent some regions such as Isle of Man, Guernsey, and Puerto Rico but not others, such as England, Scotland, Wales, U.S. states, or Russian Federation republics.
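As an illustration of the existing mechanism, the following is a minimal sketch of how a two-letter code maps to a REGIONAL INDICATOR pair. The function name region_to_ri_pair is hypothetical, and the sketch does not check that the code is a valid unicode_region_subtag; it only applies the fixed offset between "A" and U+1F1E6.

```python
# Offset between an ASCII capital letter and its REGIONAL INDICATOR SYMBOL:
# U+1F1E6 corresponds to "A" (U+0041).
RI_OFFSET = 0x1F1E6 - ord("A")


def region_to_ri_pair(code: str) -> str:
    """Map a two-letter code such as "IM" to a pair of regional indicators.

    Hypothetical helper: it does not consult CLDR, so it accepts any two
    ASCII letters, whether or not they form a valid region subtag.
    """
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    return "".join(chr(ord(c) + RI_OFFSET) for c in code.upper())


isle_of_man = region_to_ri_pair("IM")  # U+1F1EE U+1F1F2
```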

The unicode_region_subtags defined in CLDR are based on BCP47, which is in turn based on ISO 3166-1 and UN M49 codes.

On some platforms that support a number of emoji flags, there is substantial demand to support additional flags beyond those defined for unicode_region_subtags, such as for the following:

“Country subdivisions” such as England, Scotland, Wales, U.S. states, and Russian Federation republics.
Certain supra-national organizations such as the United Nations.
Certain 3-digit regional codes, used in BCP47 to stabilize ISO 3166-1 codes.

B. Proposal

This proposal describes a mechanism for representing unicode_region_subtags and unicode_subdivision_subtags with TAG characters for designating flags. The proposal will extend UTR #51 and includes several parts:

Use the TAG characters E0030..E0039 (TAG DIGITs) and E0041..E005A (TAG LATIN CAPITAL LETTERs). These TAG characters are default-ignorable, without any visible representation by themselves.
Designate the character U+1F3F3 WAVING WHITE FLAG as the “base” for a subsequent sequence of TAG characters. This character is already encoded, with general category value So.
- This base character is a visible spacing character that suggests a flag, so that implementations that do not support the TAG characters have an indication that a flag is present.
Define valid sequences as the base character followed by a sequence of TAG characters, as specified by either of the following conditions:
- A sequence of three TAG DIGIT characters designating a unicode_region_subtag.
  - This is needed primarily to handle cases in which ISO 3166-1 reassigns 2-letter codes that are already in the BCP 47 registry. It may also be used to designate flags for certain supra-national organizations. See the discussion below.
- A sequence of TAG characters that correspond to a valid unicode_region_subtag followed by a unicode_subdivision_subtag, where the latter is valid for that region.
  - The hyphen used in ISO 3166-2 syntax is not necessary to separate the two parts.
Provide guidelines and constraints for the use of the TAG sequences, to help ensure stable and non-redundant representation of regions and regional subdivisions:
- UTR #51 already specifies that only those codes valid for the LDML unicode_region_subtag are used with regional indicators.
  - This prevents multiple representations of the many regions that have both an ISO 3166-1 code and a UN M49 code, and provides management of code deprecation, etc.
  - Note that this allows the “exceptionally reserved” code EU (European Union), but does not allow the “exceptionally reserved” code UN (United Nations).
- In CLDR 28 (September, 2015), LDML will define a unicode_subdivision_subtag, which also provides validity criteria for the codes used for regional subdivisions. These subdivision codes are based on ISO 3166-2, but also provide stability. Only valid LDML unicode_subdivision_subtags can be used to represent regional subdivision flags.
  - For example, ISO 3166-2 defines “GB-SCT” for Scotland, “GB-WLS” for Wales, “US-DE” for Delaware, and “NO-18” for Nordland (in Norway).
  - Note that a particular unicode_subdivision_subtag may be deprecated, but will not be removed, and thus forever remains valid (though discouraged).
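The construction described in this section can be sketched as follows, assuming the usual correspondence in which each TAG character is the matching ASCII character plus 0xE0000. The function names are hypothetical, and the sketch only builds the sequence; it does not check the result against the CLDR validity data.

```python
BASE = "\U0001F3F3"    # U+1F3F3 WAVING WHITE FLAG, the proposed base
TAG_OFFSET = 0xE0000   # TAG characters mirror ASCII at this offset


def to_tag_sequence(region: str, subdivision: str = "") -> str:
    """Return the base followed by TAG characters spelling region + subdivision.

    No hyphen is emitted between the two parts. Validity of the codes is
    not checked here (an assumption of this sketch).
    """
    code = region + subdivision
    if not (code.isascii() and code.isalnum()):
        raise ValueError(f"expected ASCII letters or digits, got {code!r}")
    return BASE + "".join(chr(TAG_OFFSET + ord(c)) for c in code.upper())


def from_iso_3166_2(iso_code: str) -> str:
    """Convert ISO 3166-2 syntax such as "GB-SCT" by dropping the hyphen."""
    region, _, subdivision = iso_code.partition("-")
    return to_tag_sequence(region, subdivision)


scotland = from_iso_3166_2("GB-SCT")
nordland = from_iso_3166_2("NO-18")
```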

C. Syntax

The syntax for well-formed subdivision flags is:

B((TL{2} (TL|TD){1,4}) | (TD{3} (TL|TD){1,4}?))

This uses the following notation:

B designates the chosen base character (U+1F3F3)

TL designates a TAG LATIN CAPITAL LETTER (A..Z)

TD designates a TAG DIGIT (ZERO..NINE)

Not all syntactically well-formed TAG sequences correspond to an actual flag—only a defined subset can be used.
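A well-formedness check for this syntax might look like the sketch below. It assumes one reading of the expression above, in which the trailing "?" makes the subdivision part optional after a three-digit region; that reading is an interpretation, not something the syntax line states. The name is_well_formed is hypothetical, and passing this check says nothing about whether a sequence corresponds to an actual flag.

```python
import re

B = "\U0001F3F3"                   # base character
TL = "[\U000E0041-\U000E005A]"     # TAG LATIN CAPITAL LETTER A..Z
TD = "[\U000E0030-\U000E0039]"     # TAG DIGIT ZERO..NINE
TAG = f"(?:{TL}|{TD})"

# Either two tag letters plus 1..4 tag letters/digits, or three tag digits
# optionally followed by 1..4 tag letters/digits (assumed reading of "?").
WELL_FORMED = re.compile(
    f"{B}(?:{TL}{{2}}{TAG}{{1,4}}|{TD}{{3}}(?:{TAG}{{1,4}})?)"
)


def is_well_formed(sequence: str) -> bool:
    """True if the whole string matches the syntax; validity is not checked."""
    return WELL_FORMED.fullmatch(sequence) is not None
```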

D. Text break considerations

The TAG characters have general category value Cf and line break property value CM. Consequently, the proposed base character followed by a sequence of TAG characters is already treated as a unit for word, sentence, and line break. Grapheme break property values and rules would need some adjustment; until those are updated in UAX #29, implementations could use a tailored grapheme break to handle these correctly.

The proposal will add language to UTR #51 recommending that each REGIONAL INDICATOR pair used to designate a flag be followed by U+200C ZERO WIDTH NON-JOINER (ZWNJ) to facilitate text break. ZWNJ is in the Extend class for grapheme and word break, and will thus be included in a grapheme or word with the preceding REGIONAL INDICATORs.
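The ZWNJ recommendation could be applied mechanically along the lines of the sketch below. The helper add_zwnj_after_flags is hypothetical; it pairs REGIONAL INDICATOR symbols greedily from left to right, which is itself an assumption about how ambiguous runs should be segmented. The general category lookup uses the Unicode database bundled with the running Python.

```python
import re
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# A pair of REGIONAL INDICATOR symbols not already followed by ZWNJ.
RI_PAIR = re.compile("([\U0001F1E6-\U0001F1FF]{2})(?!\u200C)")


def add_zwnj_after_flags(text: str) -> str:
    """Append ZWNJ after each regional indicator pair that lacks one."""
    return RI_PAIR.sub(lambda m: m.group(1) + ZWNJ, text)


# General category of a TAG character (TAG LATIN CAPITAL LETTER A).
tag_category = unicodedata.category("\U000E0041")
```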

E. Discussion

Note that TAG sequences could also be used to designate flags corresponding to two-letter unicode_region_subtags, using the base character followed by two TAG LATIN CAPITAL LETTERs. This alternative to the use of paired REGIONAL INDICATOR SYMBOL letters to designate unicode_region_subtags has better inherent behavior for text break. However, doing so would result in two possible representations for many flags, so it is not recommended. Note that the TAG sequences do allow for 3-digit region codes for the case where ISO destabilizes codes, by allowing the use of the three-digit forms from BCP47.
Instead of using U+1F3F3 WAVING WHITE FLAG as the base for a TAG sequence, an alternative possibility is encoding a new character, perhaps U+1F1E5 REGIONAL FLAG BASE. Encoding a new character would delay support for the desired flags until it could be encoded (and the character alone would still need some sort of representation as a flag), so is not recommended.
A special fallback appearance should be used for the base followed by any unsupported or invalid sequence of TAG characters. The recommended glyph for the fallback is U+1F3F3 WAVING WHITE FLAG in a dotted rectangle.
Use of UN M49 codes to designate flags for supra-national and international organizations requires additional guidelines. For many M49 codes that designate supra-national regions, there is no reasonable flag; for others there are several possibilities, but all may have some political issues. For example:
- Representing the UN flag: One possibility is to use region 001 (World). However, region 001 could be associated with many other organizations as well; and some people may have political objections to using the UN flag to represent the world.
It is by no means anticipated that flags for all or even most subdivision codes would be supported. Many subdivisions don’t have flags, or don’t have widely recognized flags. We expect that initially, and perhaps in the long term, only a relatively small number of subdivision flags would be widely supported and deployed.
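The fallback recommendation in the discussion above can be sketched as a lookup that decodes the TAG characters and falls back when the result is unsupported or malformed. Everything named here is hypothetical: the SUPPORTED set stands in for whatever flags an implementation ships, and the returned strings are placeholders for real glyphs.

```python
BASE = "\U0001F3F3"    # U+1F3F3 WAVING WHITE FLAG
TAG_OFFSET = 0xE0000   # TAG characters mirror ASCII at this offset

# Hypothetical set of subdivision flags this implementation has glyphs for.
SUPPORTED = {"GBSCT", "GBWLS"}


def decode_tags(sequence: str):
    """Return the ASCII code spelled by the TAG characters after the base.

    Returns None if the string is not the base followed by at least one
    TAG DIGIT or TAG LATIN CAPITAL LETTER and nothing else.
    """
    if not sequence.startswith(BASE) or len(sequence) == len(BASE):
        return None
    chars = []
    for ch in sequence[len(BASE):]:
        cp = ord(ch) - TAG_OFFSET
        if not (0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A):
            return None
        chars.append(chr(cp))
    return "".join(chars)


def glyph_for(sequence: str) -> str:
    """Pick a placeholder glyph name, using the fallback when unsupported."""
    code = decode_tags(sequence)
    if code in SUPPORTED:
        return f"flag:{code}"
    return "fallback"  # e.g. WAVING WHITE FLAG in a dotted rectangle
```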

Appendix: Material for CLDR 28 LDML specification

The following material has already been added to a draft version of UTS #35, the Unicode LDML specification, for CLDR version 28; it may be refined before the release of CLDR 28 in September 2015. The subdivisionContainment data mentioned below will also be in the CLDR 28 file in subdivisions.xml. This preliminary material is included here for reference only and is not part of Public Review Issue 299; feedback on this preliminary material can be provided as described in the Status section of the UTS #35 draft version.

	EBNF	ABNF
...
`unicode_subdivision_subtag`	`= alphanum{1,4} ;`	`= 1*4alphanum`
...

3.6.5 Subdivision Codes

The subdivision codes are based on ISO 3166-2 codes, which have 1..3 ASCII letters or digits. Like BCP47, CLDR needs stable codes, which are not guaranteed for ISO 3166-2 (nor have they been stable in the past).

CLDR thus adds 4-character sequences, also ASCII letters or digits, which can be used for stability. If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2 code is reused for a different subdivision (within the same region), then CLDR will define a new equivalent code using these 4-character sequences.

...

A unicode_subdivision_subtag is valid for a unicode_region_subtag only when the subdivisionContainment element contains a subgroup element where:

the type attribute value is that unicode_region_subtag, and
the contains attribute value contains the unicode_subdivision_subtag.
- The contains attribute value is a space-delimited set, and compared case-insensitively.

For example, the subdivision “ca” (and “CA”) is valid for the region “US” because of the following element:

<subgroup type="US" contains="… CA …"/>
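The validity rule above can be sketched with a small lookup over subgroup elements. The sample XML below is illustrative only and is not actual CLDR data; the function name is hypothetical, and matching the type attribute exactly (rather than case-insensitively) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Illustrative data only; the real values live in CLDR's subdivisions.xml.
SAMPLE = """
<subdivisionContainment>
  <subgroup type="US" contains="AL CA DE"/>
  <subgroup type="GB" contains="ENG SCT WLS"/>
</subdivisionContainment>
"""


def is_valid_subdivision(root, region: str, subdivision: str) -> bool:
    """True if some subgroup has this region as type and lists the subdivision.

    The contains value is treated as a space-delimited set and compared
    case-insensitively, as described above.
    """
    wanted = subdivision.lower()
    for subgroup in root.iter("subgroup"):
        if subgroup.get("type") != region:
            continue
        if wanted in subgroup.get("contains", "").lower().split():
            return True
    return False


root = ET.fromstring(SAMPLE)
california_ok = is_valid_subdivision(root, "US", "ca")
```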

...
