PRI #299 Background: Representing Additional Types of Flags
Last updated: 2015-06-29
A. Background
The Unicode Standard already provides a mechanism to represent flags
using pairs of REGIONAL INDICATOR symbols U+1F1E6..U+1F1FF, which
were added in Unicode 6.0. The mechanism is documented in the current
text of the standard and covered in
Annex B
of UTR #51.
On several systems, pairs of REGIONAL INDICATOR symbols are used to
represent more than 200 flags as emoji. These pairs correspond to
unicode_region_subtag
two-letter codes, which can represent some regions such as Isle of Man,
Guernsey, and Puerto Rico but not others, such as England, Scotland, Wales,
U.S. states, or Russian Federation republics.
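As an illustration of the existing mechanism, the following minimal sketch maps a two-letter code onto a pair of REGIONAL INDICATOR symbols by offsetting each letter from U+1F1E6. The function name `region_to_ri_pair` is hypothetical, and the sketch checks only the shape of the code, not whether it is a valid unicode_region_subtag.

```python
# REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; the letters A..Z follow in order.
RI_BASE = 0x1F1E6


def region_to_ri_pair(code: str) -> str:
    """Map a two-letter region code such as 'IM' to a REGIONAL INDICATOR pair.

    Only the shape of the code is checked here; validity against the CLDR
    list of unicode_region_subtags is outside the scope of this sketch.
    """
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in code.upper())


pair = region_to_ri_pair("IM")  # Isle of Man
print([f"U+{ord(c):04X}" for c in pair])  # ['U+1F1EE', 'U+1F1F2']
```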
The unicode_region_subtags defined in CLDR are based on
BCP47,
which is in turn based on ISO 3166-1 and
UN M49
codes.
On some platforms that support a number of emoji flags, there is
substantial demand to support additional flags beyond those defined for
unicode_region_subtags, such as for the following:
- “Country subdivisions” such as England, Scotland, Wales, U.S. states, and Russian Federation republics.
- Certain supra-national organizations such as the United Nations.
- Certain 3-digit regional codes, used in BCP47 to stabilize ISO 3166-1 codes.
B. Proposal
This proposal describes a mechanism for representing
unicode_region_subtags
and
unicode_subdivision_subtags
with TAG characters for designating flags. The proposal will extend
UTR #51
and includes several parts:
- Use the TAG characters E0030..E0039 (TAG DIGITs) and E0041..E005A
(TAG LATIN CAPITAL LETTERs). These TAG characters are default-ignorable,
without any visible representation by themselves.
- Designate the character U+1F3F3 WAVING WHITE
FLAG as the “base” for a subsequent sequence of TAG characters. This
character is already encoded, with general category value So.
- This base character is a visible spacing character that
suggests a flag, so that implementations that do not support
the TAG characters have an indication that a flag is present.
- Define valid sequences as the base character
followed by a sequence of TAG characters, as specified by either of the
following conditions:
- A sequence of three TAG DIGIT characters designating a
unicode_region_subtag.
- This is needed primarily to handle cases in which
ISO 3166-1 reassigns 2-letter codes that are already in
the BCP 47 registry. It may also be used to designate
flags for certain supra-national organizations. See
the discussion below.
- A sequence of TAG characters that correspond to a valid
unicode_region_subtag followed by a unicode_subdivision_subtag,
where the latter is valid for that region.
- The hyphen used in ISO 3166-2 syntax is not necessary
to separate the two parts.
- Provide guidelines and constraints for the
use of the TAG sequences, to help ensure stable and non-redundant
representation of regions and regional subdivisions:
- UTR #51 already specifies that only those codes valid for the
LDML unicode_region_subtag are used with regional indicators.
- This prevents multiple representations of many regions
which have both an ISO 3166-1 code and a UN M49 code,
and provides management of code deprecation, etc.
- Note that this allows the “exceptionally reserved”
code EU (European Union), but does not allow the
“exceptionally reserved” code UN (United Nations).
- In CLDR 28 (September, 2015), LDML will define a
unicode_subdivision_subtag, which also provides validity criteria
for the codes used for regional subdivisions. These subdivision
codes are based on
ISO 3166-2,
but also provide stability. Only valid LDML
unicode_subdivision_subtags can be used to represent regional
subdivision flags.
- For example, ISO 3166-2 defines “GB-SCT” for Scotland,
“GB-WLS” for Wales, “US-DE” for Delaware, and “NO-18”
for Nordland (in Norway).
- Note that a particular unicode_subdivision_subtag may
be deprecated, but will not be removed, and thus
forever remains valid (though discouraged).
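The proposed encoding can be sketched as follows: the base character is followed by one TAG character per ASCII letter or digit of the code, with the ISO 3166-2 hyphen omitted. The TAG characters sit at U+E0000 plus the ASCII code point. The function name `to_tag_sequence` is hypothetical, and the sketch does not check that the region or subdivision is valid in CLDR.

```python
BASE = "\U0001F3F3"   # U+1F3F3 WAVING WHITE FLAG, the proposed base character
TAG_OFFSET = 0xE0000  # TAG DIGIT ZERO is U+E0030, TAG LATIN CAPITAL LETTER A is U+E0041


def to_tag_sequence(region: str, subdivision: str = "") -> str:
    """Encode a region (plus optional subdivision) as base + TAG characters.

    The two parts are simply concatenated, since no hyphen is needed.
    Validity of the codes is an assumption left to the caller.
    """
    code = region + subdivision
    if not code.isascii() or not code.isalnum():
        raise ValueError(f"expected ASCII letters or digits, got {code!r}")
    return BASE + "".join(chr(TAG_OFFSET + ord(c)) for c in code.upper())


flag = to_tag_sequence("GB", "SCT")  # Scotland, from ISO 3166-2 "GB-SCT"
print([f"U+{ord(c):04X}" for c in flag])
# ['U+1F3F3', 'U+E0047', 'U+E0042', 'U+E0053', 'U+E0043', 'U+E0054']
```

A three-digit region code is encoded the same way, for example `to_tag_sequence("001")`.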
C. Syntax
The syntax for well-formed subdivision flags is:
B((TL{2} (TL|TD){1,4}) | (TD{3} (TL|TD){1,4}?))
This uses the following notation:
- B designates the chosen base character (U+1F3F3)
- TL designates a TAG LATIN CAPITAL LETTER (A..Z)
- TD designates a TAG DIGIT (ZERO..NINE)
Not all syntactically well-formed TAG sequences correspond to an actual
flag—only a defined subset can be used.
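A well-formedness check along these lines could be sketched with a regular expression. This sketch assumes that the trailing `?` in the syntax makes the subdivision part optional after three TAG DIGITs, which matches the statement in the proposal that three TAG DIGITs alone designate a region. The names `WELL_FORMED`, `is_well_formed`, and `tags` are hypothetical. Well-formedness says nothing about whether a sequence is valid or corresponds to an actual flag.

```python
import re

B = "\U0001F3F3"                                     # base character
TL = "[\U000E0041-\U000E005A]"                       # TAG LATIN CAPITAL LETTER A..Z
TD = "[\U000E0030-\U000E0039]"                       # TAG DIGIT ZERO..NINE
TAG = "[\U000E0030-\U000E0039\U000E0041-\U000E005A]"  # TL | TD

# Assumption: the subdivision part is optional after a three-digit region.
WELL_FORMED = re.compile(
    f"{B}(?:{TL}{{2}}{TAG}{{1,4}}|{TD}{{3}}(?:{TAG}{{1,4}})?)"
)


def is_well_formed(sequence: str) -> bool:
    """Return True if the whole string matches the syntax above."""
    return WELL_FORMED.fullmatch(sequence) is not None


def tags(ascii_text: str) -> str:
    """Helper for the examples: shift ASCII text into the TAG block."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)


print(is_well_formed(B + tags("GBSCT")))    # True
print(is_well_formed(B + tags("001")))      # True under the stated assumption
print(is_well_formed(B + tags("GB")))       # False: no subdivision part
print(is_well_formed(B + tags("GBSCOTL")))  # False: subdivision longer than 4
```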
D. Text break considerations
The TAG characters have general category value Cf and line break property
value CM. Consequently, the proposed base character followed by a sequence
of TAG characters is already treated as a unit for word, sentence, and
line break. Grapheme break property values and rules would need some
adjustment; until those are updated in
UAX #29,
implementations could use a tailored grapheme break to handle these
correctly.
The proposal will add language to UTR #51 recommending that each
REGIONAL INDICATOR pair used to designate a flag be followed by
U+200C ZERO WIDTH NON-JOINER (ZWNJ) to facilitate text break.
ZWNJ is in the Extend class for grapheme and word break, and will
thus be included in a grapheme or word with the preceding
REGIONAL INDICATORs.
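The ZWNJ recommendation could be applied to existing text along the following lines. This is only a sketch: the function name `add_zwnj_after_flags` is hypothetical, and the code treats every two consecutive REGIONAL INDICATOR symbols as a flag without checking that the pair is a valid region code.

```python
import re

ZWNJ = "\u200C"  # U+200C ZERO WIDTH NON-JOINER

# Two consecutive REGIONAL INDICATOR symbols not already followed by ZWNJ.
RI_PAIR = re.compile("[\U0001F1E6-\U0001F1FF]{2}(?!\u200C)")


def add_zwnj_after_flags(text: str) -> str:
    """Append ZWNJ after each REGIONAL INDICATOR pair that lacks one."""
    return RI_PAIR.sub(lambda m: m.group(0) + ZWNJ, text)


two_flags = "\U0001F1EE\U0001F1F2\U0001F1EC\U0001F1EC"  # IM followed by GG
marked = add_zwnj_after_flags(two_flags)
print([f"U+{ord(c):04X}" for c in marked])
# ['U+1F1EE', 'U+1F1F2', 'U+200C', 'U+1F1EC', 'U+1F1EC', 'U+200C']
```

With the ZWNJ in place, the boundary between adjacent flags no longer depends on counting REGIONAL INDICATOR symbols from the start of the run.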
E. Discussion
- Note that TAG sequences could also be used to designate
flags corresponding to two-letter
unicode_region_subtags,
using the base character followed by two TAG LATIN CAPITAL LETTERs.
This alternative to the use of paired REGIONAL INDICATOR SYMBOL
letters to designate unicode_region_subtags has better inherent
behavior for text break. However, doing so would result in two
possible representations for many flags, so is not recommended.
Note that the TAG sequences do allow for 3-digit region codes for
the case where ISO destabilizes codes, by allowing the use of
the three digit forms from BCP47.
- Instead of using U+1F3F3 WAVING WHITE FLAG
as the base for a TAG sequence, an alternative possibility is
encoding a new character, perhaps U+1F1E5 REGIONAL FLAG BASE.
Encoding a new character would delay support for the desired flags
until it could be encoded (and the character alone would still need
some sort of representation as a flag), so is not recommended.
- A special fallback appearance should be
used for the base followed by any unsupported or invalid sequence
of TAG characters. The recommended glyph for the fallback is
U+1F3F3 WAVING WHITE FLAG in a dotted rectangle.
- Use of UN M49 codes to designate flags
for supra-national and international organizations requires
additional guidelines. For many M49 codes that designate
supra-national regions, there is no reasonable flag; for others
there are several possibilities, but all may have some political
issues. For example:
- Representing the UN flag: One possibility is to use
region 001 (World). However, region 001 could be
associated with many other organizations as well;
and some people may have political objections to
using the UN flag to represent the world.
- It is by no means anticipated that
flags for all or even most subdivision codes would be supported.
Many subdivisions don’t have flags, or don’t have widely
recognized flags. We would expect that, certainly initially
and perhaps long term, only a relatively small number of
subdivision flags would be widely supported and deployed.
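The fallback behavior discussed above, combined with the expectation that only a small set of flags is supported, might look like the following sketch in a renderer. The name `resolve_tag_sequence` and the contents of `SUPPORTED_FLAGS` are hypothetical; a real implementation would also apply the CLDR validity criteria.

```python
TAG_OFFSET = 0xE0000

# Hypothetical set of codes for which this implementation has flag glyphs.
SUPPORTED_FLAGS = {"GBSCT", "GBWLS"}


def resolve_tag_sequence(tag_chars, supported=SUPPORTED_FLAGS):
    """Return the ASCII code of a supported flag, or None for the fallback glyph.

    `tag_chars` is the non-empty run of characters that follows the base.
    None means: show the white flag in a dotted rectangle.
    """
    if not tag_chars:
        raise ValueError("expected at least one character after the base")
    code = []
    for ch in tag_chars:
        cp = ord(ch)
        is_tag_digit = 0xE0030 <= cp <= 0xE0039
        is_tag_letter = 0xE0041 <= cp <= 0xE005A
        if not (is_tag_digit or is_tag_letter):
            return None  # invalid sequence: use the fallback glyph
        code.append(chr(cp - TAG_OFFSET))
    ascii_code = "".join(code)
    return ascii_code if ascii_code in supported else None


scotland = "".join(chr(TAG_OFFSET + ord(c)) for c in "GBSCT")
delaware = "".join(chr(TAG_OFFSET + ord(c)) for c in "USDE")
print(resolve_tag_sequence(scotland))  # GBSCT
print(resolve_tag_sequence(delaware))  # None (unsupported here, so fallback)
```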
Appendix: Material for CLDR 28 LDML specification
The following material has already been added to a
draft version
of UTS #35, the Unicode LDML specification, for CLDR version 28; it may be
refined before the release of CLDR 28 in September 2015. The subdivisionContainment
data mentioned below will also be in the CLDR 28 file in subdivisions.xml.
This preliminary material is included here for reference only and is not part of
Public Review Issue 299; feedback on this preliminary material can be provided as
described in the Status section of the UTS #35 draft version.
The subdivision codes are based on ISO 3166-2 codes, which
have 1..3 ASCII letters or digits. Like BCP47, CLDR needs stable codes, which
are not guaranteed for ISO 3166-2 (nor have they been stable in the past).
CLDR thus adds 4-character sequences, also ASCII letters or
digits, which can be used for stability. If an ISO 3166-2 code is removed,
it remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2 code
is reused for a different subdivision (within the same region), then CLDR
will define a new equivalent code using these 4-character sequences.
...
A unicode_subdivision_subtag is valid for a
unicode_region_subtag
only when the subdivisionContainment element contains a
subgroup element where:
- the type attribute value is that
unicode_region_subtag,
and
- the contains attribute value contains the
unicode_subdivision_subtag.
- The contains attribute value is a space-delimited
set and is compared case-insensitively.
For example, the subdivision “ca” (and “CA”) is valid for the
region “US” because of the following element:
<subgroup type="US"
contains="… CA …"/>
...
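The validity rule above can be sketched as a lookup over the subdivisionContainment data. The XML in `SAMPLE` is hypothetical and heavily abbreviated; it only mimics the shape of the element shown above, and the function name `is_valid_subdivision` is an assumption. The region is compared exactly here, since the text specifies case-insensitive comparison only for the contains attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for the CLDR data.
SAMPLE = """
<subdivisionContainment>
  <subgroup type="US" contains="CA DE NY"/>
  <subgroup type="GB" contains="SCT WLS"/>
</subdivisionContainment>
""".strip()


def is_valid_subdivision(root, region, subdivision):
    """Check whether `subdivision` is listed for `region` in the data."""
    wanted = subdivision.casefold()
    for subgroup in root.iter("subgroup"):
        if subgroup.get("type") != region:
            continue
        # The contains value is a space-delimited set, compared case-insensitively.
        members = {m.casefold() for m in subgroup.get("contains", "").split()}
        if wanted in members:
            return True
    return False


root = ET.fromstring(SAMPLE)
print(is_valid_subdivision(root, "US", "ca"))   # True
print(is_valid_subdivision(root, "US", "SCT"))  # False
```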