PRI #122, Proposal for Additional Deprecated Characters

Public Review Issue #122

Proposal for Additional Deprecated Characters

The Unicode Technical Committee is considering giving a number of additional characters the Deprecated property.

The Deprecated property indicates that use of a character is discouraged, and it gives implementations a machine-readable list of such characters. However, a number of characters are described as "discouraged" in the text of the standard or in the names list without having the property. The goal is either to add them to the set of characters with the Deprecated property or, where there is good reason not to, to remove the wording that describes them as discouraged. These characters are listed in Table 1 below.

Table 2 lists additional characters that various people proposed for deprecation when this topic was discussed in the UTC. Table 3 gives the characters deprecated in Unicode 5.1, for comparison.

As part of this proposal, we would add text that makes the following points clearer:

Deprecated does not mean removed, just discouraged: deprecated characters can still appear in text, for example as the output of character encoding converters.
Whenever an implementation requires roundtripping to legacy encodings, deprecated characters should not be transformed or filtered, just as such text should not be normalized, especially to NFKC.
In other circumstances, deprecated characters should be avoided where possible. Where there is a preferred alternative, it should be used instead. That alternative is generally a normalized form, but not always (see below).
Certain characters cannot occur at all in text that is in normalization form NFC. This effectively discourages the use of those characters, but it does not formally constitute deprecation, nor does this PRI suggest that they all be given the Deprecated property. For reference, a link to the list of those characters is given below.
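
As a rough illustration of the points above, the following Python sketch leaves text untouched when roundtripping is required and otherwise substitutes a preferred alternative. The function name, the roundtrip_required flag, and both tables are hypothetical. The table contents are a small subset chosen for illustration (the angle bracket mapping is taken from the note under Table 1), not a proposed mapping.

```python
import unicodedata

# Hypothetical table: preferred alternatives that differ from the
# normalized form (the angle brackets discussed under Table 1).
PREFERRED = {
    "\u2329": "\u27E8",  # LEFT-POINTING ANGLE BRACKET -> MATHEMATICAL LEFT ANGLE BRACKET
    "\u232A": "\u27E9",  # RIGHT-POINTING ANGLE BRACKET -> MATHEMATICAL RIGHT ANGLE BRACKET
}

# Hypothetical table: an illustrative subset of discouraged characters for
# which this sketch assumes the NFC form is the preferred alternative.
USE_NFC_FORM = {"\u037E", "\u0387", "\u2126", "\u212A", "\u212B"}


def replace_discouraged(text: str, roundtrip_required: bool = False) -> str:
    """Return text with discouraged characters replaced, unless the caller
    needs to roundtrip to a legacy encoding."""
    if roundtrip_required:
        # Do not transform or filter when roundtripping is required.
        return text
    out = []
    for ch in text:
        if ch in PREFERRED:
            out.append(PREFERRED[ch])
        elif ch in USE_NFC_FORM:
            out.append(unicodedata.normalize("NFC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(replace_discouraged("10 \u2126") == "10 \u03A9")
print(replace_discouraged("10 \u2126", roundtrip_required=True) == "10 \u2126")
```

The sketch replaces only the listed characters rather than normalizing the whole string, so unrelated text is not changed as a side effect.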

The UTC would appreciate feedback on this proposal.

Table 1. Discouraged

Characters marked as discouraged (or "not encouraged") either in the names list or in the text of the standard. Characters marked * cannot occur in NFC; those marked ** can occur in NFC but cannot occur in NFKC.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0344\u037E\u0387\u0F73\u0F75\u0F77\u0F79\u0F81\u17A4\u17B4\u17D8\u20A4\u2126\u212A\u212B\u2329\u232A] )

0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS *
037E ( ; ) GREEK QUESTION MARK *
0387 ( · ) GREEK ANO TELEIA *
0F73 TIBETAN VOWEL SIGN II *
0F75 TIBETAN VOWEL SIGN UU *
0F77 TIBETAN VOWEL SIGN VOCALIC RR **
0F79 TIBETAN VOWEL SIGN VOCALIC LL **
0F81 TIBETAN VOWEL SIGN REVERSED II *
17A4 ( ឤ ) KHMER INDEPENDENT VOWEL QAA
17B4 KHMER VOWEL INHERENT AQ
17D8 ( ៘ ) KHMER SIGN BEYYAL
20A4 ( ₤ ) LIRA SIGN
2126 ( Ω ) OHM SIGN *
212A ( K ) KELVIN SIGN *
212B ( Å ) ANGSTROM SIGN *
2329 ( 〈 ) LEFT-POINTING ANGLE BRACKET *
232A ( 〉 ) RIGHT-POINTING ANGLE BRACKET *

The preferred forms for U+2329 and U+232A are:
U+27E8 ( ⟨ ) MATHEMATICAL LEFT ANGLE BRACKET
U+27E9 ( ⟩ ) MATHEMATICAL RIGHT ANGLE BRACKET
while the NFC and NFKC forms are:
U+3008 ( 〈 ) LEFT ANGLE BRACKET
U+3009 ( 〉 ) RIGHT ANGLE BRACKET
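
The * and ** markers in Table 1 can be spot-checked with a short Python sketch. It assumes that the unicodedata module of the running Python carries decomposition data for these characters, and it tests each code point in isolation, which is enough for the characters in this table. The helper name marker is hypothetical.

```python
import unicodedata

TABLE_1 = [
    0x0344, 0x037E, 0x0387, 0x0F73, 0x0F75, 0x0F77, 0x0F79, 0x0F81,
    0x17A4, 0x17B4, 0x17D8, 0x20A4, 0x2126, 0x212A, 0x212B, 0x2329, 0x232A,
]


def marker(cp: int) -> str:
    """Return '*' if the code point changes under NFC, '**' if it changes
    only under NFKC, and '' if it is stable under both."""
    ch = chr(cp)
    if unicodedata.normalize("NFC", ch) != ch:
        return "*"
    if unicodedata.normalize("NFKC", ch) != ch:
        return "**"
    return ""


for cp in TABLE_1:
    print(f"{cp:04X} {marker(cp)}")

# The NFC forms of the angle brackets are the CJK brackets, not the
# mathematical ones given as preferred forms.
print(unicodedata.normalize("NFC", "\u2329") == "\u3008")
print(unicodedata.normalize("NFC", "\u232A") == "\u3009")
```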

Table 2. Additional Proposed Deprecations

Characters proposed to the UTC during discussion.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0149\u0953\u0954\u0F07] )

0149 ( ŉ ) LATIN SMALL LETTER N PRECEDED BY APOSTROPHE **

The preferred form is 'n or ’n (with U+2019): this is but one of many such abbreviations in Dutch and Afrikaans, all of which are represented with apostrophe plus letter. The NFKC form does not match this preferred form, having U+02BC ( ʼ ) MODIFIER LETTER APOSTROPHE.
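
The mismatch between the NFKC form and the preferred form can be seen with a minimal Python check, assuming the standard unicodedata module:

```python
import unicodedata

nfkc = unicodedata.normalize("NFKC", "\u0149")
print([f"U+{ord(c):04X}" for c in nfkc])  # ['U+02BC', 'U+006E']

# Neither preferred spelling is produced by normalization.
print(nfkc == "'n")        # False: U+0027 APOSTROPHE + n
print(nfkc == "\u2019n")   # False: U+2019 RIGHT SINGLE QUOTATION MARK + n
```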

0953 ( ॓ ) DEVANAGARI GRAVE ACCENT
0954 ( ॔ ) DEVANAGARI ACUTE ACCENT
0F07 ( ༇ ) TIBETAN MARK YIG MGO TSHEG SHAD MA

Table 3. Deprecated in Unicode 5.1

For comparison, the following characters have the Deprecated property in Unicode 5.1.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:deprecated:])

U+0340 ( ̀ ) COMBINING GRAVE TONE MARK
U+0341 ( ́ ) COMBINING ACUTE TONE MARK
U+17A3 ( ឣ ) KHMER INDEPENDENT VOWEL QAQ
U+17D3 ( ៓ ) KHMER SIGN BATHAMASAT
U+206A ( ) INHIBIT SYMMETRIC SWAPPING
U+206B ( ) ACTIVATE SYMMETRIC SWAPPING
U+206C ( ) INHIBIT ARABIC FORM SHAPING
U+206D ( ) ACTIVATE ARABIC FORM SHAPING
U+206E ( ) NATIONAL DIGIT SHAPES
U+206F ( ) NOMINAL DIGIT SHAPES
U+E0001 ( ) LANGUAGE TAG
U+E0020 ( ) TAG SPACE
...
U+E007F ( ) CANCEL TAG
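
Python's unicodedata module does not expose the Deprecated property, so an implementation that wants to flag these characters needs its own table. The sketch below hard-codes the list above. It assumes that the "..." stands for the contiguous range U+E0020 through U+E007F, and the function names are hypothetical.

```python
# Inclusive code point ranges taken from Table 3. The last range assumes
# the elided entries cover every code point from U+E0020 to U+E007F.
DEPRECATED_5_1 = [
    (0x0340, 0x0341),
    (0x17A3, 0x17A3),
    (0x17D3, 0x17D3),
    (0x206A, 0x206F),
    (0xE0001, 0xE0001),
    (0xE0020, 0xE007F),
]


def is_deprecated_5_1(ch: str) -> bool:
    """True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DEPRECATED_5_1)


def find_deprecated(text: str) -> list:
    """Report (index, code point) pairs without altering the text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_deprecated_5_1(ch)
    ]


print(find_deprecated("a\u0340b\u206A"))  # [(1, 'U+0340'), (3, 'U+206A')]
```

Reporting rather than stripping keeps the sketch consistent with the point that deprecated characters are discouraged, not removed.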

Table 4. Characters not occurring in NFC

For comparison, the characters that cannot occur in NFC text can be referenced through the following link:
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:nfc_quick_check=no:])

Table 5. Characters not occurring in NFKC

Characters that don't occur in NFKC are not closely related to deprecation, but for comparison they can be referenced through the following link:

(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:nfkc_quick_check=no:]-[:nfc_quick_check=no:]])
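
The set difference in the link above can be approximated locally. The Python sketch below treats a code point as excluded from a normalization form when normalizing it in isolation changes it. It stands in for the quick-check properties, which the unicodedata module does not expose directly, and the result depends on the Unicode version bundled with the running Python. The function name is hypothetical.

```python
import unicodedata


def nfkc_only_changed(start: int = 0, stop: int = 0x110000) -> list:
    """Code points in [start, stop) that are stable under NFC but change
    under NFKC, i.e. roughly NFKC_QC=No minus NFC_QC=No."""
    result = []
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if (unicodedata.normalize("NFC", ch) == ch
                and unicodedata.normalize("NFKC", ch) != ch):
            result.append(cp)
    return result


# Example: restrict the scan to a small block to keep it fast.
latin = nfkc_only_changed(0x0100, 0x0200)
print(0x0149 in latin)  # True: U+0149 changes only under NFKC
```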