L2/08-287
Public Review Issue #122
Proposal for Additional Deprecated Characters
The Unicode Technical Committee is considering giving a number of additional characters the Deprecated property.
The Deprecated property means that the use of the character is discouraged, and it provides a machine-readable table for implementations. However, a number of characters are marked as "discouraged" either in the text of the standard or in the names list without having the property. The goal is either to add them to the set of characters with the Deprecated property or, if there is good reason not to, to remove the phrasing about their being discouraged. These characters are listed below in Table 1.
Table 2 provides additional characters that various people proposed for deprecation when this topic was discussed in the UTC. Table 3 gives the characters deprecated in U5.1, for comparison.
As part of this proposal, we would add text that states the following points more clearly.
- Deprecated does not mean removed, just discouraged -- deprecated characters can still be produced by character encoding converters, for example.
- Whenever implementations require roundtripping to legacy encodings, deprecated characters should not be transformed or filtered -- the same as with normalization, especially NFKC.
- In other circumstances, deprecated characters should be avoided where possible. Where there is a preferred alternative, it should be used instead. That is generally a normalized form, but not always (see below).
- Certain characters cannot occur at all in text that is in normalization form NFC. This effectively discourages the use of those characters, but does not formally constitute deprecation, nor does this PRI suggest that they all be given the Deprecated property. For reference, there is a link to the list of those characters below.
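As an illustration of the second and third points, the following Python sketch leaves text untouched when roundtripping is required and otherwise substitutes a preferred alternative. The mapping `PREFERRED`, the function name `replace_discouraged`, and the `roundtrip_required` flag are hypothetical; the mapping entries are taken from the preferred forms named in Tables 1 and 2, and nothing here is part of the Unicode Character Database.

```python
# Hypothetical mapping from a discouraged character to a preferred form.
# The entries follow the preferred forms named in Tables 1 and 2 of this
# document; the table itself is an illustration, not Unicode data.
PREFERRED = {
    "\u2329": "\u27E8",   # LEFT-POINTING ANGLE BRACKET -> MATHEMATICAL LEFT ANGLE BRACKET
    "\u232A": "\u27E9",   # RIGHT-POINTING ANGLE BRACKET -> MATHEMATICAL RIGHT ANGLE BRACKET
    "\u0149": "\u2019n",  # N PRECEDED BY APOSTROPHE -> U+2019 followed by "n"
}


def replace_discouraged(text: str, roundtrip_required: bool) -> str:
    """Return text with discouraged characters replaced by preferred forms.

    When roundtripping to a legacy encoding is required, the text is
    returned unchanged: nothing is transformed or filtered.
    """
    if roundtrip_required:
        return text
    return "".join(PREFERRED.get(ch, ch) for ch in text)
```

Note that none of these replacements is what normalization would produce, which is why the sketch uses an explicit table rather than a call to a normalizer.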
The UTC would appreciate feedback on this proposal.
Table 1. Discouraged
Characters marked as discouraged (or "not encouraged") either in the names list or in the text of the standard. Characters marked ** cannot occur in NFKC; those marked * cannot occur in NFC (and therefore cannot occur in NFKC either).
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0344\u037E\u0387\u0F73\u0F75\u0F77\u0F79\u0F81\u17A4\u17B4\u17D8\u20A4\u2126\u212A\u212B\u2329\u232A] )
0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS *
037E ( ; ) GREEK QUESTION MARK *
0387 ( · ) GREEK ANO TELEIA *
0F73 TIBETAN VOWEL SIGN II *
0F75 TIBETAN VOWEL SIGN UU *
0F77 TIBETAN VOWEL SIGN VOCALIC RR **
0F79 TIBETAN VOWEL SIGN VOCALIC LL **
0F81 TIBETAN VOWEL SIGN REVERSED II *
17A4 ( ឤ ) KHMER INDEPENDENT VOWEL QAA
17B4 KHMER VOWEL INHERENT AQ
17D8 ( ៘ ) KHMER SIGN BEYYAL
20A4 ( ₤ ) LIRA SIGN
2126 ( Ω ) OHM SIGN *
212A ( K ) KELVIN SIGN *
212B ( Å ) ANGSTROM SIGN *
2329 ( 〈 ) LEFT-POINTING ANGLE BRACKET *
232A ( 〉 ) RIGHT-POINTING ANGLE BRACKET *
The preferred forms for U+2329 and U+232A are:
U+27E8 ( ⟨ ) MATHEMATICAL LEFT ANGLE BRACKET
U+27E9 ( ⟩ ) MATHEMATICAL RIGHT ANGLE BRACKET
while the NFC and NFKC forms are:
U+3008 ( 〈 ) LEFT ANGLE BRACKET
U+3009 ( 〉 ) RIGHT ANGLE BRACKET
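The * and ** markers in Table 1 can be checked with any conformant normalizer. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `table1_marker` and the list `TABLE_1` are hypothetical names introduced for this illustration. A character gets * if NFC changes it, ** if only NFKC changes it, and no marker otherwise.

```python
import unicodedata


def table1_marker(code_point: int) -> str:
    """Return "*", "**" or "" following the marker convention of Table 1."""
    ch = chr(code_point)
    if unicodedata.normalize("NFC", ch) != ch:
        return "*"    # cannot occur in NFC
    if unicodedata.normalize("NFKC", ch) != ch:
        return "**"   # can occur in NFC, but not in NFKC
    return ""


# Code points of Table 1, in the order listed there.
TABLE_1 = [
    0x0344, 0x037E, 0x0387, 0x0F73, 0x0F75, 0x0F77, 0x0F79, 0x0F81,
    0x17A4, 0x17B4, 0x17D8, 0x20A4, 0x2126, 0x212A, 0x212B, 0x2329, 0x232A,
]

for cp in TABLE_1:
    print(f"{cp:04X} {unicodedata.name(chr(cp))} {table1_marker(cp)}".rstrip())

# The NFC form of U+2329 is U+3008, not the preferred U+27E8.
nfc_left_bracket = unicodedata.normalize("NFC", "\u2329")
```

The last line shows why normalization alone does not yield the preferred form for the angle brackets.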
Table 2. Additional Proposed Deprecations
Characters proposed to the UTC during discussion.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0149\u0953\u0954\u0F07] )
0149 ( ʼn ) LATIN SMALL LETTER N PRECEDED BY APOSTROPHE **
The preferred form is 'n or ’n (with U+2019): this is but one of many such abbreviations in Dutch and Afrikaans, all of which are represented with apostrophe plus letter. The NFKC form does not match this preferred form, because it uses U+02BC ( ʼ ) MODIFIER LETTER APOSTROPHE.
0953 ( ॓ ) DEVANAGARI GRAVE ACCENT
0954 ( ॔ ) DEVANAGARI ACUTE ACCENT
0F07 ( ༇ ) TIBETAN MARK YIG MGO TSHEG SHAD MA
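The mismatch described for U+0149 can be seen directly. This is a minimal sketch using Python's standard `unicodedata` module; the helper `prefer_apostrophe_n` and its `apostrophe` parameter are hypothetical names introduced for this illustration.

```python
import unicodedata

N_PRECEDED_BY_APOSTROPHE = "\u0149"

# NFKC yields U+02BC MODIFIER LETTER APOSTROPHE followed by "n",
# which is not the preferred spelling described in Table 2.
nfkc_form = unicodedata.normalize("NFKC", N_PRECEDED_BY_APOSTROPHE)


def prefer_apostrophe_n(text: str, apostrophe: str = "\u2019") -> str:
    """Replace U+0149 with an apostrophe plus "n".

    The default apostrophe is U+2019; pass "'" to get the ASCII form.
    """
    return text.replace(N_PRECEDED_BY_APOSTROPHE, apostrophe + "n")
```

An explicit replacement is used here because no normalization form produces either of the preferred spellings.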
Table 3. U5.1 Deprecated
For comparison, the following characters are Deprecated in U5.1.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:deprecated:])
U+0340 ( ̀ ) COMBINING GRAVE TONE MARK
U+0341 ( ́ ) COMBINING ACUTE TONE MARK
U+17A3 ( ឣ ) KHMER INDEPENDENT VOWEL QAQ
U+17D3 ( ៓ ) KHMER SIGN BATHAMASAT
U+206A ( ) INHIBIT SYMMETRIC SWAPPING
U+206B ( ) ACTIVATE SYMMETRIC SWAPPING
U+206C ( ) INHIBIT ARABIC FORM SHAPING
U+206D ( ) ACTIVATE ARABIC FORM SHAPING
U+206E ( ) NATIONAL DIGIT SHAPES
U+206F ( ) NOMINAL DIGIT SHAPES
U+E0001 ( ) LANGUAGE TAG
U+E0020 ( ) TAG SPACE
...
U+E007F ( ) CANCEL TAG
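For implementations that want a lookup rather than a list, Table 3 could be held as a set of code points. The sketch below is hypothetical: the names `DEPRECATED_U5_1` and `is_deprecated_u5_1` are introduced here, and it assumes that the "..." in the table stands for the contiguous range U+E0020 through U+E007F. The authoritative source remains the Deprecated property in the Unicode Character Database.

```python
# Code points listed in Table 3. Assumption: the "..." in the table stands
# for the contiguous range of tag characters U+E0020..U+E007F.
DEPRECATED_U5_1 = frozenset(
    [0x0340, 0x0341, 0x17A3, 0x17D3]
    + list(range(0x206A, 0x2070))     # U+206A..U+206F
    + [0xE0001]                       # LANGUAGE TAG
    + list(range(0xE0020, 0xE0080))   # TAG SPACE..CANCEL TAG (assumed contiguous)
)


def is_deprecated_u5_1(ch: str) -> bool:
    """Return True if the single character ch is in the set above."""
    return ord(ch) in DEPRECATED_U5_1
```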
Table 4. Characters not occurring in NFC
For comparison, the following characters cannot occur in NFC text.
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:nfc_quick_check=no:])
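Where the online utility is not available, an approximation of this list can be computed locally. The sketch below uses Python's standard `unicodedata` module and treats a character as unable to occur in NFC when normalizing it on its own to NFC changes it. The function name `not_in_nfc` is hypothetical, and the result reflects the Unicode version bundled with the Python interpreter, not necessarily U5.1.

```python
import unicodedata
from typing import Iterable, Iterator


def not_in_nfc(code_points: Iterable[int]) -> Iterator[int]:
    """Yield the code points whose single-character string is changed by NFC."""
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if unicodedata.normalize("NFC", ch) != ch:
            yield cp


# Example: the Letterlike Symbols block, which contains OHM SIGN,
# KELVIN SIGN and ANGSTROM SIGN from Table 1.
letterlike = [f"{cp:04X}" for cp in not_in_nfc(range(0x2100, 0x2150))]
```

Passing `range(0x110000)` scans the whole code space.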