L2/08-220
Date: Tue, 13 May 2008
Source: Mark Davis
Subject: Deprecated character proposal
=====
Peter and I had the following action:
B.14.2 Deprecated characters [Davis, L2/08-018]
[114-A106] Action Item for Mark Davis, Peter Edberg: Prepare a proposal on deprecated characters for the next UTC meeting. See L2/08-018.
The proposal is to post the following (with any amendments from the meeting) plus Table 1 as a PRI (that is, excluding Tables 2 and 3).
====
The characters listed in Table 1 below are discouraged or strongly discouraged in the Unicode Standard, either in the text or in the charts. The mechanism for making that status known to implementers is by giving them the property Deprecated, so the proposal is to add them to the set of characters with that property. If after discussion and review, any of these should not be given the property Deprecated, then the phrasing about their being discouraged should be removed from the text of the Standard and charts.
As part of this proposal, we would add text that makes the following points more clearly.
- Deprecated doesn't mean removed, just strongly discouraged -- one could still get them from character encoding converters, for example.
- Whenever implementations require roundtripping to legacy encodings, deprecated characters should not be transformed or filtered -- the same as with normalization.
- In other circumstances, deprecated characters should be avoided where possible. Where there is a preferred alternative, it should be used instead.
- NFC (and to a lesser degree, NFKC) effectively discourages singletons, and certain others, so they are effectively deprecated.
Table 1. Characters to be given the Deprecated property
The list of proposed characters is broken down according to their status vis-a-vis NFC and NFKC.
Allowed by both NFC and NFKC:
0953 ( ॓ ) DEVANAGARI GRAVE ACCENT
0954 ( ॔ ) DEVANAGARI ACUTE ACCENT
0F07 ( ༇ ) TIBETAN MARK YIG MGO TSHEG SHAD MA
17A4 ( ឤ ) KHMER INDEPENDENT VOWEL QAA
17D8 ( ៘ ) KHMER SIGN BEYYAL
20A4 ( ₤ ) LIRA SIGN
Allowed by NFC but not NFKC:
0149 ( ʼn ) LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F79 TIBETAN VOWEL SIGN VOCALIC LL
Allowed by neither NFC nor NFKC:
0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS
037E ( ; ) GREEK QUESTION MARK
0387 ( · ) GREEK ANO TELEIA
0F73 TIBETAN VOWEL SIGN II
0F75 TIBETAN VOWEL SIGN UU
0F81 TIBETAN VOWEL SIGN REVERSED II
2126 ( Ω ) OHM SIGN
212A ( K ) KELVIN SIGN
212B ( Å ) ANGSTROM SIGN
2329 ( 〈 ) LEFT-POINTING ANGLE BRACKET
232A ( 〉 ) RIGHT-POINTING ANGLE BRACKET
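The three groupings in Table 1 can be reproduced mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module; the function name normalization_status is illustrative and not part of the proposal. A character is treated as "allowed" by a normalization form if normalizing it leaves it unchanged.

```python
import unicodedata

def normalization_status(ch: str) -> str:
    """Report which of NFC / NFKC leave the single character ch unchanged."""
    in_nfc = unicodedata.normalize("NFC", ch) == ch
    in_nfkc = unicodedata.normalize("NFKC", ch) == ch
    if in_nfkc:
        # Text that is stable under NFKC is also stable under NFC.
        return "both"
    if in_nfc:
        return "NFC only"
    return "neither"

# A few code points taken from Table 1, one or more per group.
SAMPLES = [0x0953, 0x20A4, 0x0149, 0x0F77, 0x0344, 0x0F73, 0x2126, 0x2329]

for cp in SAMPLES:
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    print(f"{cp:04X} {name}: {normalization_status(ch)}")
```

Because this looks only at whether the character survives normalization, it says nothing about whether the Standard discourages the character; that information comes from the text and charts.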
For comparison, the Deprecated characters as of U5.1 are:
U+0340 ( ̀ ) COMBINING GRAVE TONE MARK
U+0341 ( ́ ) COMBINING ACUTE TONE MARK
U+17A3 ( ឣ ) KHMER INDEPENDENT VOWEL QAQ
U+17D3 ( ៓ ) KHMER SIGN BATHAMASAT
U+206A ( ) INHIBIT SYMMETRIC SWAPPING
U+206B ( ) ACTIVATE SYMMETRIC SWAPPING
U+206C ( ) INHIBIT ARABIC FORM SHAPING
U+206D ( ) ACTIVATE ARABIC FORM SHAPING
U+206E ( ) NATIONAL DIGIT SHAPES
U+206F ( ) NOMINAL DIGIT SHAPES
U+E0001 ( ) LANGUAGE TAG
U+E0020 ( ) TAG SPACE
...
U+E007F ( ) CANCEL TAG
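As an illustration of how an editor or checker might flag these characters, here is a minimal sketch in Python. It hard-codes the U5.1 list given above rather than reading the Deprecated property from the Unicode Character Database, and it assumes the "..." row stands for the contiguous range U+E0020..U+E007F. The names DEPRECATED_5_1 and find_deprecated are hypothetical.

```python
# Code points listed above as Deprecated as of U5.1.
# Assumption: the "..." row covers every code point from U+E0020 to U+E007F.
DEPRECATED_5_1 = (
    {0x0340, 0x0341, 0x17A3, 0x17D3, 0xE0001}
    | set(range(0x206A, 0x2070))    # U+206A..U+206F
    | set(range(0xE0020, 0xE0080))  # U+E0020..U+E007F
)

def find_deprecated(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each deprecated character found in text.

    The text itself is left untouched: deprecated characters are only
    reported, never transformed or filtered.
    """
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if ord(c) in DEPRECATED_5_1
    ]

if __name__ == "__main__":
    sample = "a\u0340b\u206a"
    for index, code in find_deprecated(sample):
        print(f"warning: deprecated character {code} at index {index}")
```

Reporting rather than rewriting keeps the sketch compatible with the point that data requiring roundtripping to legacy encodings should not be altered.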
===================================================
Table 2. Current text on 'deprecated'
We currently have the following text on deprecated:
===================================================
- p23 Characters are retained in the standard, so that previously conforming data stay conformant in future versions of the standard. Sometimes characters are deprecated -- that is, their use in new documents is discouraged. Usually, this is because the characters were found not to be needed, and their continued use would merely result in duplicate ways of encoding the same information. While implementations should continue to recognize such characters when they are encountered, spell-checkers or editors could warn users of their presence and suggest replacements.
- p24 The fact that a character can be considered a compatibility variant does not mean that the character is deprecated in the standard. The use of many compatibility variants in general interchange is unproblematic. Some, however, such as Arabic contextual forms or vertical forms, can lead to problems when used in general interchange. In identifiers, compatibility variants should be avoided because of their visual similarity with regular characters. (See Unicode Technical Report #36, "Unicode Security Considerations.")
...
For example, the deprecated alternate format characters do not have any distinct decomposition, and CJK compatibility ideographs have canonical decomposition mappings rather than compatibility decomposition mappings.
- p66 Characters may be deprecated, but this does not remove them from the standard or from existing data. The code point for a deprecated character will never be reassigned to a different character, but the use of a deprecated character is strongly discouraged. Generally these rules make the encoded characters of a new version backward-compatible with previous versions.
- p88 D13 Deprecated character: A coded character whose use is strongly discouraged. Such characters are retained in the standard, but should not be used.
• Deprecated characters are retained in the standard so that previously conforming data stay conformant in future versions of the standard. Deprecated characters should not be confused with obsolete characters, which are historical. Obsolete characters do not occur in modern text, but they are not deprecated; their use is not discouraged.
- 5.1.0 Other paired stateful controls in the standard are deprecated, and their use should be avoided....
These characters are deprecated, and should not be used—particularly with any protocols that provide alternate means of language tagging....
Table 3. Remarks from Lofting
Appended is some feedback from Peter Lofting for consideration regarding the Tibetan characters.
[0] History of deprecation
Discouragement notices have been in place in the Tibetan Unicode charts since 3.0:
(a) for di-graphs the notice: "use of this character is discouraged"
(b) for tri-graphs the notice: "use of this character is strongly discouraged"
[1] head letter
0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
I would like to know the basis for deprecating this head mark.
It exists in documents and a canonical decomposition is not possible. Why single it out?
[2] Sanskrit accents
0F73 TIBETAN VOWEL SIGN II -> canon decomp 0F71 0F72
0F75 TIBETAN VOWEL SIGN UU -> canon decomp 0F71 0F74
0F77 TIBETAN VOWEL SIGN VOCALIC RR -> compat decomp 0FB2 0F81
0F79 TIBETAN VOWEL SIGN VOCALIC LL -> compat decomp 0FB3 0F81
0F81 TIBETAN VOWEL SIGN REVERSED II -> canon decomp 0F71 0F80
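The decomposition mappings listed above can be read back from the Unicode Character Database. The following is a minimal sketch, assuming Python's standard unicodedata module; the helper name decomposition_info is illustrative. In that module, a compatibility mapping is returned with a leading tag such as <compat>, while a canonical mapping has no tag.

```python
import unicodedata

def decomposition_info(cp: int) -> tuple[str, str]:
    """Return (kind, mapping) for a code point.

    kind is "canon" for a canonical mapping, the tag name (for example
    "compat") for a compatibility mapping, or "none" if there is no mapping.
    """
    raw = unicodedata.decomposition(chr(cp))
    if not raw:
        return ("none", "")
    parts = raw.split()
    if parts[0].startswith("<"):
        # Tagged mapping, e.g. "<compat> 0FB2 0F81"
        return (parts[0].strip("<>"), " ".join(parts[1:]))
    return ("canon", raw)

for cp in (0x0F07, 0x0F73, 0x0F75, 0x0F77, 0x0F79, 0x0F81):
    kind, mapping = decomposition_info(cp)
    print(f"{cp:04X} {unicodedata.name(chr(cp))} -> {kind} {mapping}".rstrip())
```

The same call on 0F07 returns no mapping, which matches the remark that a canonical decomposition is not possible for that head mark.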
Canonical decomposition is not the only relationship that these characters have: The Tibetan double vowel marks in the list are used for representing Sanskrit transliterated into Tibetan and enable the disambiguation of such text from Tibetan contraction sequences for both shaping and semantic processing. This is an important function, and these code points are therefore not redundant.
They also map 1:1 to Sanskrit vowels in the Indic code pages. e.g. 0F73 TIBETAN VOWEL SIGN II --> 0908 DEVANAGARI LETTER II
=====
I would also add that the selection of these candidates is consistent only with their awkwardness for shaping machinery rather than with a consistent application of Occam's razor. There are other less useful, and even plain wrong, code points in the block that could much more reasonably be decomposed or deprecated, but they appear to have escaped attention because they are "well behaved" complete precomposed stacks, e.g.:
0F00 TIBETAN SYLLABLE OM --> decomp 0F68 0F7C 0F7E
This one is in as a 1:1 mapping to Devanagari OM at 0950, but can be decomposed without loss of representation. It is just a precomposed display form.
0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA --> decomp 0F60 0F74 0F7F 0F82
This and 0F03 are only two instances of an open-ended class of many Terma head marks. They only make sense in the encoding as generic place-holders for a whole set of marks that could then be represented with variant selector sequences using these two base bytes.
If a terma mark were encoded then 0F03 could also be decomposed.
Depending on scholarly input, Terma mark might be represented as a display variant of 0F82, in which case...
0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA --> decomp 0F60 0F74 0F7F <0F82 terma mark>
In the plain wrong department, the 'Digits minus half' section needs correcting. It should read "Half Digits" or some such.
There are 2 key problems:
(i) The slash divides the value into half NOT minus one half.
(ii) The slash can apply to multi-digit sequences e.g. 108<slashed> = 54, etc
As the character names stand they are not wrong, as they say HALF FIVE etc. rather than FIVE MINUS ONE HALF. The exception is 0F33 TIBETAN DIGIT HALF ZERO, which is a "divide by zero" error that gives the lie to the bad semantic definition. Depending on how this mess is cleaned up, these could be candidates for deprecation. The right way to represent these cases is with a separate combining slash mark. These half digits could then be deprecated as display forms; but again a combining slash of variable scope is awkward for both shaping and computation, and is why the corrupted definition was invented in the first place, in an effort to avoid such awkward shaping and processing requirements. I expect it will take another 5-to-10 years for other code points in the combining marks block to force this kind of mechanism into being, at which point this can be corrected "at no extra cost" to implementers.
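To make the two readings of the half digits concrete, here is a minimal sketch in Python. It compares the numeric value that the standard unicodedata module reports for each half digit with the "slash halves the value" reading described above. The helper name halved_reading is hypothetical, and the halving rule is the interpretation given in these remarks, not a property defined in the Unicode Character Database.

```python
import unicodedata

def halved_reading(value: int) -> float:
    """The reading described above: the slash divides the value by two."""
    return value / 2

# 0F2A..0F32 are TIBETAN DIGIT HALF ONE .. HALF NINE; 0F33 is HALF ZERO.
for digit in range(1, 10):
    ch = chr(0x0F2A + digit - 1)
    print(
        f"{ord(ch):04X} {unicodedata.name(ch)}: "
        f"UCD numeric value = {unicodedata.numeric(ch)}, "
        f"halved reading = {halved_reading(digit)}"
    )

half_zero = "\u0f33"
print(f"0F33 {unicodedata.name(half_zero)}: "
      f"UCD numeric value = {unicodedata.numeric(half_zero)}")

# The multi-digit case from point (ii): a slash over 108.
print(f"108 slashed = {halved_reading(108)}")
```

Under the halving reading a slash over a whole number such as 108 has a natural meaning, which a per-digit "minus one half" value cannot express.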