Deprecated Character Proposal

Date: Tue, 13 May 2008
Source: Mark Davis
Subject: Deprecated character proposal

=====

Peter and I had the following action item:

B.14.2 Deprecated characters [Davis, L2/08-018]

[114-A106] Action Item for Mark Davis, Peter Edberg: Prepare a proposal on deprecated characters for the next UTC meeting. See L2/08-018.

The proposal is to post the following (with any amendments from the meeting) plus Table 1 as a PRI (that is, excluding Tables 2 and 3).

====

The characters listed in Table 1 below are discouraged or strongly discouraged in the Unicode Standard, either in the text or in the charts. The mechanism for making that status known to implementers is by giving them the property Deprecated, so the proposal is to add them to the set of characters with that property. If after discussion and review, any of these should not be given the property Deprecated, then the phrasing about their being discouraged should be removed from the text of the Standard and charts.

As part of this proposal, we would add text that makes the following points more clearly.

Deprecated doesn't mean removed, just strongly discouraged -- deprecated characters can still be produced by character encoding converters, for example.
Whenever implementations require roundtripping to legacy encodings, deprecated characters should not be transformed or filtered -- the same as with normalization.
In other circumstances, deprecated characters should be avoided where possible. Where there is a preferred alternative, it should be used instead.
NFC (and to a lesser degree, NFKC) effectively discourages singletons, and certain others, so they are effectively deprecated.

Table 1. Characters to be given the Deprecated property

The list of proposed characters is broken down according to their status vis-a-vis NFC and NFKC.

Allowed by both NFC and NFKC:

0953 ( ॓ ) DEVANAGARI GRAVE ACCENT
0954 ( ॔ ) DEVANAGARI ACUTE ACCENT
0F07 ( ༇ ) TIBETAN MARK YIG MGO TSHEG SHAD MA
17A4 ( ឤ ) KHMER INDEPENDENT VOWEL QAA
17D8 ( ៘ ) KHMER SIGN BEYYAL
20A4 ( ₤ ) LIRA SIGN

Allowed by NFC but not NFKC:

0149 ( ŉ ) LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F79 TIBETAN VOWEL SIGN VOCALIC LL

Allowed by neither NFC nor NFKC:

0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS
037E ( ; ) GREEK QUESTION MARK
0387 ( · ) GREEK ANO TELEIA
0F73 TIBETAN VOWEL SIGN II
0F75 TIBETAN VOWEL SIGN UU
0F81 TIBETAN VOWEL SIGN REVERSED II
2126 ( Ω ) OHM SIGN
212A ( K ) KELVIN SIGN
212B ( Å ) ANGSTROM SIGN
2329 ( 〈 ) LEFT-POINTING ANGLE BRACKET
232A ( 〉 ) RIGHT-POINTING ANGLE BRACKET
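The three groupings in Table 1 can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module and treating a character as "allowed" by a normalization form when normalizing the isolated character leaves it unchanged. The code points are the Unicode Character Database values for the names listed above; the names TABLE_1, normalization_status, and check_table are illustrative only and are not part of the proposal.

```python
import unicodedata

# Code points for the Table 1 characters, keyed by the group the proposal puts them in.
TABLE_1 = {
    "both": [0x0953, 0x0954, 0x0F07, 0x17A4, 0x17D8, 0x20A4],
    "NFC only": [0x0149, 0x0F77, 0x0F79],
    "neither": [0x0344, 0x037E, 0x0387, 0x0F73, 0x0F75, 0x0F81,
                0x2126, 0x212A, 0x212B, 0x2329, 0x232A],
}


def normalization_status(cp: int) -> str:
    """Classify one code point by which forms leave it unchanged."""
    ch = chr(cp)
    nfc_ok = unicodedata.normalize("NFC", ch) == ch
    nfkc_ok = unicodedata.normalize("NFKC", ch) == ch
    if nfc_ok and nfkc_ok:
        return "both"
    if nfc_ok:
        return "NFC only"
    # A character changed by NFC is also changed by NFKC.
    return "neither"


def check_table() -> list:
    """Return (code point, expected group, actual group) for every mismatch."""
    mismatches = []
    for expected, code_points in TABLE_1.items():
        for cp in code_points:
            actual = normalization_status(cp)
            if actual != expected:
                mismatches.append((f"{cp:04X}", expected, actual))
    return mismatches


if __name__ == "__main__":
    for row in check_table():
        print("mismatch:", row)
```

An empty result from check_table() means the grouping in the table agrees with the normalization data shipped with the Python runtime in use.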

For comparison, the Deprecated characters as of U5.1 are:

U+0340 ( ̀ ) COMBINING GRAVE TONE MARK
U+0341 ( ́ ) COMBINING ACUTE TONE MARK
U+17A3 ( ឣ ) KHMER INDEPENDENT VOWEL QAQ
U+17D3 ( ៓ ) KHMER SIGN BATHAMASAT
U+206A ( ) INHIBIT SYMMETRIC SWAPPING
U+206B ( ) ACTIVATE SYMMETRIC SWAPPING
U+206C ( ) INHIBIT ARABIC FORM SHAPING
U+206D ( ) ACTIVATE ARABIC FORM SHAPING
U+206E ( ) NATIONAL DIGIT SHAPES
U+206F ( ) NOMINAL DIGIT SHAPES
U+E0001 ( ) LANGUAGE TAG
U+E0020 ( ) TAG SPACE
...
U+E007F ( ) CANCEL TAG
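As an illustration of the "recognize but warn" behavior, the following sketch flags deprecated characters in a string without transforming or filtering them. It hard-codes the U5.1 list above and assumes that the elided entries between U+E0020 and U+E007F form one contiguous range. The names DEPRECATED_5_1 and find_deprecated are hypothetical; Python's unicodedata module does not expose the Deprecated property, so a real implementation would more likely read it from PropList.txt.

```python
# U5.1 Deprecated set as listed above.
# Assumption: U+E0020..U+E007F is a contiguous range (the list elides the middle).
DEPRECATED_5_1 = frozenset(
    [0x0340, 0x0341, 0x17A3, 0x17D3]
    + list(range(0x206A, 0x2070))    # U+206A..U+206F
    + [0xE0001]
    + list(range(0xE0020, 0xE0080))  # U+E0020..U+E007F
)


def find_deprecated(text: str) -> list:
    """Return (index, 'U+XXXX') for each deprecated character; text is not modified."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in DEPRECATED_5_1
    ]


if __name__ == "__main__":
    sample = "a\u0340b\u206Ac"
    for index, code_point in find_deprecated(sample):
        print(f"warning: deprecated character {code_point} at index {index}")
```

The function only reports positions, so data that must round-trip to a legacy encoding passes through untouched.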

===================================================

Table 2. Current text on 'deprecated'

We currently have the following text on deprecated:

p23 Characters are retained in the standard, so that previously conforming data stay conformant in future versions of the standard. Sometimes characters are deprecated -- that is, their use in new documents is discouraged. Usually, this is because the characters were found not to be needed, and their continued use would merely result in duplicate ways of encoding the same information. While implementations should continue to recognize such characters when they are encountered, spell-checkers or editors could warn users of their presence and suggest replacements.

p24 The fact that a character can be considered a compatibility variant does not mean that the character is deprecated in the standard. The use of many compatibility variants in general interchange is unproblematic. Some, however, such as Arabic contextual forms or vertical forms, can lead to problems when used in general interchange. In identifiers, compatibility variants should be avoided because of their visual similarity with regular characters. (See Unicode Technical Report #36, "Unicode Security Considerations.")
...
For example, the deprecated alternate format characters do not have any distinct decomposition, and CJK compatibility ideographs have canonical decomposition mappings rather than compatibility decomposition mappings.

p66 Characters may be deprecated, but this does not remove them from the standard or from existing data. The code point for a deprecated character will never be reassigned to a different character, but the use of a deprecated character is strongly discouraged. Generally these rules make the encoded characters of a new version backward-compatible with previous versions.

p88 D13 Deprecated character: A coded character whose use is strongly discouraged. Such characters are retained in the standard, but should not be used.
• Deprecated characters are retained in the standard so that previously conforming data stay conformant in future versions of the standard. Deprecated characters should not be confused with obsolete characters, which are historical. Obsolete characters do not occur in modern text, but they are not deprecated; their use is not discouraged.

5.1.0 Other paired stateful controls in the standard are deprecated, and their use should be avoided....
These characters are deprecated, and should not be used -- particularly with any protocols that provide alternate means of language tagging....

===================================================

Table 3. Remarks from Lofting

Appended is some feedback from Peter Lofting for consideration regarding the Tibetan characters.

[0] History of deprecation

Discouragement notices have been in place in the Tibetan Unicode charts since 3.0

(a) for di-graphs the notice: "use of this character is discouraged"

(b) for tri-graphs the notice: "use of this character is strongly discouraged"

[1] head letter

0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA

I would like to know the basis for deprecating this head mark.

It exists in documents and a canonical decomposition is not possible. Why single it out?

[2] Sanskrit accents
0F73 TIBETAN VOWEL SIGN II -> canon decomp 0F71 0F72
0F75 TIBETAN VOWEL SIGN UU -> canon decomp 0F71 0F74
0F77 TIBETAN VOWEL SIGN VOCALIC RR -> compat decomp 0FB2 0F81
0F79 TIBETAN VOWEL SIGN VOCALIC LL -> compat decomp 0FB3 0F81
0F81 TIBETAN VOWEL SIGN REVERSED II -> canon decomp 0F71 0F80

Canonical decomposition is not the only relationship that these characters have: The Tibetan double vowel marks in the list are used for representing Sanskrit transliterated into Tibetan and enable the disambiguation of such text from Tibetan contraction sequences for both shaping and semantic processing. This is an important function and these code points are not therefore redundant.

They also map 1:1 to Sanskrit vowels in the Indic code pages. e.g. 0F73 TIBETAN VOWEL SIGN II --> 0908 DEVANAGARI LETTER II
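The decomposition mappings listed under [2] can be confirmed against the Unicode Character Database. This is a small sketch, assuming Python's standard unicodedata module; CLAIMED, parse_decomposition, and mismatches are illustrative names. It checks only the mappings themselves and says nothing about the Sanskrit transliteration argument.

```python
import unicodedata

# Mappings as listed in the remarks: code point -> (kind, decomposition).
CLAIMED = {
    0x0F73: ("canon", [0x0F71, 0x0F72]),
    0x0F75: ("canon", [0x0F71, 0x0F74]),
    0x0F77: ("compat", [0x0FB2, 0x0F81]),
    0x0F79: ("compat", [0x0FB3, 0x0F81]),
    0x0F81: ("canon", [0x0F71, 0x0F80]),
}


def parse_decomposition(cp: int) -> tuple:
    """Return (kind, code points) where kind is 'canon', 'compat', or 'none'."""
    fields = unicodedata.decomposition(chr(cp)).split()
    if not fields:
        return ("none", [])
    kind = "canon"
    if fields[0].startswith("<"):
        # A leading tag such as <compat> marks a compatibility mapping.
        kind = "compat"
        fields = fields[1:]
    return (kind, [int(field, 16) for field in fields])


def mismatches() -> list:
    """Return the code points whose UCD mapping differs from the claimed one."""
    return [cp for cp, claim in CLAIMED.items() if parse_decomposition(cp) != claim]


if __name__ == "__main__":
    for cp in CLAIMED:
        kind, parts = parse_decomposition(cp)
        print(f"{cp:04X} -> {kind} {' '.join(f'{p:04X}' for p in parts)}")
```

The same helper reports 0F07 as having no decomposition, which matches the remark under [1].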

=====

I would also add that the selection of these candidates is consistent only with their awkwardness for shaping machinery rather than with a consistent application of Occam's razor. There are other less useful, and even plain wrong, code points in the block that could much more reasonably be decomposed or deprecated, but they appear to have escaped attention because they are "well behaved" complete precomposed stacks. e.g.

0F00 TIBETAN SYLLABLE OM --> decomp 0F68 0F7C 0F7E

This one is in as a 1:1 mapping to Devanagari OM at 0950, but can be decomposed without loss of representation. It is just a precomposed display form.

0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA --> decomp 0F60 0F74 0F7F 0F82

This and 0F03 are only two instances of an open-ended class of many Terma head marks. They only make sense in the encoding as generic place-holders for a whole set of marks that could then be represented with variant selector sequences using these two base bytes.

If a terma mark were encoded then 0F03 could also be decomposed.

Depending on scholarly input, Terma mark might be represented as a display variant of 0F82, in which case...

0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA --> decomp 0F60 0F74 0F7F <0F82 terma mark>

In the plain wrong department, the 'Digits minus half' section needs correcting. It should read "Half Digits" or some such.

There are 2 key problems:

(i) The slash divides the value into half NOT minus one half.

(ii) The slash can apply to multi-digit sequences e.g. 108<slashed> = 54, etc

As the character names stand they are not wrong, as they say HALF FIVE etc. rather than FIVE MINUS ONE HALF. The exception is 0F33 TIBETAN DIGIT HALF ZERO, which is a "divide by zero" error that gives the lie to the bad semantic definition. Depending on how this mess is cleaned up, these could be candidates for deprecation. The right way to represent these cases is with a separate combining slash mark. These half digits could then be deprecated as display forms; but again a combining slash of variable scope is awkward for both shaping and computation, which is why the corrupted definition was invented in the first place, in an effort to avoid such awkward shaping and processing requirements. I expect it will take another 5-to-10 years for other code points in the combining marks block to force this kind of mechanism into being, at which point this can be corrected "at no extra cost" to implementers.
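The two readings of the half digits can be compared numerically. The sketch below assumes Python's standard unicodedata module, whose numeric values for 0F2A..0F33 follow the "digit minus one half" definition, and sets them beside the "divide by two" reading described in (i) and (ii). The names HALF_DIGITS, ucd_value, halved_value, and halve_sequence are illustrative only.

```python
import unicodedata
from fractions import Fraction

# 0F2A..0F32 are HALF ONE..HALF NINE; 0F33 is HALF ZERO.
HALF_DIGITS = {0x0F2A + i: (i + 1) % 10 for i in range(10)}


def ucd_value(cp: int) -> Fraction:
    """Numeric value recorded in the Unicode Character Database."""
    return Fraction(unicodedata.numeric(chr(cp)))


def halved_value(cp: int) -> Fraction:
    """Value under the reading that the slash halves the digit."""
    return Fraction(HALF_DIGITS[cp], 2)


def halve_sequence(digits: str) -> Fraction:
    """Halving applied to a multi-digit sequence, as in point (ii)."""
    return Fraction(int(digits), 2)


if __name__ == "__main__":
    for cp, digit in HALF_DIGITS.items():
        print(f"{cp:04X} digit {digit}: UCD {ucd_value(cp)}, halved {halved_value(cp)}")
    print("108 slashed:", halve_sequence("108"))
```

The two readings agree only for HALF ONE. For HALF ZERO the database value is negative, while the halving reading gives zero.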