From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Fri Nov 01 2002 - 00:36:46 EST
Kenneth Whistler wrote the following.
>I think Markus's suggestion is correct. If you want to do
>something like this internally to a process, use a noncharacter
>code point for it. If you want to have visible display of this
>kind of error handling for conversion, then simply declare a
>convention for the use of an already existing character.
>My suggestion would be: U+2620. ;-) Then get people to share
>your convention.
I find this suggestion curious, particularly coming as it does from an
officer of the Unicode Consortium.
The U2600.pdf code chart lists U+2620 SKULL AND CROSSBONES under Warning
signs and gives "= poison" in its description.
Suppose for example that the source document encoded in UTF-8 is a document
about chemicals found around the house and that the U+2620 character is used
to indicate those which are poisonous. If U+2620 is also used to include in
visible form an indication of an error found during decoding, then finding a
U+2620 character in the decoded document would lead to an ambiguous
situation.
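
The ambiguity can be reproduced in a few lines. This is only a sketch: the
handler name "skull" and the sample bytes are my own assumptions, made up for
illustration, and are not part of any existing convention.

```python
import codecs

SKULL = "\u2620"  # U+2620 SKULL AND CROSSBONES


def _skull_on_error(exc):
    # Hypothetical error handler: replace each undecodable byte run with
    # U+2620 and resume decoding just after it.
    if isinstance(exc, UnicodeDecodeError):
        return (SKULL, exc.end)
    raise exc


# "skull" is an assumed handler name, registered only for this sketch.
codecs.register_error("skull", _skull_on_error)

# A household-chemicals note that legitimately contains U+2620, followed by
# one stray byte (0xFF) that is not valid UTF-8.
raw = "bleach ".encode("utf-8") + SKULL.encode("utf-8") + b" ammonia \xff"
decoded = raw.decode("utf-8", errors="skull")

# The decoded text now holds two U+2620 characters, and nothing in the text
# itself says which one was in the source and which one marks an error.
skull_count = decoded.count(SKULL)
```

After decoding, a reader who sees only the text cannot tell the author's
poison warning from the decoder's error mark.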
One solution would be for the Unicode Consortium to encode an otherwise
unused character especially for the purpose.
If, however, the way forward is for an individual to declare a convention,
then I suggest using a sequence of at least two characters, the first being a
base character and the others being combining characters, so as to produce
an otherwise highly unlikely sequence of characters.
For example, the character U+0304 COMBINING MACRON could be a good choice,
as it could be used to indicate a Boolean "not" condition with a character
which is otherwise unlikely to carry an accent.
As to which character to use for the base character, I am undecided.
However, it should not, in my opinion, be U+2620, as that is a warning sign
meaning poison and could lead to confusion when looking at a document.
The advantage of a two character sequence is that a special piece of
software may be used to parse all incoming documents. Only occurrences of
the otherwise highly unlikely sequence will be regarded as indicating a
conversion problem with the encoding. If either of the two characters used
for the sequence is encountered other than with the rest of the sequence,
then it will not indicate the special effect.
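
A minimal sketch of such a parser follows. The base character is an open
question, so U+203B REFERENCE MARK below is purely a placeholder assumption,
as are the names BASE, MARKER and find_conversion_errors.

```python
import re

BASE = "\u203B"    # placeholder base character (assumption, not a proposal)
MACRON = "\u0304"  # U+0304 COMBINING MACRON
MARKER = BASE + MACRON

_MARKER_RE = re.compile(re.escape(MARKER))


def find_conversion_errors(text):
    """Return the index of every full base + macron sequence in text.

    A lone base character, or a macron on some other base, is not reported.
    """
    return [match.start() for match in _MARKER_RE.finditer(text)]
```

For example, find_conversion_errors("ok " + MARKER + " end") reports index 3,
whereas a text containing only the base character, or "a" followed by the
macron, reports nothing.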
In my comet circumflex system I use a three character detection sequence.
This means that, in order to enter the markup universe, all three
characters of the sequence need to be present in sequence. Thus, a piece of
software can scan all incoming text messages, even those which are not
designed to fit in with the comet circumflex system, and will not report a
comet circumflex message if, say, a lone U+2604 COMET character arrives as
part of a message.
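
The scanning idea can be sketched as a small state machine that reads one
character at a time. Only U+2604 COMET is taken from the text above; the
second and third characters below are placeholders that I have assumed for
illustration and are not necessarily the actual comet circumflex sequence.
The function name scan is likewise hypothetical.

```python
# Placeholder three-character detection sequence (assumption): U+2604 COMET,
# then U+0302 COMBINING CIRCUMFLEX ACCENT, then U+20E3 COMBINING ENCLOSING
# KEYCAP.
DETECTION_SEQUENCE = ("\u2604", "\u0302", "\u20E3")


def scan(chars, sequence=DETECTION_SEQUENCE):
    """Yield the index of the character completing each full sequence.

    This simple reset logic is only correct when the characters of the
    sequence are all distinct, so that case is enforced here.
    """
    if len(set(sequence)) != len(sequence):
        raise ValueError("sequence characters must be distinct")
    matched = 0  # how many characters of the sequence have been seen so far
    for index, ch in enumerate(chars):
        if ch == sequence[matched]:
            matched += 1
            if matched == len(sequence):
                yield index
                matched = 0
        else:
            # Start again; the mismatching character may itself begin a
            # new occurrence of the sequence.
            matched = 1 if ch == sequence[0] else 0
```

A message containing a lone U+2604 yields nothing from list(scan(message)),
so ordinary text that happens to mention a comet does not enter the markup
universe.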
Using a two or three character sequence which is otherwise highly unlikely
to occur is, in my opinion, a good way to indicate the presence of a special
feature as it allows one to monitor all text files for the special feature
without causing undesired responses on text files which have been prepared
without any regard to the special feature.
I feel that the influence of posting a suggestion in this mailing list is
often greatly underestimated. If you do post a suggested two or three
character sequence for the purpose that you seek (perhaps, if you wish,
after further discussion in this group), my feeling is that the sequence may
well become known and accepted for the purpose very quickly. Where there is
a need for such a sequence and no good reason not to use it, people will
often happily adopt the suggested format.
William Overington
1 November 2002