Re: New Charakter Proposal

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Fri Nov 01 2002 - 00:36:46 EST

    Kenneth Whistler wrote the following.

    >I think Marku's suggestion is correct. If you want to do
    >something like this internally to a process, use a noncharacter
    >code point for it. If you want to have visible display of this
    >kind of error handling for conversion, then simply declare a
    >convention for the use of an already existing character.
    >My suggestion would be: U+2620. ;-) Then get people to share
    >your convention.

    I find this suggestion curious, particularly coming as it does from an
    officer of the Unicode Consortium.

    The U2600.pdf code chart lists U+2620 SKULL AND CROSSBONES under "Warning
    signs" and gives "= poison" in its description.

    Suppose for example that the source document encoded in UTF-8 is a document
    about chemicals found around the house and that the U+2620 character is used
    to indicate those which are poisonous. If U+2620 is also used to include in
    visible form an indication of an error found during decoding, then finding a
    U+2620 character in the decoded document would lead to an ambiguous
    situation.

    One solution would be for the Unicode Consortium to encode an otherwise
    unused character especially for the purpose.

    If, however, the way forward is for an individual to declare a convention,
    then I suggest using a sequence of at least two characters, the first being
    a base character and the one or more others being combining characters, so
    as to produce an otherwise highly unlikely sequence of characters.

    For the combining character, U+0304 COMBINING MACRON could be a good choice,
    as it could be read as indicating a Boolean "not" condition when applied to
    a base character which is otherwise unlikely to carry an accent.

    As to which character to use for the base character, I am undecided.
    However, it should, in my opinion, not be U+2620, as that is a warning sign
    meaning poison and could lead to confusion when looking at a document.

    The advantage of a two character sequence is that a special piece of
    software may be used to parse all incoming documents. Only occurrences of
    the otherwise highly unlikely sequence will be regarded as indicating a
    conversion problem with the encoding. If either of the two characters used
    for the sequence is encountered without the rest of the sequence, then it
    will not be treated as indicating a conversion problem.
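
    As a minimal sketch of such a parser, the following Python function reports
    only complete occurrences of the two character sequence. The base character
    used here, U+25A1 WHITE SQUARE, is purely an assumption for illustration, as
    the choice of base character is left open above; the names MARKER and
    find_conversion_markers are likewise hypothetical.

```python
# Hypothetical marker sequence. Only U+0304 COMBINING MACRON comes from the
# message; the base character is an assumption made for this sketch.
BASE = "\u25A1"    # WHITE SQUARE (assumed base character)
MACRON = "\u0304"  # COMBINING MACRON
MARKER = BASE + MACRON


def find_conversion_markers(text: str) -> list[int]:
    """Return the index of every complete occurrence of MARKER in text."""
    positions = []
    start = text.find(MARKER)
    while start != -1:
        positions.append(start)
        # Continue searching after the sequence just found.
        start = text.find(MARKER, start + len(MARKER))
    return positions
```

    A lone base character, or a macron on some other base character, is not
    reported, so ordinary documents that happen to contain either character on
    its own are unaffected.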

    In my comet circumflex system I use a three character detection sequence.
    This means that, in order to enter the markup universe, all three
    characters of the sequence need to be present, in order. Thus, a piece of
    software can scan all incoming text messages, even those which are not
    designed to fit in with the comet circumflex system, and not indicate a
    comet circumflex message if, say, a U+2604 COMET character arrives on its
    own as part of a message.
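
    The scanning idea can be sketched as a small state machine that examines
    one character at a time. Only U+2604 COMET is named above; the second and
    third characters of DETECTION_SEQUENCE below are assumptions chosen for
    illustration, and the function name is hypothetical.

```python
# Assumed detection sequence. Only U+2604 COMET is named in the message; the
# two combining characters are placeholders chosen for this sketch.
DETECTION_SEQUENCE = ("\u2604", "\u0302", "\u20E3")


def contains_detection_sequence(chars) -> bool:
    """Scan characters one at a time; return True once the full sequence is seen.

    The simple restart rule below is valid because the three characters of
    the sequence are all different from one another.
    """
    matched = 0  # number of sequence characters matched so far
    for ch in chars:
        if ch == DETECTION_SEQUENCE[matched]:
            matched += 1
            if matched == len(DETECTION_SEQUENCE):
                return True
        else:
            # Restart; the current character may itself begin the sequence.
            matched = 1 if ch == DETECTION_SEQUENCE[0] else 0
    return False
```

    Because the function accepts any iterable of characters, it could be
    applied to a stream of incoming text as readily as to a complete string.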

    Using a two or three character sequence which is otherwise highly unlikely
    to occur is, in my opinion, a good way to indicate the presence of a special
    feature as it allows one to monitor all text files for the special feature
    without causing undesired responses on text files which have been prepared
    without any regard to the special feature.

    I feel that the influence of posting a suggestion in this mailing list is
    often greatly underestimated. If you do post a suggested two or three
    character sequence for the purpose that you seek, perhaps after further
    discussion in this group if you wish, my feeling is that the sequence may
    well become known and accepted for the purpose very quickly. Where there
    is a need for such a sequence then, in the absence of any good reason not
    to do so, people will often happily use the suggested format.

    William Overington

    1 November 2002


