From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Oct 30 2002 - 09:46:21 EST
We had thought of something similar, but one that would provide more
information in interfaces.
Reserve a space of 256 code points, with names:
UNCONVERTIBLE BYTE-00
UNCONVERTIBLE BYTE-01
...
UNCONVERTIBLE BYTE-FF
During a conversion process, if some bytes (say from corrupt UTF-8) cannot
be correctly converted into code points, then a sequence of the above is
generated, one code point per offending byte. This doesn't preserve the
original text -- you would never convert back from these code points to
anything; it is only useful ephemerally, in the process of doing a
conversion where something goes wrong. It is really just a slightly more
verbose U+FFFD REPLACEMENT CHARACTER, but would be handy in certain
conversion APIs, especially in single-code-point-at-a-time APIs like
getChar().
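
As a rough illustration of the idea, here is a minimal Python sketch of a
decoder that emits one stand-in code point per unconvertible byte. Since no
such block exists in Unicode, the base U+F700 (a Private Use Area range) is
purely an assumption, as are the names UNCONVERTIBLE_BASE,
unconvertible_bytes, decode_utf8 and describe; only codecs.register_error
and bytes.decode are real library calls.

```python
import codecs

# Assumption: a Private Use Area range stands in for the proposed
# UNCONVERTIBLE BYTE-00 .. UNCONVERTIBLE BYTE-FF block, which does not exist.
UNCONVERTIBLE_BASE = 0xF700


def unconvertible_bytes(err):
    """Error handler: map each undecodable byte to UNCONVERTIBLE_BASE + byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join(chr(UNCONVERTIBLE_BASE + b) for b in bad)
    # Resume decoding right after the bytes that were reported as bad.
    return replacement, err.end


codecs.register_error("unconvertible", unconvertible_bytes)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, marking bad bytes instead of raising or using U+FFFD."""
    return data.decode("utf-8", errors="unconvertible")


def describe(ch: str) -> str:
    """Return the proposed name for a stand-in code point, else the character."""
    cp = ord(ch)
    if UNCONVERTIBLE_BASE <= cp <= UNCONVERTIBLE_BASE + 0xFF:
        return "UNCONVERTIBLE BYTE-%02X" % (cp - UNCONVERTIBLE_BASE)
    return ch
```

With this sketch, decode_utf8(b"a\xffb") yields "a", the stand-in for
UNCONVERTIBLE BYTE-FF, and "b". The result is diagnostic only; nothing here
converts the stand-ins back to bytes.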
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: "Dominikus Scherkl" <Dominikus.Scherkl@glueckkanja.com>
To: <unicode@unicode.org>
Sent: Wednesday, October 30, 2002 03:49
Subject: New Charakter Proposal
> Hello.
>
> I would like to have a "source failure indicator symbol" (SFIS)
> character in Unicode, which a charset-conversion unit may
> insert into a text (suggested position: U+FFF8).
>
> Reason:
> Several charsets have undefined code points which were
> defined in a former or later version (e.g. overlong
> UTF-8 encodings or the $ symbol (0x24) in the INVARIANT
> charset).
>
> A converter can replace such symbols with U+FFFD (which is
> correct but loses the information), or simply use the
> character which is most likely intended (which hides the error).
> Neither is very good.
>
> The SFIS would allow the reader to see that an error occurred
> and that the following character may therefore be incorrect, while
> maintaining readability if the right conversion was made anyway
> (or at least giving a hint as to which character may be intended;
> e.g. the $ sign could have been any other currency symbol
> if a national 7-bit charset was changed to INVARIANT by
> previous conversions).
>
> Of course a converter can still use U+FFFD if it has no
> idea which character is intended or if Unicode doesn't contain
> the character.
>
>
> The whole "character identities" discussion gave me another
> reason to introduce such an SFIS character:
> A font renderer may show the SFIS before a character that
> stands in for another one because the correct one is not
> contained in the font (e.g. it may render an "a with
> superscript e above" as SFIS + "a umlaut" to indicate the
> error and show a probably fitting replacement, which is
> much better than showing an empty square).
> In short:
> The SFIS may indicate a kind of compatibility decomposition
> of the following character
> (not necessarily the standard compatibility decomposition).
>
> I'd like to hear if my suggestion is completely weird or
> if anybody else thinks it might be useful.
>
> Best Regards.
> --
> Dominikus Scherkl
> dominikus.scherkl@glueckkanja.com
>
>