In a message dated 2002-01-21 5:20:55 Pacific Standard Time,
marco.cimarosti@essetre.it writes:
> Doug Ewell wrote:
>> Devanagari text encoded in SCSU occupies exactly 1 byte per
>> character, plus an additional byte near the start of the
>> file to set the current window (0x14 = SC4).
>
> The problem is what happens if that very byte gets corrupted for any
> reason...
>
> If an octet is erroneously deleted from, changed in, or added to a UTF-8 stream,
> only a single character would be corrupted. If the same thing happens to the
> window-setting byte of an SCSU stream (or a similar "zany" format), the whole
> stream turns into garbage.

Yes, SCSU is stateful and the corruption of a single tag, or argument to a
tag, could potentially damage large amounts of text. I know this was a big
problem in the days of devices and transmission protocols that did little or
no error correction. I honestly don't know how big a problem it is today.
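
To make the two claims above concrete (one byte per Devanagari character
after SC4, and the damage done by a single corrupted tag), here is a minimal
Python sketch. It is not a conforming SCSU implementation: it assumes
single-byte mode only, the eight default dynamic window offsets from UTS #6,
and input limited to printable ASCII and U+0900..U+097F. The function names
are hypothetical, chosen for illustration.

```python
# Minimal SCSU sketch: single-byte mode, default dynamic windows only.
# Assumption: no quoting tags, no window redefinition, no Unicode mode.
WINDOW_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # SC0..SC7 are bytes 0x10..0x17; SCn makes dynamic window n active


def encode_devanagari(text: str) -> bytes:
    """Encode printable ASCII plus U+0900..U+097F, one byte per character."""
    out = bytearray([SC0 + 4])  # 0x14 = SC4: select the Devanagari window
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x80:
            out.append(cp)  # ASCII passes through unchanged
        elif 0x0900 <= cp < 0x0980:
            out.append(0x80 + (cp - 0x0900))  # offset within window 4
        else:
            raise ValueError(f"outside this sketch: U+{cp:04X}")
    return bytes(out)


def decode(data: bytes) -> str:
    """Decode the subset above. The active window starts at 0."""
    window = 0
    out = []
    for b in data:
        if SC0 <= b <= SC0 + 7:
            window = b - SC0  # state change: affects every later high byte
        elif b >= 0x80:
            out.append(chr(WINDOW_OFFSETS[window] + (b - 0x80)))
        else:
            out.append(chr(b))  # other control tags are ignored in this sketch
    return "".join(out)


text = "\u0928\u092e\u0938\u094d\u0924\u0947"  # six Devanagari code points
scsu = encode_devanagari(text)                 # 1 tag byte + 6 data bytes
utf8 = text.encode("utf-8")                    # 3 bytes per code point

# Corrupt the first byte of each stream.
bad_scsu = bytes([0x12]) + scsu[1:]            # SC4 -> SC2 (Cyrillic window)
bad_utf8 = b"A" + utf8[1:]                     # lead byte 0xE0 -> 0x41

scsu_damaged = sum(a != b for a, b in zip(text, decode(bad_scsu)))
utf8_result = bad_utf8.decode("utf-8", errors="replace")

print(len(scsu), len(utf8))             # 7 18
print(scsu_damaged)                     # 6: every character changed
print(utf8_result.endswith(text[1:]))   # True: only the first character is lost
```

The corrupted SCSU stream still decodes without error, just into the wrong
script, so nothing signals the damage. The corrupted UTF-8 stream loses one
character and resynchronizes at the next lead byte.
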
> What this means in practice for website developers is:
>
> 1) SCSU text can only be edited with a text editor which properly decodes
> the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
> text can also be edited using an encoding-unaware editor, although non-ASCII
> text is invisible.

I have edited SCSU text using a completely encoding-ignorant MS-DOS editor.
Of course I couldn't edit the SCSU control bytes intelligently, but then I
can't edit multibyte UTF-8 sequences intelligently with it either.

> 2) SCSU text cannot be built by assembling binary pieces coming from
> external sources. E.g., you cannot get a SCSU-encoded template file and fill
> in the blanks with customer data coming from a SCSU-encoded database: each
> time you insert a piece of text coming from the database, you delete the
> current window information, turning the rest of the file into garbage.

The current window information is not deleted; it is carried over into any
adjoining text that does not redefine it. (This could have its own
repercussions, of course.)
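
The carry-over, and the kind of repercussion it can have, can be sketched in
Python. This is a hedged illustration, not a conforming decoder: it assumes
single-byte mode, the default dynamic window offsets from UTS #6, and only
the SC0..SC7 tags. The template and database byte strings are made up for
the example.

```python
# Minimal SCSU decoding sketch: single-byte mode, default dynamic windows only.
WINDOW_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # SC0..SC7 are bytes 0x10..0x17; SCn makes dynamic window n active


def decode(data: bytes) -> str:
    """Decode ASCII, SCn tags and high bytes. The active window starts at 0."""
    window = 0
    out = []
    for b in data:
        if SC0 <= b <= SC0 + 7:
            window = b - SC0  # stays in force until the next SCn tag
        elif b >= 0x80:
            out.append(chr(WINDOW_OFFSETS[window] + (b - 0x80)))
        else:
            out.append(chr(b))
    return "".join(out)


# Hypothetical template pieces, written assuming window 0 (Latin-1 Supplement).
template_head = b"Caf\xe9: "            # 0xE9 -> U+00E9 in window 0
template_tail = b" (caf\xe9)"

# Hypothetical database field: SC4, then U+0928 U+092E from window 4.
db_field = bytes([0x14, 0xA8, 0xAE])

alone = decode(template_tail)
joined = decode(template_head + db_field + template_tail)

# One possible remedy: switch back to the window the template expects.
fixed = decode(template_head + db_field + bytes([SC0]) + template_tail)

print(alone == " (caf\u00e9)")                        # True
print(joined == "Caf\u00e9: \u0928\u092e (caf\u0969)")  # True: 0xE9 now maps to U+0969
print(fixed == "Caf\u00e9: \u0928\u092e (caf\u00e9)")   # True
```

The ASCII part of the trailing template text survives the splice; only its
bytes at or above 0x80 are reinterpreted in the window that the inserted
field left active. Appending SC0 works here only because the template is
known to rely on window 0 at the insertion point.
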
> 3) A SCSU page can only be accepted by browsers and e-mail readers that are
> able to decode it. On the other hand, UTF-8 also works on old ASCII-based
> browsers, although non-ASCII text is clearly not properly displayed.

Same as 1). If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you
have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences
that readers must know how to decode. SCSU does use states, like any
compression scheme, so an encoding-ignorant tool will probably have more
trouble with SCSU than with UTF-8. But I was not arguing to foist SCSU on an
unprepared world; I was suggesting that the world should prepare. ☺

-Doug Ewell
Fullerton, California
This archive was generated by hypermail 2.1.2 : Mon Jan 21 2002 - 11:12:03 EST