RE: Devanagari

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Jan 21 2002 - 08:20:17 EST


Doug Ewell wrote:
> Devanagari text encoded in SCSU occupies exactly 1 byte per
> character, plus an additional byte near the start of the
> file to set the current window (0x14 = SC4).

The problem is what happens if that very byte gets corrupted for any
reason...

If an octet is erroneously deleted from, changed in, or added to a UTF-8
stream, only a single character is corrupted. If the same thing happens to
the window-setting byte of an SCSU stream (or of other similar "zany"
formats), the whole stream turns into garbage.
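
To make the difference concrete, here is a rough Python sketch. The
"toy_scsu_*" functions are hypothetical helpers written for this note, not a
real SCSU codec: they assume single-byte mode only, handle just the SCn
window-selection tags, and use the default dynamic window offsets as I
understand them from the SCSU specification. The UTF-8 side uses Python's
built-in codec.

```python
# Default SCSU dynamic window offsets for windows 0..7 (assumed from the
# SCSU specification; window 4 starts at U+0900, Devanagari).
WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def toy_scsu_encode(text: str, window: int = 4) -> bytes:
    """Toy encoder: one SCn tag, then one byte per character."""
    base = WINDOWS[window]
    out = bytearray([0x10 + window])  # SCn tag (0x14 = SC4)
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F:
            out.append(cp)                # ASCII passes through
        elif base <= cp < base + 0x80:
            out.append(cp - base + 0x80)  # offset into the active window
        else:
            raise ValueError(f"U+{cp:04X} is outside this toy encoder")
    return bytes(out)


def toy_scsu_decode(data: bytes) -> str:
    """Toy decoder: understands SC0..SC7, ASCII and window bytes only."""
    base = WINDOWS[0]  # window 0 is active when a stream starts
    chars = []
    for b in data:
        if 0x10 <= b <= 0x17:
            base = WINDOWS[b - 0x10]      # switch the active window
        elif b >= 0x80:
            chars.append(chr(base + b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            chars.append(chr(b))
        else:
            raise ValueError(f"tag 0x{b:02X} is not handled by this toy decoder")
    return "".join(chars)


def code_points(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)


# "namaste" in Devanagari, written with escapes to keep the source ASCII.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

scsu = toy_scsu_encode(text)   # 1 tag byte + 1 byte per character
utf8 = text.encode("utf-8")    # 3 bytes per character

# Lose the window-setting byte: every character decodes from window 0.
scsu_garbled = toy_scsu_decode(scsu[1:])

# Lose one byte in the middle of the second UTF-8 character.
utf8_damaged = (utf8[:4] + utf8[5:]).decode("utf-8", errors="replace")

print("original     :", code_points(text))
print("SCSU, no SC4 :", code_points(scsu_garbled))
print("UTF-8, -1 B  :", code_points(utf8_damaged))
```

In this sketch, every character of the SCSU stream lands in the Latin-1
range once the SC4 byte is gone, while the damaged UTF-8 stream loses only
the character whose byte was dropped.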

What this means in practice for website developers is:

1) SCSU text can only be edited with a text editor which properly decodes
the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
text can also be edited using an encoding-unaware editor, although non-ASCII
text shows up as unreadable bytes.

2) SCSU text cannot be built by assembling binary pieces coming from
external sources. E.g., you cannot get an SCSU-encoded template file and fill
in the blanks with customer data coming from an SCSU-encoded database: each
time you insert a piece of text coming from the database, you lose the
current window information, turning the rest of the file into garbage. On
the other hand, UTF-8 allows this, provided that the integrity of each
multi-byte sequence is maintained.

3) An SCSU page can only be accepted by browsers and e-mail readers that are
able to decode it. On the other hand, UTF-8 also works on old ASCII-based
browsers, although non-ASCII text is, of course, not displayed properly.
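
The UTF-8 side of these three points can be sketched in a few lines of
Python. The template, the "{NAME}" placeholder and the customer string are
made up for illustration; the only thing assumed about the tools involved is
that they treat the data as opaque bytes.

```python
# A UTF-8 template with Devanagari text before and after an ASCII placeholder.
template = (
    "<h1>\u0928\u092e\u0938\u094d\u0924\u0947</h1>"
    "<p>{NAME}</p>"
    "<p>\u0927\u0928\u094d\u092f\u0935\u093e\u0926</p>"
).encode("utf-8")

# Customer data arriving as UTF-8 bytes from some other source.
customer = "\u0930\u093e\u092e".encode("utf-8")

# Point 2: a purely byte-level splice, with no decoding and no state to
# carry across the insertion point.
page = template.replace(b"{NAME}", customer)
decoded = page.decode("utf-8")

# Points 1 and 3: what an ASCII-only tool sees. Every byte of a multi-byte
# UTF-8 sequence is >= 0x80, so the ASCII markup is never disturbed.
ascii_view = bytes(b if b < 0x80 else 0x3F for b in page).decode("ascii")

print(ascii_view)
```

The non-ASCII text shows up as question marks in the ASCII view, but the
markup around it is intact, and the spliced page still decodes as valid
UTF-8.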

_ Marco


