Re: ISRI SoEuro has just been created!!

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Sep 07 2002 - 14:14:06 EDT


Robert Lloyd Wheelock wrote:

> Recently, I've created a brand-new 8-bit codepage for Windows and Mac
> systems called ISRISEO (International Symbolism Research
> Institute—Southern European), which is based on the code pages
> MS-CPW1254, ISO 8859-3, and ISO 8859-9. Very soon (I hope), you'll be
> able to input text easily in Maltese, Esperanto, ... even Azəri-Latin!

Instead of responding that the world does not need another 8-bit code
page -- which is probably true, and almost certainly true for the
languages Robert mentions, for which code pages already exist -- I'd
like to take a look at the design of Robert's ISRISEO and ask some
questions.

I consider ISRISEO to be like the experimental (some would say "joke")
UTFs that Marco Cimarosti, Shlomi Tal, I, and others have
invented. They can be interesting from a laboratory standpoint, and can
help the inventor(s) learn something about the process of creating a
good encoding form/scheme. They are not necessarily intended to replace
the established mechanisms, although I suspect the ICU team does intend
BOCU to replace SCSU. So please, don't anybody misinterpret my serious
analysis of ISRISEO as an endorsement that it should be widely adopted,
implemented in Web browsers, etc.

ISRISEO is obviously intended to be at least partly compatible with
existing Windows code pages, based on the placement of graphic
characters in the 0x80-0x9F region and the attempt at Latin-1
compatibility in the 0xA0-0xFF range (not always achieved in ISO 8859-x,
for x > 1). Note the telltale U+20AC EURO SIGN at 0x80, for instance.
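
For anyone who wants to check the two compatibility claims above against
code pages that already exist, here is a rough sketch in Python. It uses
only the codecs bundled with Python; ISRISEO itself is not among them, so
this compares the code pages it is said to be based on. The function
names are my own, invented for illustration.

```python
def latin1_mismatches(codec: str) -> list[int]:
    """Return the byte values in 0xA0-0xFF where `codec` differs from Latin-1."""
    diffs = []
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        try:
            decoded = raw.decode(codec)
        except UnicodeDecodeError:
            decoded = None  # position is undefined in this code page
        if decoded != raw.decode("latin-1"):
            diffs.append(byte)
    return diffs


def has_euro_at_0x80(codec: str) -> bool:
    """True if byte 0x80 maps to U+20AC EURO SIGN, the Windows convention."""
    try:
        return b"\x80".decode(codec) == "\u20ac"
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for codec in ("cp1252", "cp1254", "iso8859_3", "iso8859_9"):
        diffs = [hex(b) for b in latin1_mismatches(codec)]
        print(codec, "euro at 0x80:", has_euro_at_0x80(codec), "differs at:", diffs)
```

The same loop could be run over a mapping table for ISRISEO, if one were
available in machine-readable form.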

Given that, I am a bit surprised at some of the characters that have
been moved from their CP1252 locations for no obvious reason. Several
characters have been moved from the 0xA0-0xBF (Latin-1) range to the
0x80-0x9F range. The double angle quotation marks, U+00AB and U+00BB,
appear at 0xAD and 0xBD rather than at their Latin-1 positions, 0xAB and
0xBB, even though some of the target languages use these quotation marks
and might expect them in a "standard" location.

U+005E CIRCUMFLEX ACCENT is inexplicably duplicated at 0x88. Perhaps
this was an error and the intended character was U+02C6 MODIFIER LETTER
CIRCUMFLEX ACCENT. After all, other spacing accents are included in
ISRISEO. But are spacing accents really useful? This is a question I
have always had with regard to ISO 8859 parts and Windows code pages.
Wouldn't non-spacing accents be a better choice? If the target system
is incapable of rendering non-spacing marks, will spacing marks really
display the written language as its readers expect?
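
To make the distinction concrete, here is a small Python sketch using the
standard unicodedata module. It looks at the three circumflex characters
in question (spacing, modifier letter, and combining) and shows that only
the combining one actually composes with a base letter under NFC. The
dictionary name and helper are mine, purely for illustration.

```python
import unicodedata

# The three "circumflex" characters discussed above.
CIRCUMFLEXES = {
    "spacing": "\u005e",    # CIRCUMFLEX ACCENT
    "modifier": "\u02c6",   # MODIFIER LETTER CIRCUMFLEX ACCENT
    "combining": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
}


def describe(ch: str) -> tuple[str, str, int]:
    """Return (character name, general category, canonical combining class)."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)


# A non-spacing mark composes with its base letter under NFC...
composed = unicodedata.normalize("NFC", "e" + CIRCUMFLEXES["combining"])
# ...whereas a spacing accent simply sits beside it as a separate character.
side_by_side = unicodedata.normalize("NFC", "e" + CIRCUMFLEXES["spacing"])

if __name__ == "__main__":
    for label, ch in CIRCUMFLEXES.items():
        print(label, describe(ch))
    print(len(composed), len(side_by_side))
```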

I don't understand why 7 code positions would be left undefined in
ISRISEO. By adding just a few more characters, Robert could claim
support for more languages without compromising support for his target
languages. For example, French is *almost* supported, and Spanish is
*almost* supported. At the very least, the infamous generic U+00A4
CURRENCY SIGN could have been assigned to 0xA4. A far more useful
assignment, though, would be U+00A0 NO-BREAK SPACE at 0xA0.
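
A quick way to test an "almost supported" claim is to try encoding a
language's letters and see which ones fail. The sketch below does that
in Python against code pages that ship with the language; it cannot test
ISRISEO directly, since its table is not reproduced here. The sample
strings are my own short, lowercase-only selections and are an
assumption, not complete alphabets.

```python
def missing_chars(codec: str, letters: str) -> list[str]:
    """Return the characters of `letters` that `codec` cannot encode."""
    missing = []
    for ch in letters:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Illustrative samples only (assumption): lowercase accented letters and
# punctuation commonly needed for each language.
FRENCH = "àâæçèéêëîïôœùûüÿ"
SPANISH = "áéíñóúü¡¿"

if __name__ == "__main__":
    for codec in ("latin-1", "cp1252", "iso8859_3", "iso8859_9"):
        print(codec, "French:", missing_chars(codec, FRENCH),
              "Spanish:", missing_chars(codec, SPANISH))
```

An empty list means every sampled character has a code position; a short
list is what "almost supported" looks like in practice.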

I also don't understand, given the recent discussion about the Catalan
middle dot, why there are assignments for both U+00B7 MIDDLE DOT (at
0xB7) and also U+013F and U+0140, the precomposed LATIN LETTER L WITH
MIDDLE DOT characters. I would think you'd only need one or the other,
and could reassign the positions used by the unnecessary characters to
improve the language coverage.
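
The redundancy is visible in the Unicode data itself: both precomposed
letters carry a compatibility decomposition to a plain L followed by
U+00B7. Here is a minimal Python sketch, again with helper names of my
own choosing, that shows the decomposition and rewrites text to use the
two-character spelling.

```python
import unicodedata

# Map the precomposed letters to "L" or "l" followed by U+00B7 MIDDLE DOT.
L_MIDDLE_DOT = {
    0x013F: "L\u00b7",  # LATIN CAPITAL LETTER L WITH MIDDLE DOT
    0x0140: "l\u00b7",  # LATIN SMALL LETTER L WITH MIDDLE DOT
}


def expand_l_middle_dot(text: str) -> str:
    """Replace U+013F/U+0140 with a plain L followed by U+00B7."""
    return text.translate(L_MIDDLE_DOT)


if __name__ == "__main__":
    for cp in L_MIDDLE_DOT:
        # Prints the raw decomposition field from the Unicode database.
        print(f"U+{cp:04X}", unicodedata.decomposition(chr(cp)))
    print(expand_l_middle_dot("para\u0140lel"))
```

A translate table is used rather than blanket NFKD normalization so that
other accented letters in the text are left untouched.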

Just my 2₥,

-Doug Ewell
 Fullerton, California


