L2/01-180

From: Markus Scherer [markus.scherer@jtcsv.com]

Sent: Tuesday, May 01, 2001 9:23 PM

Agenda: 3 proposals to extend the charset XML format

I sent 3 proposals to the mapping list last week, seeking additions to the charset XML format.

The goal is to be able to describe encodings that are currently not within the scope of the format, especially IBM and ISO 2022 encodings.

The only feedback so far has been one email from Martin Hosken with questions.

I am appending the 3 emails here.

Subject: proposal for charset xml: subchar1

Subject: proposal: stateful (SI/SO) encoding - single xml mapping file

Subject: proposal for charset xml: iso-2022

Thanks,

markus

================================================================================

Subject: proposal for charset xml: subchar1

Date: Thu, 26 Apr 2001 15:12:24 -0700

Dear fellow charset mappers,

I would like to propose additional attributes and elements for the charset XML format to support an "interesting" feature in IBM conversion tables:

IBM mapping tables for multibyte codepages define an additional, alternate codepage substitution character which is always a single-byte code. In this case, the regular substitution character is always a double-byte code.

These mapping tables then also list in the mapping section which unassigned code points should map to this alternate subchar1 instead of to the regular subchar.

This is not used in IBM mapping tables for single-byte codepages.

** Proposal:

1. To add an optional attribute "sub1" to the <assignments> element.

2. To add an additional element <sub1> as a sub-element for <assignments>.

The <sub1> element should have "u", "c", and "v" attributes like <fub>.

In DTD:

<!ELEMENT assignments (a*, fub*, fbu*, sub1*, range*)>

<!ATTLIST assignments

sub NMTOKENS "1A"

sub1 NMTOKEN #IMPLIED>

<!ELEMENT sub1 EMPTY>

<!ATTLIST sub1

u NMTOKENS #REQUIRED

c CDATA #IMPLIED

v CDATA #IMPLIED>
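
To illustrate how the proposed attribute and element might be consumed, here is a minimal Python sketch. The XML fragment is hypothetical: it borrows the subchar, subchar1, and two mappings from the ibm-1363 excerpt quoted later in this proposal, and it assumes that <a> carries "b" and "u" attributes and that each "u" holds a single code point. The function name read_sub1 is an illustration only, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical <assignments> fragment using the proposed sub1 attribute/element.
FRAGMENT = """
<assignments sub="A1 E0" sub1="7F">
  <a b="A2 AE" u="00A1"/>
  <sub1 u="0080"/>
  <sub1 u="0081"/>
</assignments>
"""


def read_sub1(xml_text):
    """Return (subchar, subchar1, code points that map to subchar1)."""
    root = ET.fromstring(xml_text)
    sub = bytes.fromhex(root.get("sub", "1A"))      # DTD default is "1A"
    sub1_attr = root.get("sub1")                    # optional (#IMPLIED)
    sub1 = bytes.fromhex(sub1_attr) if sub1_attr else None
    # Assumption: one code point per <sub1 u="..."> in this toy fragment.
    points = {int(e.get("u"), 16) for e in root.findall("sub1")}
    return sub, sub1, points


if __name__ == "__main__":
    print(read_sub1(FRAGMENT))
```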

** Usage of subchar1 in IBM mapping tables and converters:

The idea is, I think, that characters are "wide" or "narrow". In legacy codepages, this is identified with the codes being single-byte or double-byte codes.

In mappings between two legacy codepages:

When a wide (double-byte) character is unassigned, it results in a double-byte subchar. When a narrow (single-byte) character is unassigned, it results in a single-byte subchar1.

This is emulated in Unicode<->codepage mapping tables by

- declaring the additional subchar1,

- adding one-way mappings from Unicode to the codepage subchar1 where desired for "narrow" characters, and

- using U+001a as a "Unicode subchar1".

Typically, all unassigned Latin-1 characters (Unicode<=U+00ff) have subchar1 mappings, but also some other code points do.

Examples from ibm-1363_P11B-2000.ucm (Korean),

subchar1 mappings are marked with "|2":

...

<subchar> \xA1\xE0

<subchar1> \x7F

...

CHARMAP

...

<U0080> \x7F |2

<U0081> \x7F |2

<U0082> \x7F |2

...

<U009F> \x7F |2

<U00A0> \x7F |2

<U00A1> \xA2\xAE |0

<U00A2> \x7F |2

<U00A3> \x7F |2

<U00A4> \xA2\xB4 |0

<U00A5> \x7F |2

...

<U00FD> \x7F |2

<U00FE> \xA9\xAD |0

<U00FF> \x7F |2

...

<U203E> \x7F |2

...

<UFFA0> \x7F |2

<UFFA1> \x7F |2

<UFFA2> \x7F |2

...

END CHARMAP
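
As a rough illustration of how such CHARMAP lines can be read, here is a minimal Python sketch that extracts the code point, the byte sequence, and the "|n" flag from each mapping line. The function name parse_charmap is hypothetical, and the sketch handles only the single-code-point line shape shown in the excerpt above. It skips anything that does not look like a mapping line.

```python
import re

# One mapping line: <Uxxxx> \xHH[\xHH...] |flag
_LINE = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")
_BYTE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_charmap(lines):
    """Yield (code_point, codepage_bytes, flag) for each mapping line."""
    for line in lines:
        m = _LINE.match(line.strip())
        if m is None:
            continue  # header lines, "...", CHARMAP / END CHARMAP
        code_point = int(m.group(1), 16)
        data = bytes(int(h, 16) for h in _BYTE.findall(m.group(2)))
        yield code_point, data, int(m.group(3))


if __name__ == "__main__":
    sample = [
        "CHARMAP",
        r"<U00A0> \x7F |2",
        r"<U00A1> \xA2\xAE |0",
        "END CHARMAP",
    ]
    for cp, data, flag in parse_charmap(sample):
        kind = "subchar1" if flag == 2 else "other"
        print(f"U+{cp:04X} -> {data.hex()} ({kind})")
```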

* This means that

when one converts from Unicode to such a codepage and finds an unassigned code point, then

- if a subchar1 mapping is defined, output that

- otherwise output the regular subchar

when one converts from such a codepage to Unicode and finds an unassigned code, then

- if the input sequence is of length 1 _and_ a subchar1 is specified for the codepage, output U+001a

- otherwise output U+fffd
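
The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the dictionary/set table layout are hypothetical, fallback mappings are ignored, and the sample values come from the ibm-1363 excerpt above.

```python
SUB_UNICODE = "\ufffd"   # regular Unicode substitution character
SUB1_UNICODE = "\u001a"  # "Unicode subchar1"


def from_unicode(cp, roundtrip, sub1_points, subchar, subchar1=None):
    """Map one Unicode code point to codepage bytes, substituting if needed."""
    if cp in roundtrip:
        return roundtrip[cp]
    # Unassigned: use subchar1 only where the table lists a subchar1 mapping.
    if subchar1 is not None and cp in sub1_points:
        return subchar1
    return subchar


def to_unicode(seq, table, subchar1=None):
    """Map one codepage byte sequence to Unicode, substituting if needed."""
    if seq in table:
        return table[seq]
    # Unassigned single byte in a codepage that declares a subchar1.
    if len(seq) == 1 and subchar1 is not None:
        return SUB1_UNICODE
    return SUB_UNICODE


if __name__ == "__main__":
    roundtrip = {0x00A1: b"\xa2\xae"}  # <U00A1> \xA2\xAE |0
    sub1_points = {0x0080, 0x00A0}     # lines marked |2
    print(from_unicode(0x0080, roundtrip, sub1_points, b"\xa1\xe0", b"\x7f"))
    print(repr(to_unicode(b"\x80", {}, b"\x7f")))
```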

Many IBM converters do not seem to distinguish between roundtrip, fallback, and subchar[1] mappings; they just have the desired default results in the Unicode->codepage runtime tables.

ICU currently implements much but not all of this. (I am opening a feature request for what is missing.)

** This is not an ICU invention. It is a long-standing feature of IBM Unicode conversion tables. I am proposing it for the charset XML so that an XML file can represent this feature in IBM Unicode conversion tables.

I count 58 out of 343 IBM Unicode conversion tables with subchar1 specifications and explicit ("|2") subchar1 mappings.

Sincerely,

markus

-------------------------------------------------------------------------------

Subject: proposal: stateful (SI/SO) encoding - single xml mapping file

Date: Fri, 27 Apr 2001 14:02:34 -0700

Dear fellow legacy encoding victims,

This is another proposal to extend the expressiveness of the charset XML format to what common IBM Unicode conversion tables use.

IBM uses many encodings that are stateful. To be more precise, the EBCDIC multibyte encodings all use exactly two states and change between them with SI and SO control codes. There are a few ASCII-based SI/SO encodings as well. (As it happens, the byte values for SI and SO are the same in EBCDIC and ASCII!)

Such stateful encodings are announced and tracked with a single CCSID (IBM encoding ID) and are listed in the repository with one single mapping table that lists mappings for both states together. The mappings are implicitly (and at runtime) distinguished by their numbers of bytes per character:

1 in the initial state, and 2 in the other state.

(The double-byte lead byte ranges overlap a lot with the single-byte codes.)
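
To make the runtime distinction concrete, here is a minimal Python sketch that splits a SI/SO-stateful byte stream into per-character byte sequences, using 1 byte per character in the initial state and 2 in the other. The function name split_siso is hypothetical. The sketch assumes that SO (0x0E) enters the double-byte state and SI (0x0F) returns to the initial state, and that neither value occurs as the first byte of a double-byte character. It does not look anything up in a mapping table.

```python
SO = 0x0E  # shift out: switch to the double-byte state
SI = 0x0F  # shift in: return to the initial single-byte state


def split_siso(data):
    """Split a SI/SO-stateful byte stream into one bytes object per character."""
    width = 1  # initial state: 1 byte per character
    i = 0
    chars = []
    while i < len(data):
        b = data[i]
        if b == SO:
            width = 2
            i += 1
        elif b == SI:
            width = 1
            i += 1
        else:
            chunk = data[i:i + width]
            if len(chunk) < width:
                raise ValueError("truncated double-byte character")
            chars.append(chunk)
            i += width
    return chars


if __name__ == "__main__":
    # One single-byte code, two double-byte codes, one single-byte code.
    print(split_siso(b"\xc1\x0e\x42\x43\x44\x45\x0f\xc2"))
```

Each resulting sequence can then be looked up in the single combined table, since 1-byte and 2-byte keys never collide even when the lead byte ranges overlap the single-byte codes.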

** Proposal(a):

This would be easy in the charset XML format, too: just listing 1- or 2-byte values in b attributes in the assignments sub-elements.

The benefit is to directly convert one single IBM Unicode conversion table into and from one single XML file.

** Proposal(b):

However, I am not so sure about the best way to announce this with the validity spec.

I propose to have some new element be a sub-element to <characterMapping>, and be an alternative to <validity>.

A name for this element could be <stateful_siso> or similar.

It could itself have exactly two <validity> sub-elements, which must define one single-byte state table and one double-byte state table, respectively.

(Is this scheme ever used with single/triple-byte combinations??)

For example,

<!ELEMENT characterMapping (history?, (validity|stateful_siso), assignments)>

<!ELEMENT stateful_siso (validity, validity)>

** Examples and statistics:

I count 20 out of 343 IBM Unicode conversion tables that are "EBCDIC_STATEFUL" as described here. These cover all of the multibyte EBCDIC encodings that IBM is using (as far as the repository is complete). All but 2 also use the subchar1 that I proposed yesterday.

For example, ibm-939_P120-2000.ucm (http://oss.software.ibm.com/cvs/icu/charset/data/ucm/ibm-939_P120-2000.ucm).

In addition, I currently find one ASCII-based SI/SO-stateful IBM mapping table (CCSID 25546) which could also be covered by this proposal.

This is actually a mapping table (including both single-byte and double-byte mappings) for ISO-2022-KR and may be better handled by whatever we come up with for ISO-2022 encodings.

Sincerely,

Markus

--------------------------------------------------------------------------------

Subject: proposal for charset xml: iso-2022

Date: Fri, 27 Apr 2001 14:43:43 -0700

To make the charset XML format reasonably complete for legacy encodings, I would like to propose to include specifications for ISO-2022 variants.

Country- or vendor-specific ISO-2022 encodings are used frequently on the Internet.

Their "very stateful" nature makes it infeasible (in my opinion) to describe one of them fully with one single XML file.

Instead, I propose to extend the XML format to provide a kind of table of contents for an ISO-2022 encoding as an alternative to the usual validity and assignments.

Goal: I think the goal is "only" to identify which escape sequences and state shifts are associated with which mapping tables.

Non-goal: I expect that each country variant of ISO 2022 has its own ways and semantics of combining escape sequences with SO/SS2/SS3 codes, and that it would be difficult, and not very useful, to try to express all the details in XML.

This could look informally like this (I would like to discuss this first before we delve into DTD):

<iso2022>

...

<so>

</so>

<ss2>

</ss2>

<ss3>

</ss3>

</iso2022>

All of these elements just match invocation sequences with canonical names of mapping tables.

I admit that I need to read more about ISO 2022 to confirm that this matches the ISO 2022 framework reasonably well.

I am basing my proposal and samples on Ken Lunde's CJKV book.

The idea is to list designator sequences (which announce but do not shift) under the codes that shift to them. SS3 (Single Shift 3) in ISO-2022-CN can shift to CNS planes 3 to 7 depending on which designator sequence preceded it.

Escape sequences look similar but themselves cause an immediate shift. I put escape sequences on the same level as shift codes. The example escape sequence above is from ISO-2022-JP.
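
The "table of contents" idea can be sketched as a plain lookup structure in Python. Everything below is illustrative: the table names are hypothetical placeholders, not canonical names, and the designator bytes are the ISO-2022-CN / ISO-2022-CN-EXT sequences as I understand them from RFC 1922. The sketch only records which mapping table each shift code would invoke after a given designator. It does not decode text.

```python
ESC = b"\x1b"

# Designator sequences listed under the shift code that invokes them.
# Table names are hypothetical placeholders.
TOC = {
    "so": {
        ESC + b"$)A": "gb-2312",
        ESC + b"$)G": "cns-11643-plane-1",
    },
    "ss2": {
        ESC + b"$*H": "cns-11643-plane-2",
    },
    "ss3": {
        ESC + b"$+I": "cns-11643-plane-3",
        ESC + b"$+J": "cns-11643-plane-4",
    },
}


def designate(current, sequence):
    """Return a new state dict with the table announced by `sequence` recorded.

    `current` maps a shift code name ("so", "ss2", "ss3") to the table that
    the shift code would invoke; it is not modified.
    """
    for shift, designators in TOC.items():
        if sequence in designators:
            updated = dict(current)
            updated[shift] = designators[sequence]
            return updated
    raise KeyError("unknown designator sequence: %r" % (sequence,))


if __name__ == "__main__":
    state = designate({}, ESC + b"$)A")
    state = designate(state, ESC + b"$+I")
    print(state)  # what SO and SS3 would shift to at this point
```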

Sincerely,

markus