L2/01-180
From: Markus Scherer [markus.scherer@jtcsv.com]
Sent: Tuesday, May 01, 2001 9:23 PM
Agenda: 3 proposals to extend the charset XML format
Last week I sent 3 proposals to the mapping list seeking additions to the
charset XML format.

The goal is to be able to describe encodings that are currently not within the
scope of the format, especially IBM and ISO 2022 encodings.

The only feedback so far has been one email from Martin Hosken with questions.

I am appending the 3 emails here.
Subject: proposal for charset xml: subchar1
Subject: proposal: stateful (SI/SO) encoding - single xml mapping file
Subject: proposal for charset xml: iso-2022
Thanks,
markus
================================================================================
Subject: proposal for charset xml: subchar1
Date: Thu, 26 Apr 2001 15:12:24 -0700
Dear fellow charset mappers,

I would like to propose additional attributes and elements for the charset XML
format to support an "interesting" feature in IBM conversion tables:

IBM mapping tables for multibyte codepages define an additional, alternate
codepage substitution character (subchar1), which is always a single-byte code.
In these tables, the regular substitution character (subchar) is always a
double-byte code.

These mapping tables then also list in the mapping section which unassigned code
points should map to this alternate subchar1 instead of to the regular subchar.

IBM mapping tables for single-byte codepages do not use subchar1.
** Proposal:

1. To add an optional attribute "sub1" to the <assignments> element.
2. To add an additional element <sub1> as a sub-element of <assignments>.
   The <sub1> element should have "u", "c", and "v" attributes like <fub>.
In DTD:

    <!ELEMENT assignments (a*, fub*, fbu*, sub1*, range*)>
    <!ATTLIST assignments
        sub  NMTOKENS "1A"
        sub1 NMTOKEN  #IMPLIED
    >
    <!ELEMENT sub1 EMPTY>
    <!ATTLIST sub1
        u NMTOKENS #REQUIRED
        c CDATA    #IMPLIED
        v CDATA    #IMPLIED
    >
** Usage of subchar1 in IBM mapping tables and converters:

The idea is, I think, that characters are "wide" or "narrow". In legacy
codepages, this corresponds to the codes being single-byte or double-byte
codes.

In mappings between two legacy codepages: When a wide (double-byte) character
is unassigned, it results in a double-byte subchar. When a narrow (single-byte)
character is unassigned, it results in a single-byte subchar1.

This is emulated in Unicode<->codepage mapping tables by
- declaring the additional subchar1, and by
- adding one-way mappings from Unicode to the codepage-subchar1 where desired
  for "narrow" characters;
also by
- using U+001a as a "Unicode subchar1"

Typically, all unassigned Latin-1 characters (Unicode<=U+00ff) have subchar1
mappings, but some other code points do as well.
Examples from ibm-1363_P11B-2000.ucm (Korean); subchar1 mappings are marked
with "|2":

    ...
    <subchar> \xA1\xE0
    <subchar1> \x7F
    ...
    CHARMAP
    ...
    <U0080> \x7F |2
    <U0081> \x7F |2
    <U0082> \x7F |2
    ...
    <U009F> \x7F |2
    <U00A0> \x7F |2
    <U00A1> \xA2\xAE |0
    <U00A2> \x7F |2
    <U00A3> \x7F |2
    <U00A4> \xA2\xB4 |0
    <U00A5> \x7F |2
    ...
    <U00FD> \x7F |2
    <U00FE> \xA9\xAD |0
    <U00FF> \x7F |2
    ...
    <U203E> \x7F |2
    ...
    <UFFA0> \x7F |2
    <UFFA1> \x7F |2
    <UFFA2> \x7F |2
    ...
    END CHARMAP
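To make the "|2" marking concrete, here is a minimal Python sketch that reads mapping lines of the form shown in the example and picks out the subchar1 mappings. The function name parse_ucm_mappings and the regular expression are hypothetical helpers written for this illustration; they are not part of ICU or of any IBM tool, and they only handle the line shapes that appear in the excerpt.

```python
import re

# Hypothetical helper for illustration only; handles "<UXXXX> \xNN... |f" lines.
_MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)


def parse_ucm_mappings(text):
    """Return a list of (code_point, codepage_bytes, flag) tuples."""
    mappings = []
    for line in text.splitlines():
        match = _MAPPING.match(line.strip())
        if match is None:
            continue  # skip <subchar> headers, "...", CHARMAP and END CHARMAP
        code_point = int(match.group(1), 16)
        hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        mappings.append((code_point, raw, int(match.group(3))))
    return mappings


# A few lines taken from the ibm-1363 excerpt quoted in the email.
SAMPLE = r"""
<subchar> \xA1\xE0
<subchar1> \x7F
CHARMAP
<U00A0> \x7F |2
<U00A1> \xA2\xAE |0
<U00A2> \x7F |2
END CHARMAP
"""

mappings = parse_ucm_mappings(SAMPLE)
# Code points whose mapping is flagged "|2", i.e. that map to subchar1.
sub1_code_points = [cp for cp, _, flag in mappings if flag == 2]
```

With this sample, sub1_code_points holds U+00A0 and U+00A2, while U+00A1 keeps its regular roundtrip ("|0") mapping.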
* This means that

when one converts from Unicode to such a codepage and finds an unassigned code
point, then
- if a subchar1 mapping is defined, output that
- otherwise output the regular subchar

when one converts from such a codepage to Unicode and finds an unassigned code,
then
- if the input sequence is of length 1 _and_ a subchar1 is specified for the
  codepage, output U+001a
- otherwise output U+fffd
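The two sets of rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not ICU's or IBM's implementation: the function names, the dictionary-based tables, and the idea of keeping the "|2" code points in a separate set are all hypothetical. The byte values come from the ibm-1363 excerpt; the unassigned inputs are made up.

```python
def from_unicode(code_point, roundtrip, sub1_code_points, subchar, subchar1):
    """Unicode -> codepage for one code point, following the rules above."""
    if code_point in roundtrip:
        return roundtrip[code_point]
    # Unassigned: prefer subchar1 where a subchar1 mapping is defined.
    if subchar1 is not None and code_point in sub1_code_points:
        return subchar1
    return subchar


def to_unicode(byte_seq, table, subchar1):
    """Codepage -> Unicode for one character code, following the rules above."""
    if byte_seq in table:
        return table[byte_seq]
    # Unassigned: single-byte input gets U+001A if the codepage has a subchar1.
    if len(byte_seq) == 1 and subchar1 is not None:
        return 0x001A
    return 0xFFFD


# Values from the ibm-1363 excerpt.
SUBCHAR = b"\xA1\xE0"
SUBCHAR1 = b"\x7F"
ROUNDTRIP = {0x00A1: b"\xA2\xAE"}
SUB1_CODE_POINTS = {0x00A0, 0x00A2}
TO_UNICODE = {b"\xA2\xAE": 0x00A1}

narrow = from_unicode(0x00A0, ROUNDTRIP, SUB1_CODE_POINTS, SUBCHAR, SUBCHAR1)
# U+4E00 stands in for an unassigned code point without a "|2" mapping
# (an assumption made for this example).
wide = from_unicode(0x4E00, ROUNDTRIP, SUB1_CODE_POINTS, SUBCHAR, SUBCHAR1)
# b"\xFF" and b"\xFE\xFE" stand in for unassigned codes (also assumptions).
single = to_unicode(b"\xFF", TO_UNICODE, SUBCHAR1)
double = to_unicode(b"\xFE\xFE", TO_UNICODE, SUBCHAR1)
```

Here narrow is the single-byte subchar1, wide is the double-byte subchar, single is U+001A, and double is U+FFFD.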
Many IBM converters do not seem to distinguish between
roundtrip/fallback/subchar[1] and just have the desired default results in the
Unicode->codepage runtime tables.

ICU currently implements much but not all of this. (I am opening a feature
request for what is missing.)

** This is not an ICU invention. It is a long-standing feature of IBM Unicode
conversion tables. I am proposing it for the charset XML so that an XML file
can represent this feature in IBM Unicode conversion tables.

I count 58 out of 343 IBM Unicode conversion tables with subchar1
specifications and explicit ("|2") subchar1 mappings.

Sincerely,
markus
-------------------------------------------------------------------------------
Subject: proposal: stateful (SI/SO) encoding - single xml mapping file
Date: Fri, 27 Apr 2001 14:02:34 -0700
Dear fellow legacy encoding victims,

This is another proposal to extend the expressiveness of the charset XML format
to what common IBM Unicode conversion tables use.

IBM uses many encodings that are stateful. To be more precise, the EBCDIC
multibyte encodings all use exactly two states and change between them with SI
and SO control codes. There are a few ASCII-based SI/SO encodings as well. (As
it happens, the byte values for SI and SO are the same in EBCDIC and ASCII!)

Such stateful encodings are announced and tracked with a single CCSID (IBM
encoding ID) and are listed in the repository with one single mapping table
that lists mappings for both states together. The mappings are implicitly (and
at runtime) distinguished by their numbers of bytes per character: 1 in the
initial state, and 2 in the other state.

(The double-byte lead byte ranges overlap a lot with the single-byte codes.)
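As an illustration of how the two states determine the number of bytes per character, here is a minimal Python sketch that splits a byte stream into character codes. It assumes SO (0x0E) switches to the double-byte state and SI (0x0F) returns to the initial single-byte state; the function name split_siso and the sample bytes are hypothetical and not taken from any IBM table.

```python
SO = 0x0E  # Shift Out: switch to the double-byte state
SI = 0x0F  # Shift In: return to the initial single-byte state


def split_siso(data):
    """Split an SI/SO-stateful byte string into 1- or 2-byte character codes."""
    codes = []
    width = 1  # the initial state is single-byte
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SO:
            width = 2
            i += 1
        elif byte == SI:
            width = 1
            i += 1
        else:
            chunk = data[i:i + width]
            if len(chunk) < width:
                raise ValueError("truncated double-byte code at offset %d" % i)
            codes.append(chunk)
            i += width
    return codes


# Made-up input: one single-byte code, two double-byte codes, one single-byte code.
codes = split_siso(b"\xC1\x0E\x42\x41\x42\x42\x0F\xC2")
```

Because the state alone decides the width, the same byte value (0x42 here) can appear inside a double-byte code without conflicting with a single-byte code of the same value, which is why one mapping table can list both kinds of mappings together.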
** Proposal (a):

This would be easy in the charset XML format, too: just list 1- or 2-byte
values in the b attributes of the assignments sub-elements.

The benefit is to directly convert one single IBM Unicode conversion table into
and from one single XML file.

** Proposal (b):

However, I am not so sure about the best way to announce this with the validity
spec.

I propose to have some new element be a sub-element of <characterMapping>, and
be an alternative to <validity>.

A name for this element could be <stateful_siso> or similar.

It itself could have exactly two <validity> sub-elements, which must define
only one single-byte and one double-byte state table, respectively.

(Is this scheme ever used with single/triple-byte combinations??)

For example,

    <!ELEMENT characterMapping (history?, (validity|stateful_siso), assignments)>
    <!ELEMENT stateful_siso (validity, validity)>
** Examples and statistics:

I count 20 out of 343 IBM Unicode conversion tables that are
"EBCDIC_STATEFUL" as described here. These cover all of the multibyte EBCDIC
encodings that IBM is using (as far as the repository is complete). All but 2
also use the subchar1 that I proposed yesterday.

For example, ibm-939_P120-2000.ucm
(http://oss.software.ibm.com/cvs/icu/charset/data/ucm/ibm-939_P120-2000.ucm).

In addition, I currently find one ASCII-based SI/SO-stateful IBM mapping table
(CCSID 25546) which could also be covered by this proposal.

This is actually a mapping table (including both single-byte and double-byte
mappings) for ISO-2022-KR and may be better handled by whatever we come up with
for ISO-2022 encodings.

Sincerely,
Markus
--------------------------------------------------------------------------------
Subject: proposal for charset xml: iso-2022
Date: Fri, 27 Apr 2001 14:43:43 -0700
To make the charset XML format reasonably complete for legacy encodings, I
would like to propose including specifications for ISO-2022 variants.

Country- or vendor-specific ISO-2022 encodings are used frequently on the
Internet.

Their "very stateful" nature makes it infeasible (in my opinion) to describe
one of them fully with one single XML file.

Instead, I propose to extend the XML format to provide a kind of table of
contents for an ISO-2022 encoding as an alternative to the usual validity and
assignments.

Goal: I think the goal is "only" to identify which escape sequences and state
shifts are associated with which mapping tables.

Non-goal: I expect that each country variant of ISO 2022 has its own ways and
semantics of combining escape sequences with SO/SS2/SS3 codes, and that it
would be difficult and not very useful to try to express all the details in
XML.

This could look informally like this (I would like to discuss this first before
we delve into DTD):
...
    <iso2022>
        <escape sequence="1b 28 4a" name="jis-roman"/>
        <so>
            <designator sequence="1b 24 29 41" name="gb-2312_80-1980"/>
            <designator sequence="1b 24 29 47" name="cns-11643_1-1992"/>
            <designator sequence="1b 24 29 45" name="iso-ir_165-1992"/>
        </so>
        <ss2>
            <designator sequence="1b 24 2a 48" name="cns-11643_2-1992"/>
        </ss2>
        <ss3>
            <designator sequence="1b 24 2b 49" name="cns-11643_3-1992"/>
            <designator sequence="1b 24 2b 4a" name="cns-11643_4-1992"/>
        </ss3>
    </iso2022>
All of these elements just match invocation sequences with canonical names of
mapping tables.
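To show how such a table of contents might be consumed, here is a minimal Python sketch that loads an <iso2022> element into a lookup from byte sequence to shift code and mapping table name. The element and attribute names follow the informal sample in this email, which is a proposal and not a finalized format; the function name load_iso2022_toc and the shape of the result are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened version of the informal sample from the email.
SAMPLE = """
<iso2022>
  <escape sequence="1b 28 4a" name="jis-roman"/>
  <so>
    <designator sequence="1b 24 29 41" name="gb-2312_80-1980"/>
  </so>
  <ss3>
    <designator sequence="1b 24 2b 49" name="cns-11643_3-1992"/>
  </ss3>
</iso2022>
"""


def load_iso2022_toc(xml_text):
    """Map each byte sequence to (shift, table_name).

    shift is the tag of the enclosing shift element ("so", "ss2", "ss3"),
    or None for an <escape>, which shifts immediately by itself.
    """
    root = ET.fromstring(xml_text)
    toc = {}
    for child in root:
        if child.tag == "escape":
            sequence = bytes.fromhex(child.get("sequence"))
            toc[sequence] = (None, child.get("name"))
        else:
            for designator in child.findall("designator"):
                sequence = bytes.fromhex(designator.get("sequence"))
                toc[sequence] = (child.tag, designator.get("name"))
    return toc


toc = load_iso2022_toc(SAMPLE)
```

A converter could then look up toc[b"\x1b$+I"] to learn that this designator selects cns-11643_3-1992 for a later SS3, without the XML having to describe the rest of the ISO 2022 state machine.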
I admit that I need to read more about ISO 2022 to confirm that this matches
the ISO 2022 framework reasonably well.

I am basing my proposal and samples on Ken Lunde's CJKV book.

The idea is to list designator sequences (which announce but do not shift)
under the codes that shift to them. SS3 (Single Shift 3) in ISO-2022-CN can
shift to CNS planes 3 to 7 depending on which designator sequence preceded it.
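The dependency of SS3 on the preceding designator can be sketched in Python as below. This is a simplified illustration with several assumptions that the email does not state: SS3 is taken in its 7-bit form ESC O (1b 4f) as used by ISO-2022-CN, each SS3 is assumed to be followed by exactly one two-byte character, and the function name tables_for_ss3 and the sample bytes are hypothetical.

```python
# Designators from the sample above, keyed by byte sequence.
DESIGNATORS = {
    bytes.fromhex("1b 24 2b 49"): ("ss3", "cns-11643_3-1992"),
    bytes.fromhex("1b 24 2b 4a"): ("ss3", "cns-11643_4-1992"),
}
SS3 = bytes.fromhex("1b 4f")  # assumption: 7-bit single shift 3, ESC O


def tables_for_ss3(data):
    """Return the mapping table name in effect at each SS3 in the stream."""
    designated = {}  # shift code -> most recently announced table
    hits = []
    i = 0
    while i < len(data):
        if data[i:i + 4] in DESIGNATORS:
            # A designator announces a table but does not shift by itself.
            shift, name = DESIGNATORS[data[i:i + 4]]
            designated[shift] = name
            i += 4
        elif data[i:i + 2] == SS3:
            # The shift uses whichever table was designated last.
            hits.append(designated.get("ss3"))
            i += 4  # skip SS3 and the two-byte character it applies to
        else:
            i += 1
    return hits


# Made-up stream: designate plane 3, shift once, designate plane 4, shift again.
stream = (
    bytes.fromhex("1b 24 2b 49") + SS3 + b"\x21\x21"
    + bytes.fromhex("1b 24 2b 4a") + SS3 + b"\x21\x21"
)
hits = tables_for_ss3(stream)
```

The same SS3 code resolves to the plane 3 table the first time and to the plane 4 table the second time, which is why the proposal lists designators under the shift code that invokes them.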
Escape sequences look similar but themselves cause an immediate shift. I put
escape sequences on the same level as shift codes. The example escape sequence
above is from ISO-2022-JP.

Sincerely,
markus