L2/00-300

From: Kenneth Whistler [kenw@sybase.com]
Sent: Friday, September 01, 2000 4:40 PM
Subject: Some technical issues regarding the future of SC22/WG20

================================================================

Arnold Winkler has recently raised a number of issues regarding the future of SC22/WG20 and the standards that it maintains or has under development, for consideration at the upcoming SC22 plenary in Nara. Chief among the issues he raised is whether WG20 is now at the end of its useful life, and whether it should be sunsetted, with its various projects redistributed over time to other committees as appropriate for maintenance.

I want to review some of the technical issues that may have a bearing on where such maintenance should be done, and to further consider whether some of the projects currently under development in WG20 have enough technical merit to warrant their continuation in some other committee, should WG20 itself be dissolved sometime in the not-so-distant future. (Presumably any such dissolution would be judiciously staged, over a one-to-two-year period, to allow completion, termination, or transfer of responsibilities, as appropriate.)

The charter of WG20 was fairly broad -- standards in the area of internationalization -- as reflected in the first TR it published, TR 11017, "Framework for internationalization". In recent years, however, the committee has focused on a few significant areas, so I will concentrate my comments on those areas that have, de facto, constituted the majority of WG20's work.

1. Collation

WG20 developed ISO 14651, soon to be approved and published as an international standard. This standard needs an immediate amendment to deal with the larger repertoire of characters added for 10646-1:2000 (= Unicode 3.0). The question arises as to the appropriate venue for that maintenance, if not WG20. The alternatives being argued are SC22 or SC2. This issue is actually rather easy to resolve on technical grounds.
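To make the maintenance burden concrete: 14651-style ordering is table-driven, so adding characters means adding weight entries, not changing the algorithm. The following is a minimal sketch of multi-level weighted comparison in that general style; the weight values are invented for illustration and are not drawn from the actual 14651 template table.

```python
# Minimal sketch of multi-level weighted collation in the general style
# of ISO 14651 / the Unicode Collation Algorithm. The weight table is a
# tiny hypothetical tailoring, NOT the actual 14651 template table.

# (primary, secondary) weights: primary distinguishes base letters,
# secondary distinguishes accents; case could be a third level.
WEIGHTS = {
    'a': (1, 0), 'á': (1, 1),   # same primary, different secondary
    'b': (2, 0),
    'c': (3, 0),
}

def sort_key(s):
    """Build a sort key: all primary weights first, then all secondaries."""
    primaries = [WEIGHTS[ch][0] for ch in s]
    secondaries = [WEIGHTS[ch][1] for ch in s]
    return (primaries, secondaries)

# 'ab' differs from 'áb' only at the secondary level, but 'ac' differs
# at the primary level -- accents matter less than base-letter changes.
words = ['ac', 'áb', 'ab']
print(sorted(words, key=sort_key))   # -> ['ab', 'áb', 'ac']
```

Extending the repertoire then consists entirely of adding rows to the weight table, which is exactly why the committee that knows the new characters is best placed to do it.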
The character-related expertise in SC2, and in particular in SC2/WG2 (the maintainer of ISO 10646), is exactly what is needed to extend the tables required for ISO 14651 -- and that is in fact the main work that 14651 maintenance will entail. The architecture for string ordering in 14651 is complete; the standard simply needs the weights listed in its tailorable template table extended, to keep up with the continual additions of characters to 10646. The best way to accomplish that is to keep the standard with the committee that actually adds the characters: they know what the characters are and are best placed to coordinate timely updates for a related standard that needs to add those characters to its tables.

Furthermore, among the active participants in WG2 are the experts on collation (with implementation experience) who actually ended up authoring much of the content of 14651. Comparable experience is not obviously available in the SC22 committees other than WG20.

Finally, because of the current close working relationship between WG2 and the Unicode Technical Committee, WG2 is also the best place to maintain a standard that should stay in synch with the Unicode Collation Algorithm maintained by the UTC, to prevent unanticipated "drift" between the two standards.

2. Locale Extensions

WG20 is developing TR 14652, "Specification Method for Cultural Conventions". The specifications defined in 14652 are very closely modeled on the definition of locale in ISO 9945, the POSIX standard, as reflected in related documentation such as XPG4 from X/Open. In effect, it was conceived of as an extension to the locale constructs: to add more internationalization elements, as mentioned in TR 11017, into a formal syntactic construct that could be used to generate machine-readable locale definitions. So it adds definitions for LC_NAME, LC_ADDRESS, LC_IDENTIFICATION, etc.
to the older groupings LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. Furthermore, it attempts to extend the preexisting categories with new keywords to deal with collation as defined in 14651, with the new large character set defined in 10646, and with new internationalization issues such as monetary formats involving the euro sign.

It is pretty clear that the impetus and rationale for 14652 derive from the POSIX side. As such, it logically belongs in SC22/WG15 for further development, rather than in SC2. The participants in SC2, while interested in internationalization issues related to locales, have no particular interest or expertise in the POSIX-specific syntax extensions covered by 14652, nor do they have any expertise in ISO 9945 itself, which has to be closely tracked in the development of 14652 to avoid superfluous inconsistencies. SC2 also has no established history of working liaison relationships with SC22/WG15 -- a situation which would bode ill for trying to develop what is effectively a POSIX extension in a committee ill-suited to do so.

3. Character Properties

The most contentious issue regarding DTR 14652 is the effort to extend LC_CTYPE to cover the repertoire of ISO 10646-1. The contending positions effectively reflect a worldview divide among the participants regarding character properties:

Position A: Character properties have not traditionally been covered by character encoding standards, and have not been viewed as the domain of the ISO committee responsible for encoding characters: SC2. Instead, character properties are an implementation issue, traditionally dealt with in the standards most directly concerned with character implementation -- namely the formal language standards -- and are dealt with in ISO by the working groups under SC22. In the context of 14652, the appropriate place to define character properties is LC_CTYPE, where the properties would be usable in a POSIX context as part of locale definitions.
Position B: Character properties for the *universal* character set -- namely ISO 10646 (= Unicode) -- are inherent to *characters*, and should *not* be defined in locales. The locale model and LC_CTYPE were an attempt to provide a mechanism for dealing with properties of characters in alternate encodings, but that model does not scale well for dealing with properties for the universal repertoire of 10646. Furthermore, it is inappropriate to assert that character properties are defined in locales, and are thus subject to locale-specific variation, since such a position would lead to inconsistent and inexplicable differences in application behavior, depending on locale, in ways that have no bearing on the usually understood issues of locale-specific formatting differences, etc. Because character properties are closely tied to the characters themselves, responsibility for defining them should belong with the character encoding committees rather than with the language committees -- and thus in SC2, rather than SC22.

It is clear that among the rather large community of implementers of 10646 (= Unicode), Position B has much more widespread support than Position A. Position A is, however, a vocally held minority opinion among those committed to the extension of the POSIX framework.

In point of actual fact, the *real* work on standardization of 10646 character properties is being done almost entirely by the Unicode Technical Committee, which for years now has been publishing machine-readable tables of character properties and associated technical reports that are in widespread implementation in many products. A very few character properties, most notably "combining" and "mirroring", are also formally maintained by SC2/WG2 in ISO 10646 itself, and those properties are tracked in parallel by the UTC.

On balance, it would seem far preferable to conclude that within JTC1 any responsibility for character properties should belong to SC2, rather than SC22.
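Position B matches what stock implementations already do. For example, Python's standard unicodedata module, which is generated from the UTC's machine-readable data files, answers property queries with no locale in sight:

```python
import unicodedata

# Character properties come from the UTC's machine-readable data files
# (UnicodeData.txt), not from any locale: the answers below are the
# same no matter what locale the program runs under.

print(unicodedata.category('A'))        # 'Lu' -- uppercase letter
print(unicodedata.category('5'))        # 'Nd' -- decimal digit
print(unicodedata.combining('\u0301'))  # 230  -- combining acute accent
print(unicodedata.mirrored('('))        # 1    -- mirrored in bidi text
```

Note that "combining" and "mirrored" are precisely the two properties the text above mentions as formally maintained in ISO 10646 itself and tracked in parallel by the UTC.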
Once again, this is a matter of expertise regarding the huge number of characters in 10646. That expertise is in SC2, not in SC22. And the implementation experience regarding character properties resides in the UTC, which has a firm working relationship with SC2 but no close ties to SC22.

Regarding LC_CTYPE in particular, its maintenance or extension should be remanded to WG15, along with all of DTR 14652, but with the following recommendations. Rather than attempting to independently extend LC_CTYPE definitions to cover 10646, a mechanism should be developed whereby POSIX implementations using LC_CTYPE can make use of the more widespread and better researched and reviewed character property definitions developed by the UTC, in cooperation with SC2/WG2's development of 10646. This should be done by *reference*, rather than by enumerating lists of characters in SC22 standards or TR's, because of the danger of those lists getting out of synch or introducing errors that cause interoperability problems. Furthermore, this practice of dealing with character properties by reference to standards developed by the UTC and/or SC2 should be recommended to *all* the SC22 committees, as the generic way to deal with character properties in formal language standards.

4. Internationalization API Standard

WG20 has a project on the books, 15435, to develop an API standard for internationalization. To date, very little evidence has been proffered that there is any actual demand for such a standard. There is no list of IT companies requesting it to solve some interoperability problem. The big OS and tools vendors are not requesting it. The Linux internationalization community has rejected it in favor of other options. The Java community has no interest -- it already has a sophisticated internationalization architecture.
The Unicode Technical Committee, which has very widespread representation from the implementing community, has indicated zero interest in the 15435 project. No one in WG20 but the project editor seems to be doing any active work to develop the standard, and the committee feedback to date has largely been that the quality of the drafts is poor. Fundamental questions regarding the nature of the API design have not been resolved.

Furthermore, there has been a lot of hand-waving over the issue of how closely tied the proposed API is to the locale extension constructs of DTR 14652. The API under development for 15435 is locale-centric, in that it requires information in an "FDCC-set" defined a la DTR 14652, assuming API behavior will depend on that information, resident in some implementation-defined "database". Modern internationalization libraries have largely eschewed that kind of locale-centric design as too constrained, instead breaking the problem of internationalization support up into more modular designs that separate out different aspects of the problems involved.

Furthermore, the proposed API standard aspires to platform-independent design. That, however, inappropriately conflates the issue of designing appropriate behavior for internationalization with the problem of designing appropriately abstracted API's for that behavior on distinct platforms. In actual practice, implementers tend to make use of available libraries that surface correct internationalization behavior (such as the ICU classes) and then write whatever wrappers are necessary to abstract that behavior into their systems. The days of trying to define complex behavior via ISO API standards, to be rolled out by language compiler vendors in standard C libraries and such, are being overtaken by object-oriented design and software component models.
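The design contrast described above can be sketched in a few lines. Everything here is a hypothetical illustration: the names FDCC_SETS, format_number_locale_centric, and DecimalFormatter are invented for the sketch, and are not drawn from 15435, DTR 14652, or ICU.

```python
# Sketch contrasting the two API styles discussed above.
# All names below are hypothetical illustrations, not any real API.

# Locale-centric style (the DTR 14652 / 15435 model): behavior is keyed
# off a named entry in an implementation-defined database of FDCC-sets.
FDCC_SETS = {
    'da_DK': {'decimal_point': ','},
    'en_US': {'decimal_point': '.'},
}

def format_number_locale_centric(value, locale_name):
    """Look up behavior in the global database by locale name."""
    sep = FDCC_SETS[locale_name]['decimal_point']
    return str(value).replace('.', sep)

# Modular style (ICU-like): the caller holds an explicit formatter
# object; the call itself implies no global database lookup, and the
# pieces (collation, formatting, etc.) are separately composable.
class DecimalFormatter:
    def __init__(self, decimal_point):
        self.decimal_point = decimal_point
    def format(self, value):
        return str(value).replace('.', self.decimal_point)

danish = DecimalFormatter(',')
print(format_number_locale_centric(3.14, 'da_DK'))  # 3,14
print(danish.format(3.14))                          # 3,14
```

Both produce the same output here, but the modular style lets an implementation mix, test, and wrap each behavior independently, which is the direction the text says the industry has taken.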
At this point, WG20's project 15435 should simply be abandoned as a well-intentioned but obsolete project that has no demonstrated need or support for its development.

5. Cultural Registry Standard

WG20 is also charged with the maintenance of the cultural registry standard, ISO 15897. That registry needs a firm review and resolution process to ensure its correctness and market relevance. WG20 should be able to provide the definition of such a resolution process, along the lines provided by ISO 2375 for the character set registry. Once the review is done, and ISO 15897 has been appropriately updated, it should be a stabilized standard, requiring little further work or attention. It will then be the responsibility of the registering agency (DKUUG) to follow the registration process and to make the cultural element registry worthwhile.

6. Identifiers

An issue that WG20 has had to deal with fairly recently is the list of recommended characters for identifiers, in Annex A of TR 10176, "Guidelines for the preparation of programming language standards". Because the list of recommended characters for identifiers is based on the repertoire of ISO 10646, this is another area where repeated maintenance into the future can be foreseen, as the repertoire of 10646 continues to expand.

Once again, because of the location of character expertise regarding all the characters added to 10646, the logical source for recommendations about how to extend the list in Annex A in the future is SC2. This is supported by the additional fact that determining which characters are and are not appropriate in identifiers implicitly depends on the specification of a constellation of properties for those characters -- again an area in which the expertise is located in SC2. However, there is somewhat of a conundrum here, since the remainder of the content of TR 10176 is clearly in the domain of SC22, and the TR as a whole is inappropriate for maintenance in SC2.
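As an illustration of how identifier rules in a modern language already rest on 10646/Unicode character properties rather than on locale settings or syntax alone, consider Python's built-in identifier test, which follows the Unicode identifier recommendations:

```python
import unicodedata

# Whether a string is a valid identifier is decided from character
# properties of the universal character set: letters from any script
# may start one, digits may not, and punctuation is excluded.
print('naïve'.isidentifier())    # True  -- accented letters allowed
print('変数'.isidentifier())     # True  -- CJK ideographs allowed
print('2fast'.isidentifier())    # False -- a digit cannot start one
print('a-b'.isidentifier())      # False -- hyphen is not allowed

# The decision ultimately rests on properties such as general category:
print(unicodedata.category('ï'))  # 'Ll' -- lowercase letter
```

Every expansion of the 10646 repertoire forces this classification to be revisited for the new characters, which is exactly the maintenance problem described above.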
Perhaps some kind of understanding could be arranged between the SC's to guarantee that modifications to Annex A of TR 10176 are made only with timely, coequal input from SC2. A better solution, in the long run, would be to sever the exact table in Annex A -- which has to track character repertoires and properties that are (or should be) the responsibility of SC2 -- from TR 10176 per se, and instead insert a reference there to a standard list maintained by SC2, either in the context of 10646 itself or in some associated TR to be developed by WG2 for this purpose. That would more appropriately divide the responsibilities between the part of TR 10176 concerned with formal language syntax and design and the part which is attempting to track the universal character encoding repertoire as it expands over time.

Another reason for moving in this direction is the particular interest that the Unicode Technical Committee has in the identifier content problem. The Unicode Standard has detailed recommendations regarding identifiers, and the Unicode Technical Committee is currently working on even more detailed specifications regarding identifiers and identifier-like constructs for use in various contexts on the World Wide Web and the Internet. It is in JTC1's interest to keep this particular technical issue active in a venue, namely SC2/WG2, where the character encoding expertise is available and the working relationship with the UTC is strong.

Even though on the surface it might seem that programming identifier syntax clearly belongs to SC22, the real issue is not the syntax per se (which is quite simple), nor the concept of an identifier and its relation to other programming language constructs (which the UTC and SC2 have little interest in, and consider to have been fixed and decided long ago by the SC22 standards).
No, the *real* issue that remains open and problematical is how to classify and distribute all the thousands of additional characters in 10646, and how to deal with the complex ramifications of including various compatibility characters which may or may not change under various kinds of identifier normalization processes. That is where the UTC and WG2 expertise would be most helpful, and where joint development of Unicode and ISO standards would be most likely to minimize interoperability problems for identifiers in different programming languages and Internet and Web protocols.

This entire issue is, by the way, also of intense interest to the database standards arena, where it is of direct relevance to the SQL standard, for example. So the SC22 working groups are not the only JTC1 groups with an interest in standard, interoperable results in this area for 10646 characters.

7. Case Mapping and Case Folding

WG20 has not spent much time dealing with case mapping and case folding issues, although those clearly have an internationalization angle, because of local differences in case mapping preferences. The one point where this has been dealt with by WG20 is in the LC_CTYPE specification in DTR 14652. This is because LC_CTYPE is the location of the information used by the tolower() and toupper() case mapping transforms for C (and, by extension, other languages). As a result, PDTR 14652 includes tables of case pairs for all of the 10646 characters that have case pairs.

However, the explicit inclusion of these case mappings in the "i18n" LC_CTYPE definition in DTR 14652 has been controversial in the committee, in part because of a small number of unexplained inconsistencies between those tables and the case mappings provided by the Unicode Consortium on its website. The Unicode case mappings are very widely implemented in many products, and are being treated by the industry as a de facto standard.
So it is problematical for DTR 14652 to propose, in a standards document, slightly different case mappings that contradict widespread practice. This is once again an area where the JTC1 standards arena would be better served by references to de facto practice, rather than by trying to reinvent the wheel with long lists in other standards or TR's, subject to the introduction of error or drift that can cause interoperability problems. Perhaps here the SC22 language working groups could work with SC2/WG2 to find a way to make the de facto Unicode tables referenceable through an SC2 TR of some sort, to avoid the synchronization issues of trying to maintain two (huge) lists separately.

The area of case folding is related to case mapping, but is subtly different. WG20 has not dealt with this issue, but it is clear that the SC22 language working groups need to. In particular, COBOL, Pascal, and other languages that have case-insensitive identifiers need to be able to do reliable case folding during the parsing/lexing phases of program text interpretation. For that, they need reliable definitions of case folding as applied to the 10646 characters allowed inside identifiers for each language.

While WG20 has not touched on this issue and the SC22 working groups are just starting to search for an answer, the Unicode Technical Committee and the IETF have moved ahead, creating de facto solutions that will see widespread implementation in the near future. The Unicode Technical Committee has already published CaseFolding.txt, a machine-readable file with recommendations on exactly how to do case folding for all Unicode 3.0 characters (i.e., 10646-1:2000 characters).
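The mapping/folding distinction is easy to see in an implementation that follows the UTC data files. Python's str.lower() applies case mapping, while str.casefold() applies full case folding; the German sharp s shows why folding, not mapping, is what case-insensitive identifier matching needs:

```python
# Case mapping (lower/upper) and case folding are subtly different.
# German sharp s is the classic example: its lowercase mapping is
# itself, but it *folds* to "ss", so that case-insensitive
# comparison against "SS"/"ss" works.

s1, s2 = 'STRASSE', 'Straße'

print(s1.lower())     # 'strasse'
print(s2.lower())     # 'straße'  -- mapping keeps the sharp s
print(s2.casefold())  # 'strasse' -- folding expands it

# Reliable case-insensitive comparison therefore needs folding:
print(s1.lower() == s2.lower())        # False
print(s1.casefold() == s2.casefold())  # True
```

A parser for a case-insensitive language that used only tolower()-style mapping would treat these two spellings as different identifiers; CaseFolding.txt exists to pin down the folding behavior for the entire repertoire.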
The SC22 committees should review that file, and the associated case mapping information available in UnicodeData.txt and SpecialCasing.txt -- also available on the Unicode website -- before concluding that new standardization efforts need to be initiated in SC22 (whether in WG20 or in other working groups) to repeat the work involved in creating those files, which are already freely available to all implementers.

The UTC and the IETF are currently working on the even thornier problem of determining how best to define identifiers in a context (such as internationalized domain names) where certain characters are disallowed (such as punctuation that has other reserved uses in URL syntax), where case folding is required, where normalization of data is also required (disallowing equivalent sequences that might otherwise appear identical), and where even visual look-alikes of otherwise different characters are to be avoided if possible, because of the confusion they can pose for user entry and the possibility of spoofing. This is an area where intimate knowledge of all the characters in 10646, and of the interaction of their properties and appearances, is required.

Yet again, it would behoove the SC22 working groups to participate in the joint UTC/IETF effort in this area through review and feedback, rather than trying to reinvent the wheel in a committee context where less relevant expertise would be available to start with.
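The normalization complication mentioned above -- compatibility characters that change under some normalization forms but not others -- can be illustrated with Python's standard unicodedata module:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character: it is
# preserved by canonical normalization (NFC) but decomposed by
# compatibility normalization (NFKC).
lig = '\ufb01'   # the "fi" ligature, a single character

print(unicodedata.normalize('NFC', lig) == lig)   # True -- unchanged
print(unicodedata.normalize('NFKC', lig))         # 'fi' -- two characters

ident1 = 'de' + lig + 'ne'   # "define" spelled with the ligature
ident2 = 'define'
print(ident1 == ident2)                                 # False
print(unicodedata.normalize('NFKC', ident1) == ident2)  # True
```

Whether these two spellings name the same identifier thus depends entirely on which normalization form, if any, the identifier definition mandates -- precisely the kind of decision that requires the character-level expertise of the UTC and WG2.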