L2/00-300

From: Kenneth Whistler [kenw@sybase.com]
Sent: Friday, September 01, 2000 4:40 PM
Subject: Some technical issues regarding the future of SC22/WG20

================================================================

Arnold Winkler has recently raised a number of issues regarding the future of SC22/WG20 and the standards that it maintains or has under development, for consideration at the upcoming SC22 plenary in Nara. Chief among the issues he raised is whether WG20 is now at the end of its useful life, and whether it should be sunsetted, with its various projects redistributed over time to other committees as appropriate for maintenance.

I want to review some of the technical issues that may have a bearing on where such maintenance should be done, and to further consider whether some of the projects currently under development in WG20 have enough technical merit to warrant their continuation in some other committee, should WG20 itself be dissolved sometime in the not-so-distant future. (Presumably any such dissolution would be judiciously staged, over a one-to-two-year period, to allow completion, termination, or transfer of responsibilities, as appropriate.)

The charter of WG20 was fairly broad -- standards in the area of internationalization -- as reflected in the first TR it published, TR 11017, "Framework for internationalization". In recent years, however, the committee has focused on a few significant areas, so I will concentrate my comments on those areas that have, de facto, constituted the majority of WG20's work.

1. Collation

WG20 developed ISO 14651, soon to be approved and published as an international standard. This standard needs an immediate amendment to deal with the larger repertoire of characters added for 10646-1:2000 (= Unicode 3.0). The question arises as to the appropriate venue for that maintenance, if not WG20. The alternatives being argued are SC22 or SC2. This issue is actually rather easy to resolve on technical grounds.
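To make the maintenance burden concrete: 14651-style ordering is table-driven, so adding characters means adding weight entries, not changing the algorithm. The following is a minimal sketch of multi-level weighted comparison in that general style; the weight values are invented for illustration and are not drawn from the actual 14651 template table.

```python
# Minimal sketch of multi-level weighted collation in the general style
# of ISO 14651 / the Unicode Collation Algorithm. The weight table is a
# tiny hypothetical tailoring, NOT the actual 14651 template table.

# (primary, secondary) weights: primary distinguishes base letters,
# secondary distinguishes accents; case could be a third level.
WEIGHTS = {
    'a': (1, 0), 'á': (1, 1),   # same primary, different secondary
    'b': (2, 0),
    'c': (3, 0),
}

def sort_key(s):
    """Build a sort key: all primary weights first, then all secondaries."""
    primaries = [WEIGHTS[ch][0] for ch in s]
    secondaries = [WEIGHTS[ch][1] for ch in s]
    return (primaries, secondaries)

# 'ab' differs from 'áb' only at the secondary level, but 'ac' differs
# at the primary level -- accents matter less than base-letter changes.
words = ['ac', 'áb', 'ab']
print(sorted(words, key=sort_key))   # -> ['ab', 'áb', 'ac']
```

Extending the repertoire then consists entirely of adding rows to the weight table, which is exactly why the committee that knows the new characters is best placed to do it.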
The character-related expertise in SC2, and in particular in SC2/WG2 (the maintainer of ISO 10646), is exactly what is needed to extend the tables required for ISO 14651 -- and that is in fact the main work that 14651 maintenance will entail. The architecture for string ordering in 14651 is complete; the standard simply needs the weights listed in its tailorable template table extended, to keep up with the continual additions of characters to 10646. The best way to accomplish that is to keep the standard with the committee that actually adds the characters: they know what the characters are and are best placed to coordinate timely updates for a related standard that needs to add those characters to its tables.

Furthermore, among the active participants in WG2 are the experts on collation (with implementation experience) who actually ended up authoring much of the content of 14651. Comparable experience is not obviously available in the SC22 committees other than WG20.

Finally, because of the current close working relationship between WG2 and the Unicode Technical Committee, WG2 is also the best place to maintain a standard that should stay in synch with the Unicode Collation Algorithm maintained by the UTC, to prevent unanticipated "drift" between the two standards.

2. Locale Extensions

WG20 is developing TR 14652, "Specification Method for Cultural Conventions". The specifications defined in 14652 are very closely modeled on the definition of locale in ISO 9945, the POSIX standard, as reflected in related documentation such as XPG4 from X/Open. In effect, it was conceived of as an extension to the locale constructs: to add more internationalization elements, as mentioned in TR 11017, into a formal syntactic construct that could be used to generate machine-readable locale definitions. So it adds definitions for LC_NAME, LC_ADDRESS, LC_IDENTIFICATION, etc.
to the older groupings LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. Furthermore, it attempts to extend the preexisting categories with new keywords to deal with collation as defined in 14651, with the new large character set defined in 10646, and with new internationalization issues such as monetary formats involving the euro sign.

It is pretty clear that the impetus and rationale for 14652 derive from the POSIX side. As such, it logically belongs in SC22/WG15 for further development, rather than in SC2. The participants in SC2, while interested in internationalization issues related to locales, have no particular interest or expertise in the POSIX-specific syntax extensions covered by 14652, nor do they have any expertise in ISO 9945 itself, which has to be closely tracked in the development of 14652 to avoid superfluous inconsistencies. SC2 also has no established history of working liaison relationships with SC22/WG15 -- a situation which would bode ill for trying to develop what is effectively a POSIX extension in a committee ill-suited to do so.

3. Character Properties

The most contentious issue regarding DTR 14652 is the effort to extend LC_CTYPE to cover the repertoire of ISO 10646-1. The contending positions effectively reflect a worldview divide among the participants regarding character properties:

Position A: Character properties have not traditionally been covered by character encoding standards, and have not been viewed as the domain of the ISO committee responsible for encoding characters: SC2. Instead, character properties are an implementation issue, traditionally dealt with in the standards most directly concerned with character implementation -- namely the formal language standards -- and are dealt with in ISO by the working groups under SC22. In the context of 14652, the appropriate place to define character properties is LC_CTYPE, where the properties would be usable in a POSIX context as part of locale definitions.
Position B: Character properties for the *universal* character set -- namely ISO 10646 (= Unicode) -- are inherent to *characters*, and should *not* be defined in locales. The locale model and LC_CTYPE were an attempt to provide a mechanism for dealing with properties of characters in alternate encodings, but that model does not scale well for dealing with properties for the universal repertoire of 10646. Furthermore, it is inappropriate to assert that character properties are defined in locales, and are thus subject to locale-specific variation, since such a position would lead to inconsistent and inexplicable differences in application behavior, depending on locale, in ways that have no bearing on the usually understood issues of locale-specific formatting differences, etc. Because character properties are closely tied to the characters themselves, responsibility for defining them should belong with the character encoding committees rather than with the language committees -- and thus in SC2, rather than SC22.

It is clear that among the rather large community of implementers of 10646 (= Unicode), Position B has much more widespread support than Position A. Position A is, however, a vocally held minority opinion among those committed to the extension of the POSIX framework.

In point of actual fact, the *real* work on standardization of 10646 character properties is being done almost entirely by the Unicode Technical Committee, which for years now has been publishing machine-readable tables of character properties and associated technical reports that are in widespread implementation in many products. A very few character properties, most notably "combining" and "mirroring", are also formally maintained by SC2/WG2 in ISO 10646 itself, and those properties are tracked in parallel by the UTC.

On balance, it would seem far preferable to conclude that within JTC1 any responsibility for character properties should belong to SC2, rather than SC22.
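Position B matches what stock implementations already do. For example, Python's standard unicodedata module, which is generated from the UTC's machine-readable data files, answers property queries with no locale in sight:

```python
import unicodedata

# Character properties come from the UTC's machine-readable data files
# (UnicodeData.txt), not from any locale: the answers below are the
# same no matter what locale the program runs under.

print(unicodedata.category('A'))        # 'Lu' -- uppercase letter
print(unicodedata.category('5'))        # 'Nd' -- decimal digit
print(unicodedata.combining('\u0301'))  # 230  -- combining acute accent
print(unicodedata.mirrored('('))        # 1    -- mirrored in bidi text
```

Note that "combining" and "mirrored" are precisely the two properties the text above mentions as formally maintained in ISO 10646 itself and tracked in parallel by the UTC.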
Once again, this is a matter of expertise regarding the huge number of characters in 10646. That expertise is in SC2, not in SC22. And the implementation experience regarding character properties resides in the UTC, which has a firm working relationship with SC2 but no close ties to SC22.

Regarding LC_CTYPE in particular, its maintenance or extension should be remanded to WG15, along with all of DTR 14652, but with the following recommendations. Rather than attempting to independently extend LC_CTYPE definitions to cover 10646, a mechanism should be developed whereby POSIX implementations using LC_CTYPE can make use of the more widespread and better researched and reviewed character property definitions developed by the UTC, in cooperation with SC2/WG2's development of 10646. This should be done by *reference*, rather than by enumerating lists of characters in SC22 standards or TR's, because of the danger of those lists getting out of synch or introducing errors that cause interoperability problems. Furthermore, this practice of dealing with character properties by reference to standards developed by the UTC and/or SC2 should be recommended to *all* the SC22 committees, as the generic way to deal with character properties in formal language standards.

4. Internationalization API Standard

WG20 has a project on the books, 15435, to develop an API standard for internationalization. To date, very little evidence has been proffered that there is any actual demand for such a standard. There is no list of IT companies requesting it to solve some interoperability problem. The big OS and tools vendors are not requesting it. The Linux internationalization community has rejected it in favor of other options. The Java community has no interest -- it already has a sophisticated internationalization architecture.
The Unicode Technical Committee, which has very widespread representation from the implementing community, has indicated zero interest in the 15435 project. No one in WG20 but the project editor seems to be doing any active work to develop the standard, and the committee feedback to date has largely been that the quality of the drafts is poor. Fundamental questions regarding the nature of the API design have not been resolved.

Furthermore, there has been a lot of hand-waving over the issue of how closely tied the proposed API is to the locale extension constructs of DTR 14652. The API under development for 15435 is locale-centric, in that it requires information in an "FDCC-set" defined a la DTR 14652, assuming API behavior will depend on that information, resident in some implementation-defined "database". Modern internationalization libraries have largely eschewed that kind of locale-centric design as too constrained, instead breaking the problem of internationalization support up into more modular designs that separate out different aspects of the problems involved.

Furthermore, the proposed API standard aspires to platform-independent design. That, however, inappropriately conflates the issue of designing appropriate behavior for internationalization with the problem of designing appropriately abstracted API's for that behavior on distinct platforms. In actual practice, implementers tend to make use of available libraries that surface correct internationalization behavior (such as the ICU classes) and then write whatever wrappers are necessary to abstract that behavior into their systems. The days of trying to define complex behavior via ISO API standards, to be rolled out by language compiler vendors in standard C libraries and such, are being overtaken by object-oriented design and software component models.
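The design contrast described above can be sketched in a few lines. Everything here is a hypothetical illustration: the names FDCC_SETS, format_number_locale_centric, and DecimalFormatter are invented for the sketch, and are not drawn from 15435, DTR 14652, or ICU.

```python
# Sketch contrasting the two API styles discussed above.
# All names below are hypothetical illustrations, not any real API.

# Locale-centric style (the DTR 14652 / 15435 model): behavior is keyed
# off a named entry in an implementation-defined database of FDCC-sets.
FDCC_SETS = {
    'da_DK': {'decimal_point': ','},
    'en_US': {'decimal_point': '.'},
}

def format_number_locale_centric(value, locale_name):
    """Look up behavior in the global database by locale name."""
    sep = FDCC_SETS[locale_name]['decimal_point']
    return str(value).replace('.', sep)

# Modular style (ICU-like): the caller holds an explicit formatter
# object; the call itself implies no global database lookup, and the
# pieces (collation, formatting, etc.) are separately composable.
class DecimalFormatter:
    def __init__(self, decimal_point):
        self.decimal_point = decimal_point
    def format(self, value):
        return str(value).replace('.', self.decimal_point)

danish = DecimalFormatter(',')
print(format_number_locale_centric(3.14, 'da_DK'))  # 3,14
print(danish.format(3.14))                          # 3,14
```

Both produce the same output here, but the modular style lets an implementation mix, test, and wrap each behavior independently, which is the direction the text says the industry has taken.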
At this point, WG20's project 15435 should simply be abandoned as a well-intentioned but obsolete project that has no demonstrated need or support for its development.

5. Cultural Registry Standard

WG20 is also charged with the maintenance of the cultural registry standard, ISO 15897. That registry needs a firm review and resolution process to ensure its correctness and market relevance. WG20 should be able to provide the definition of such a resolution process, along the lines provided by ISO 2375 for the character set registry. Once the review is done, and ISO 15897 has been appropriately updated, it should be a stabilized standard, requiring little further work or attention. It will then be the responsibility of the registering agency (DKUUG) to follow the registration process and to make the cultural element registry worthwhile.

6. Identifiers

An issue that WG20 has had to deal with fairly recently is the list of recommended characters for identifiers, in Annex A of TR 10176, "Guidelines for the preparation of programming language standards". Because the list of recommended characters for identifiers is based on the repertoire of ISO 10646, this is another area where repeated maintenance into the future can be foreseen, as the repertoire of 10646 continues to expand.

Once again, because of the location of character expertise regarding all the characters added to 10646, the logical source for recommendations about how to extend the list in Annex A in the future is SC2. This is supported by the additional fact that determining which characters are and are not appropriate in identifiers implicitly depends on the specification of a constellation of properties for those characters -- again an area in which the expertise is located in SC2. However, there is somewhat of a conundrum here, since the remainder of the content of TR 10176 is clearly in the domain of SC22, and the TR as a whole is inappropriate for maintenance in SC2.
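As an illustration of how identifier rules in a modern language already rest on 10646/Unicode character properties rather than on locale settings or syntax alone, consider Python's built-in identifier test, which follows the Unicode identifier recommendations:

```python
import unicodedata

# Whether a string is a valid identifier is decided from character
# properties of the universal character set: letters from any script
# may start one, digits may not, and punctuation is excluded.
print('naïve'.isidentifier())    # True  -- accented letters allowed
print('変数'.isidentifier())     # True  -- CJK ideographs allowed
print('2fast'.isidentifier())    # False -- a digit cannot start one
print('a-b'.isidentifier())      # False -- hyphen is not allowed

# The decision ultimately rests on properties such as general category:
print(unicodedata.category('ï'))  # 'Ll' -- lowercase letter
```

Every expansion of the 10646 repertoire forces this classification to be revisited for the new characters, which is exactly the maintenance problem described above.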
Perhaps some kind of understanding could be arranged between the SC's to guarantee that modifications to Annex A of TR 10176 are made only with timely, coequal input from SC2. A better solution, in the long run, would be to sever the exact table in Annex A -- which has to track character repertoires and properties that are (or should be) the responsibility of SC2 -- from TR 10176 per se, and instead insert a reference there to a standard list maintained by SC2, either in the context of 10646 itself or in some associated TR to be developed by WG2 for this purpose. That would more appropriately divide the responsibilities between the part of TR 10176 concerned with formal language syntax and design and the part which is attempting to track the universal character encoding repertoire as it expands over time.

Another reason for moving in this direction is the particular interest that the Unicode Technical Committee has in the identifier content problem. The Unicode Standard has detailed recommendations regarding identifiers, and the Unicode Technical Committee is currently working on even more detailed specifications regarding identifiers and identifier-like constructs for use in various contexts on the World Wide Web and the Internet. It is in JTC1's interest to keep this particular technical issue active in a venue, namely SC2/WG2, where the character encoding expertise is available and the working relationship with the UTC is strong.

Even though on the surface it might seem that programming identifier syntax clearly belongs to SC22, the real issue is not the syntax per se (which is quite simple), nor the concept of an identifier and its relation to other programming language constructs (which the UTC and SC2 have little interest in, and consider to have been fixed and decided long ago by the SC22 standards).
No, the *real* issue that remains open and problematical is how to classify and distribute all the thousands of additional characters in 10646, and how to deal with the complex ramifications of including various compatibility characters which may or may not change under various kinds of identifier normalization processes. That is where the UTC and WG2 expertise would be most helpful, and where joint development of Unicode and ISO standards would be most likely to minimize interoperability problems for identifiers in different programming languages and Internet and Web protocols.

This entire issue is, by the way, also of intense interest to the database standards arena, where it is of direct relevance to the SQL standard, for example. So the SC22 working groups are not the only JTC1 groups with an interest in standard, interoperable results in this area for 10646 characters.

7. Case Mapping and Case Folding

WG20 has not spent much time dealing with case mapping and case folding issues, although those clearly have an internationalization angle, because of local differences in case mapping preferences. The one point where this has been dealt with by WG20 is in the LC_CTYPE specification in DTR 14652. This is because LC_CTYPE is the location of the information used by the tolower() and toupper() case mapping transforms for C (and, by extension, other languages). As a result, PDTR 14652 includes tables of case pairs for all of the 10646 characters that have case pairs.

However, the explicit inclusion of these case mappings in the "i18n" LC_CTYPE definition in DTR 14652 has been controversial in the committee, in part because of a small number of unexplained inconsistencies between those tables and the case mappings provided by the Unicode Consortium on its website. The Unicode case mappings are very widely implemented in many products, and are being treated by the industry as a de facto standard.
So it is problematical for DTR 14652 to propose, in a standards document, slightly different case mappings that contradict widespread practice. This is once again an area where the JTC1 standards arena would be better served by references to de facto practice, rather than by trying to reinvent the wheel with long lists in other standards or TR's, subject to the introduction of error or drift that can cause interoperability problems. Perhaps here the SC22 language working groups could work with SC2/WG2 to find a way to make the de facto Unicode tables referenceable through an SC2 TR of some sort, to avoid the synchronization issues of trying to maintain two (huge) lists separately.

The area of case folding is related to case mapping, but is subtly different. WG20 has not dealt with this issue, but it is clear that the SC22 language working groups need to. In particular, COBOL, Pascal, and other languages that have case-insensitive identifiers need to be able to do reliable case folding during the parsing/lexing phases of program text interpretation. For that, they need reliable definitions of case folding as applied to the 10646 characters allowed inside identifiers for each language.

While WG20 has not touched on this issue and the SC22 working groups are just starting to search for an answer, the Unicode Technical Committee and the IETF have moved ahead, creating de facto solutions that will see widespread implementation in the near future. The Unicode Technical Committee has already published CaseFolding.txt, a machine-readable file with recommendations on exactly how to do case folding for all Unicode 3.0 characters (i.e., 10646-1:2000 characters).
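The mapping/folding distinction is easy to see in an implementation that follows the UTC data files. Python's str.lower() applies case mapping, while str.casefold() applies full case folding; the German sharp s shows why folding, not mapping, is what case-insensitive identifier matching needs:

```python
# Case mapping (lower/upper) and case folding are subtly different.
# German sharp s is the classic example: its lowercase mapping is
# itself, but it *folds* to "ss", so that case-insensitive
# comparison against "SS"/"ss" works.

s1, s2 = 'STRASSE', 'Straße'

print(s1.lower())     # 'strasse'
print(s2.lower())     # 'straße'  -- mapping keeps the sharp s
print(s2.casefold())  # 'strasse' -- folding expands it

# Reliable case-insensitive comparison therefore needs folding:
print(s1.lower() == s2.lower())        # False
print(s1.casefold() == s2.casefold())  # True
```

A parser for a case-insensitive language that used only tolower()-style mapping would treat these two spellings as different identifiers; CaseFolding.txt exists to pin down the folding behavior for the entire repertoire.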
The SC22 committees should review that file, and the associated case mapping information available in UnicodeData.txt and SpecialCasing.txt -- also available on the Unicode website -- before concluding that new standardization efforts need to be initiated in SC22 (whether in WG20 or in other working groups) to repeat the work involved in creating those files, which are already freely available to all implementers.

The UTC and the IETF are currently working on the even thornier problem of determining how best to define identifiers in a context (such as internationalized domain names) where certain characters are disallowed (such as punctuation that has other reserved uses in URL syntax), where case folding is required, where normalization of data is also required (disallowing equivalent sequences that might otherwise appear identical), and where even visual look-alikes of otherwise different characters are to be avoided if possible, because of the confusion they can pose for user entry and the possibility of spoofing. This is an area where intimate knowledge of all the characters in 10646, and of the interaction of their properties and appearances, is required.

Yet again, it would behoove the SC22 working groups to participate in the joint UTC/IETF effort in this area through review and feedback, rather than trying to reinvent the wheel in a committee context where less relevant expertise would be available to start with.
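The normalization complication mentioned above -- compatibility characters that change under some normalization forms but not others -- can be illustrated with Python's standard unicodedata module:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character: it is
# preserved by canonical normalization (NFC) but decomposed by
# compatibility normalization (NFKC).
lig = '\ufb01'   # the "fi" ligature, a single character

print(unicodedata.normalize('NFC', lig) == lig)   # True -- unchanged
print(unicodedata.normalize('NFKC', lig))         # 'fi' -- two characters

ident1 = 'de' + lig + 'ne'   # "define" spelled with the ligature
ident2 = 'define'
print(ident1 == ident2)                                 # False
print(unicodedata.normalize('NFKC', ident1) == ident2)  # True
```

Whether these two spellings name the same identifier thus depends entirely on which normalization form, if any, the identifier definition mandates -- precisely the kind of decision that requires the character-level expertise of the UTC and WG2.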