CESU-8: to document or not

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon Sep 17 2001 - 13:54:02 EDT


Folks,

I've been following this thread for a while, and it seems that I can make a small contribution.

Several comments have been made about why we should NOT document this and give it some kind of official imprimatur. I agree that it will generate more confusion and may be used in unforeseen ways by unwary people who don't take time to read the documentation.

However: the claim that this encoding is confined to the Evil Doers Who Practice It is faulty. Here at webMethods we have something like 90 product "adapters": pieces of software that talk to a specific application. As a result, I am aware of the vast range of variation in character set and encoding support available to product designers. One problem that we are approaching is that the changes to UTF-8 (to prohibit non-shortest-form) *are* changes and that the products I work on do not have the option of rejecting "malformed" data. Adapters must accept the way in which Oracle or PeopleSoft (for example) have implemented their systems and deal with it correctly, with minimal loss of data.

By providing a documented, standard way to refer to legacy versions of these products and their encodings, I can more readily rely on having a well-documented range of protocols and procedures for converting and validating data exchanged with these systems. The argument that these products "merely support an older version of the Unicode standard" is specious, because the older versions merely made the six-byte form permissible by omission (the six-byte form was *never* the preferred form). The older versions say nothing about mixing the two forms, for example. Whether we dignify this encoding with a name or not, someone needs to fully document the rules and provide a stable basis for supporting this usage.
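To make the two forms concrete, here is a minimal sketch in Python (not anything from our adapters) of a lenient reader that accepts the standard four-byte UTF-8 form, the six-byte surrogate-pair form, or a mix of both in the same byte stream, plus a writer that emits the six-byte form for talking back to a legacy system. The function names decode_lenient and encode_six_byte_form are hypothetical, and the approach leans on Python's "surrogatepass" error handler, so treat it as an illustration under those assumptions rather than a reference implementation. Note that Python's UTF-8 decoder still rejects non-shortest forms, so this sketch does not cover that part of the problem.

```python
def decode_lenient(data: bytes) -> str:
    """Decode bytes that may mix 4-byte UTF-8 and 6-byte surrogate-pair forms."""
    # "surrogatepass" lets each 3-byte encoded surrogate (ED A0..BF xx) through
    # as a lone surrogate code point instead of raising an error.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins adjacent high/low surrogates into one
    # supplementary character. The strict decode raises UnicodeDecodeError if
    # an unpaired surrogate is left over.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def encode_six_byte_form(text: str) -> bytes:
    """Encode text so each supplementary character becomes two 3-byte units."""
    units = text.encode("utf-16-be")  # one 16-bit code unit per 2 bytes
    out = bytearray()
    for i in range(0, len(units), 2):
        code_unit = int.from_bytes(units[i:i + 2], "big")
        # Encode every UTF-16 code unit separately, surrogates included.
        out += chr(code_unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+1F600 in both forms: standard UTF-8 and the six-byte form.
four_byte = b"\xf0\x9f\x98\x80"
six_byte = b"\xed\xa0\xbd\xed\xb8\x80"
assert decode_lenient(four_byte) == decode_lenient(six_byte) == "\U0001F600"
```

A real adapter would also need a policy for unpaired surrogates (reject, replace, or pass through), which is exactly the kind of rule that needs to be written down somewhere.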

For what it's worth, I thank Toby for braving the heat to produce this document. As a practical matter, I don't support the creation of new CESU-8 systems and will be grappling for a place on the walls to throw hot oil down on the barbarians who propose them, but for supporting our existing legacies (which cannot merely be extinguished "in the next release"), I think the effort is valuable. And the wording of the UTR seemed restrictive enough that I, at least, can support it (since it provides me the ammunition to oppose CESU-8's adoption in practice).

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone) +1 408.210.3659 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.
webMethods--THE Software Integration Company


