Beyond UTR22: Complex Legacy-to-Unicode Mappings
Jonathan Kew - SIL International
Purpose: To investigate needs for complex mapping between non-standard legacy encodings and Unicode, and to explore a processing model appropriate for such mappings. While Unicode was designed to facilitate easy mapping of data in most industry-standard legacy encodings, there are many "custom fonts" in use around the world which effectively represent additional, non-standard encodings. In some cases, these may encode many presentation forms, such as variants of overstriking accents, or characters encoded in an order that does not match Unicode. The standard format for mapping descriptions presented in UTR22 is not adequate to support such encodings, especially when round-trip conversion is required. Likewise, tools based on this standard are not powerful and flexible enough. This paper, an updated version of one presented at IUC22, will illustrate the issues by considering the types of complexity seen in a variety of custom legacy encodings. It then describes a processing model and description language we have developed to address such data conversion needs, and shows how this can be applied to help users migrate from legacy systems with custom fonts to standard, Unicode-based systems. In conclusion, I will suggest how the UTR22 mapping description format might be extended to support complex mapping processes, and briefly demonstrate some software tools based on this model for complex mappings.
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. 12 December 2002, Webmaster