Unicode Frequently Asked Questions

Internationalization

Q: In the past, we have just handed off our code to a translation agency. What's wrong with that?

Often, companies develop a first version of a program or system to just deal with English. When it comes time to produce a first international version, a common tactic is to just go through all the lines of code, and translate the literal strings.

While this may work once, it is not a pattern that you want to follow. Not all literal strings get translated, so this process requires human judgment, and is time-consuming. Each new version is expensive, since people have to go through the same process of identifying the strings that need to be changed. In addition, since there are multiple versions of the source code, maintenance and support become expensive. Moreover, there is a high risk that a translator may introduce bugs by mistakenly modifying code.

Q: What is the IT industry's best practice for translation now?

The general technique used now is to internationalize the programs. This means to prepare them so that the code never needs modification—separate files contain the translatable information. This involves a number of modifications to the code:

  1. move all translatable strings into separate files called resource files, and make the code access those strings when needed. These resource files can be flat text files, databases, or even code resources, but they are completely separate from the main code, and contain nothing but the translatable data.
  2. change variable formatting to be language-independent. This means that dates, times, numbers, currencies, and messages all call functions to format according to local language and country requirements.
  3. change sorting, searching and other types of processing to be language-independent.
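As a rough sketch of step 1, the lookup side of a resource file can be as simple as a table keyed by language and message ID. The Python code below is purely illustrative; the languages, keys, and the `get_string` helper are invented for this sketch, and a real system would load the tables from external resource files rather than an in-code dict:

```python
# Minimal sketch: translatable strings live in per-language resource
# tables, and the code looks them up by key at run time. The RESOURCES
# dict stands in for external resource files (flat text, databases,
# etc.); the code itself contains no translatable text.

RESOURCES = {
    "en": {"greeting": "Hello, {name}!", "items": "{count} items found"},
    "de": {"greeting": "Hallo, {name}!", "items": "{count} Treffer gefunden"},
}

def get_string(lang, key, **params):
    """Fetch a translatable string by key and fill in its parameters."""
    table = RESOURCES.get(lang, RESOURCES["en"])  # fall back to English
    return table[key].format(**params)

print(get_string("de", "greeting", name="Anna"))  # Hallo, Anna!
print(get_string("fr", "items", count=3))         # falls back: 3 items found
```

Adding a new language then means adding one more resource table (or file), with no change to the program logic, which is exactly the property step 1 is after.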

Once this process is concluded, you have an internationalized program. Localizing that program then involves no changes to the source code. Instead, just the translatable files are typically handed off to contractors or translation agencies to modify. The initial cost of producing internationalized code is somewhat higher than localizing to a single market, but you only pay that cost once. The cost of doing a localization, once your code is internationalized, is a fraction of the previous cost—and avoids the considerable cost of maintenance and source code control for multiple code versions.

Q: What role does Unicode play in internationalization?

Unicode is the new foundation for this process of internationalization. Older code pages were difficult to use, and had inconsistent definitions for characters. Internationalizing code built on them is complex, since you have to support different character sets—with different architectures—for different markets.

But modern business requirements are even stronger; programs have to handle characters from a wide variety of languages at the same time: the EU alone requires several different older character sets to cover all its languages. Mixing older character sets together is a nightmare, since all data has to be tagged, and mixing data from different sources is nearly impossible to do reliably.

With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. Since Unicode has a single definition for each character, you don't get data corruption problems that plague mixed codeset programs. Since it handles the characters for all the world markets in a uniform way, it avoids the complexities of different character code architectures. All of the modern operating systems, from PCs to mainframes, support Unicode now or are actively developing support for it. The same is true of databases, as well.

Q: What was wrong with using classical character sets for application programs?

Different character sets have very different architectures. In many, even simply detecting which bytes form a character is a complex, contextually-dependent process. That means either having multiple versions of the program code for different markets, or making the program code much, much more complicated. Both of these choices involve development, testing, maintenance, and support problems. These make the non-US versions of programs more expensive, and delay their introduction, causing significant loss of revenue.
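As a concrete illustration of why byte scanning is contextually dependent in legacy sets, consider Shift-JIS (this Python sketch is an addition to the FAQ text, not part of it):

```python
# In Shift-JIS, the second byte of a two-byte character can fall in the
# ASCII range, so naive byte-by-byte scanning misreads it. The kanji 表
# (U+8868) encodes as 0x95 0x5C, and 0x5C is the ASCII backslash -- a
# classic source of bugs in path handling and escaping.
encoded = "表".encode("shift_jis")
print(encoded)            # b'\x95\\'
print(0x5C in encoded)    # True: a "backslash" byte inside one character

# In UTF-8, by contrast, every byte of a multi-byte sequence is >= 0x80,
# so an ASCII byte value never occurs inside a non-ASCII character:
print(all(b >= 0x80 for b in "表".encode("utf-8")))  # True
```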

Q: What was wrong with using classical character sets for databases?

Classical character sets handle only a few languages at a time, and mixing languages was very difficult or impossible. In today's markets, where data is mixed from many sources all around the world, that strategy fails badly. The code for a simple letter like "A" varies wildly between different sets, making searching, sorting, and other operations very difficult. There is also the problem of tagging every piece of textual data with a character set, and of corruption when mixing text from different character sets.
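To make the "A" point concrete, here is a small Python sketch (an illustration added here, using cp500 as a representative EBCDIC code page):

```python
# The same letter "A" has different byte values in different classical
# character sets: 0x41 in ASCII/Latin-1, but 0xC1 in EBCDIC (cp500).
print("A".encode("ascii"))   # b'A'  (byte 0x41)
print("A".encode("cp500"))   # b'\xc1'

# Once decoded into Unicode, both sources yield the identical code point,
# so comparisons and sorting no longer depend on the source charset:
print(b"\x41".decode("ascii") == b"\xc1".decode("cp500"))  # True
```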

Q: What is different about Unicode?

Unicode provides a unique encoding for every character. Once your data is in Unicode, it can all be handled the same way—sorted, searched, and manipulated without fear of data corruption.
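For example, the same accented letter arriving from two different legacy encodings converges on one code point once decoded (a Python sketch added for illustration):

```python
# "Ä" is byte 0xC4 in Latin-1 but byte 0x80 in Mac Roman. Decoded into
# Unicode, both become the single code point U+00C4, so downstream
# searching and sorting see one consistent value.
from_latin1    = b"\xc4".decode("latin-1")
from_macroman  = b"\x80".decode("mac_roman")
print(from_latin1, from_macroman, from_latin1 == from_macroman)  # Ä Ä True
print(hex(ord(from_latin1)))                                     # 0xc4
```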

Q: What about East Asian support?

Unicode incorporates the characters of all the major government standards for ideographic characters from Japan, Korea, China, and Taiwan, and more. The Unicode Standard has over 80,000 ideographic characters. The Unicode Consortium actively works with the Ideographic Research Group (IRG) of ISO SC2/WG2 to define additional sets of ideographic characters for inclusion in future versions.

Q: So all I need is Unicode, right?

Unicode is not a magic wand; it is a standard for the storage and interchange of textual data. Somewhere there has to be code that recognizes and provides for the conventions of different languages and countries. These conventions can be quite complex, and developing code and data formats for them requires considerable expertise. Changing conditions and new markets also require considerable maintenance and development. Usually this support is provided in the operating system, or with a set of code libraries.

Q: Unicode has all sorts of features: combining marks, bidirectionality, input methods, surrogates, Hangul syllables, etc. Isn't it a big burden to support?

Unicode by itself is not complicated to implement—it all depends on which languages you want to support. The character repertoire you need fundamentally determines the features you need to have for compliance. If you just want to support Western Europe, you don't need to have much implementation beyond what you have in ASCII.

Which further characters you need to support really depends on the languages you want, and on your system requirements (servers, for example, may not need input or display). For example, if you need to input East Asian languages, you have to have input methods. If you display Arabic or Hebrew characters, then you need the bidirectional algorithm. For normal applications, of course, much of this will be handled by the operating system for you.
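One way to check whether a given piece of text even triggers the bidirectional requirement is to inspect each character's bidi class. The sketch below uses Python's `unicodedata` module; the `needs_bidi` helper is invented here for illustration:

```python
import unicodedata

# unicodedata.bidirectional() reports each character's bidi class.
# Classes "R" (e.g. Hebrew letters) and "AL" (Arabic letters) are the
# strong right-to-left classes that call for bidirectional display.
def needs_bidi(text):
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

print(needs_bidi("Hello"))   # False
print(needs_bidi("שלום"))    # True  (Hebrew letters, class "R")
print(needs_bidi("مرحبا"))   # True  (Arabic letters, class "AL")
```

A display layer could use a check like this to decide when to invoke the (much more involved) full bidirectional algorithm.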

Q: What level of support should I look for?

Unicode support really divides up into two categories: server-side support and client-side support. The requirements for Unicode support in these two categories can be summarized as follows (although you may only need a subset of these features for your projects):

Full server-side Unicode support

This consists of:
  • Storage and manipulation of Unicode strings.

  • Conversion facilities to a full complement of other charsets (8859-x, JIS, EBCDIC, etc.)

  • A full range of formatting/parsing functionality for numbers, currencies, date/time and messages for all locales you need.

  • Message cataloging (resources) for accessing translated text.

  • Unicode-conformant collation, normalization, and text boundary (grapheme, word, line-break) algorithms.

  • Multiple locales/resources available simultaneously in the same application or thread.

  • Charset-independent locales (all Unicode characters usable in any locale).
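To illustrate the normalization item in the list above: Unicode allows some characters to be represented either precomposed or as a base letter plus combining marks, and conformant software must treat the canonically equivalent forms alike. A minimal Python sketch using the standard `unicodedata` module:

```python
import unicodedata

# "é" can be stored precomposed (U+00E9) or decomposed as "e" plus a
# combining acute accent (U+0065 U+0301). The two forms are canonically
# equivalent, but raw string comparison does not see that; normalizing
# to a common form (NFC or NFD) makes comparison and search reliable.
precomposed = "\u00e9"
decomposed  = "e\u0301"

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Server-side text processing typically normalizes incoming text once at the boundary, so that collation, searching, and boundary analysis all operate on a single canonical form.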

Full client-side support

This consists of all the same features as server-side support, plus GUI support:
  • Displaying, printing and editing Unicode text.

    This requires:

    • BIDI display if Arabic and Hebrew characters are supported.

    • Character shaping if scripts such as Arabic or the scripts of India are supported.

  • Inputting text (e.g. with Japanese input methods)

  • Full incorporation of these facilities into the windowing system and the desktop interface.