From: John Tisdale (jtisdale@ocean.org)
Date: Tue Aug 17 2004 - 22:27:59 CDT
Thanks everyone for your helpful feedback on the first draft of the MSDN
article. I couldn't fit in all of the suggestions as the Unicode portion is
only a small piece of my article. The following is the second draft based on
the corrections, additional information and resources provided.
Also, I would like to get feedback on the most accurate/appropriate term(s)
for describing the CCS, CEF and CES (layers, levels, components, etc.).
I am under a tight deadline and need to collect any final feedback rather
quickly before producing the final version.
Special thanks to Asmus for investing a lot of his time to help.
Thanks, John
--

Unicode Fundamentals

A considerable amount of misinformation about Unicode has proliferated among developers, and the problem has only compounded over time. Deploying an effective Unicode solution begins with a solid understanding of the fundamentals. It should be noted that Unicode is far too complex a topic to cover in any depth here; additional resources will be given to take the reader beyond the scope of this article.

Early character sets were very limited in scope. ASCII required only 7 bits to represent its repertoire of 128 characters. The so-called ANSI character sets extended this to 8 bits, representing 256 characters while remaining backward compatible with ASCII. Countless other character sets emerged to represent the characters needed by various languages and language groups. The growing complexity of managing numerous international character sets escalated the need for a much broader solution: one that represented the characters of virtually all written languages in a single character set.

Two standards emerged at about the same time to address this demand. The Unicode Consortium published the Unicode Standard, and the International Organization for Standardization (ISO) offered the ISO/IEC 10646 standard. Fortunately, these two standards bodies synchronized their character sets some years ago and continue to do so as new characters are added. Yet, although the character sets are mapped identically, the standards themselves vary in many ways (which are beyond the scope of this article). It should be noted that if you implement Unicode you have fully implemented ISO/IEC 10646, but the inverse isn't necessarily the case, as Unicode imposes additional semantics and restrictions (e.g., character semantics, normalization, bidirectionality). In most cases, when someone refers to Unicode they are discussing the collective offerings of these two standards bodies (whether they realize it or not). Formally, these are distinct standards, but the differences are not relevant for the purposes of this article, so I will use the term Unicode in a generic manner to refer to these collective standards.

The design constraints for Unicode were demanding. Consider that if all of the world's characters were placed into a single repertoire, a fixed-width encoding could have required 32 bits per character, which would have made the solution impractical for most computing applications. The solution had to provide broad character support while offering considerable flexibility for encoding its characters in different environments and applications. To meet this challenge, Unicode was designed with three distinct layers or components. Understanding the distinctions among these components is critical to leveraging Unicode and deploying effective solutions. They are the coded character set (CCS), the character encoding forms (CEF) and the character encoding schemes (CES). In brief, the coded character set contains all of the characters in Unicode along with a corresponding integer by which each is referenced. The character encoding forms transform those character references into computer-readable code units. The character encoding schemes establish how the code units produced by an encoding form are serialized into bytes so that they can be transmitted and stored.
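To make these three layers concrete, here is a minimal sketch (in Python, chosen purely for illustration; the article itself is not tied to any particular language) that follows a single character through the coded character set, the encoding forms, and a big-endian encoding scheme:

    # A minimal sketch of the three layers, using Python purely for illustration.
    text = "A"                        # an abstract character

    # Layer 1 - coded character set (CCS): the character's code point.
    code_point = ord(text)            # 0x0041, conventionally written U+0041
    print(f"U+{code_point:04X}")      # prints: U+0041

    # Layer 2 - character encoding forms (CEF): the same code point expressed
    # as 8-, 16- or 32-bit code units.
    # Layer 3 - character encoding schemes (CES): those code units serialized
    # into a concrete byte order (big-endian here).
    for form in ("utf-8", "utf-16-be", "utf-32-be"):
        print(form, text.encode(form).hex(" "))
    # prints:
    # utf-8 41
    # utf-16-be 00 41
    # utf-32-be 00 00 00 41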
Coded Character Sets

The fact that so many developers suggest that Unicode is a 16-bit character set illustrates how widely Unicode is misunderstood (and how often these three layers go undifferentiated). The truth is that the Unicode character set can be encoded using 8, 16 or 32 bits (if you get nothing else out of this article, at least get that point straight, and help hasten the demise of this misconception by passing it on to others).

A coded character set is a mapping from a repertoire of abstract characters to a set of nonnegative integers (not necessarily contiguous) in the range 0 through 1,114,111, called code points. The Unicode Standard contains one and only one coded character set (which is precisely synchronized with the ISO/IEC 10646 character set). This character set contains most characters in use by most written languages (including some dead languages), along with special characters used for mathematical and other specialized applications. Each character in Unicode is represented by a code point. These integer values are conventionally written as U+ followed by the code point in hexadecimal. For example, the English uppercase letter A is represented as U+0041. If you are using Windows 2000, XP or 2003, you can run the charmap utility to see how characters are mapped in Unicode on your system. These operating systems are built on UTF-16 encoded Unicode.

Character Encoding Forms

The second component of Unicode is the character encoding forms. Their purpose is to map the code points of the character repertoire to sequences of code units that can be represented in a computing environment (using fixed-width or variable-width code units). The Unicode Standard provides three forms for encoding its repertoire: UTF-8, UTF-16 and UTF-32. You will often find references to UCS-2 and UCS-4. These are encoding forms defined by ISO/IEC 10646 (UCS-4 is essentially equivalent to UTF-32, while UCS-2 is a fixed-width 16-bit form that corresponds to UTF-16 without the surrogate mechanism described below). I will not discuss the distinctions and merits of each, and will simply note that most implementations today use UTF-16 and UTF-32 (even though some are occasionally mislabeled as UCS-2 and UCS-4).

As you might expect, UTF-32 is a 32-bit encoding form. That is, each character is encoded using 4 bytes and 4 bytes only. Although this method provides fixed-width encoding, the overhead in wasted system resources (memory, disk space, transmission bandwidth) is significant enough to limit its adoption, as at least half of the 32 bits will contain zeros in the majority of applications. Except in some UNIX operating systems and specialized applications with specific needs, UTF-32 is seldom implemented as an end-to-end solution (yet it does have its strengths in certain applications).

UTF-16 is the default means of encoding the Unicode character repertoire in many environments (which has perhaps played a role in the misnomer that Unicode is a 16-bit character set). As the name suggests, it is based on 16-bit code units (each code unit is represented by 2 bytes). But this isn't the whole story. Since 16 bits can address only 65,536 code points, you might guess that reaching any character beyond those would force you to use UTF-32. That isn't the case: UTF-16 can combine pairs of 16-bit code units when a single 16-bit code unit is inadequate. Code units paired in this manner are called surrogates. In most cases a single 16-bit code unit is adequate, because the most commonly used characters in the repertoire are placed in what is known as the Basic Multilingual Plane (BMP), which is entirely accessible with a single 16-bit code unit. To access characters outside the BMP, you combine a pair of 16-bit code units: a high surrogate followed by a low surrogate. Unicode sets aside 1,024 high surrogate values and 1,024 (non-overlapping) low surrogate values. Together, surrogate pairs can address 1,048,576 supplementary code points (in case you are doing the math, 32 of these are reserved as noncharacters, leaving 1,048,544 available for character assignment). So, UTF-16 is capable of representing the entire Unicode character set, given the extensibility that surrogates provide.
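The surrogate mechanism is easier to see with a worked example. The following sketch (again Python, purely for illustration) applies the surrogate calculation from the Unicode Standard to U+1D11E (MUSICAL SYMBOL G CLEF), a character outside the BMP, and checks the result against the standard library's UTF-16 encoder:

    # Surrogate pair arithmetic for a character outside the BMP.
    code_point = 0x1D11E                     # MUSICAL SYMBOL G CLEF

    # Subtract 0x10000, leaving a 20-bit offset; the top 10 bits select the
    # high surrogate and the bottom 10 bits select the low surrogate.
    offset = code_point - 0x10000            # 0x0D11E
    high = 0xD800 + (offset >> 10)           # 0xD834
    low = 0xDC00 + (offset & 0x3FF)          # 0xDD1E
    print(f"U+{code_point:X} -> {high:04X} {low:04X}")
    # prints: U+1D11E -> D834 DD1E

    # The built-in UTF-16 encoder produces the same pair of code units.
    assert chr(code_point).encode("utf-16-be") == bytes.fromhex("D834DD1E")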
UTF-8 encoding was designed for applications built on 8-bit platforms that need to support Unicode. Because the first 128 characters in the Unicode repertoire precisely match the ASCII character set, UTF-8 affords the opportunity to maintain ASCII compatibility while significantly extending its scope. UTF-8 is a variable-width encoding form based on byte-sized code units, using 1 to 4 bytes per character. (You will occasionally run across the term octet in relation to Unicode. It is defined by the ISO/IEC 10646 standard and is synonymous with the term byte as used in the Unicode Standard: an 8-bit byte.) In UTF-8, the high bits of each byte indicate where in the code unit sequence that byte belongs; distinct ranges of byte values are reserved for the leading byte and the trailing bytes of a sequence. By using sequences of up to four bytes per code point, UTF-8 is able to represent the entire Unicode character repertoire.

You will occasionally see references to an encoding form labeled UTF-7. This is a specialized form (more of a derivative) that remains fully ASCII-compatible for applications such as email systems designed to handle only 7-bit ASCII data (the eighth bit is always 0 to ensure no loss of data). As such, it is not part of the current definition of the Unicode Standard. See Figure 1 for an illustration of how these encoding forms represent data.
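The variable-width structure of UTF-8 is easiest to see in the bit patterns themselves. This sketch (Python, purely for illustration) prints the UTF-8 byte sequences for characters that require one, two, three and four bytes; note how the leading byte announces the length of the sequence and every trailing byte begins with the bits 10:

    # UTF-8 uses 1 to 4 byte-sized code units per character.  The high bits
    # of each byte mark its role: 0xxxxxxx (ASCII), 110xxxxx / 1110xxxx /
    # 11110xxx (lead byte of a 2-, 3- or 4-byte sequence), 10xxxxxx (trailing).
    samples = [
        "A",           # U+0041, 1 byte (ASCII)
        "\u00E9",      # U+00E9, 2 bytes (e with acute accent)
        "\u20AC",      # U+20AC, 3 bytes (euro sign)
        "\U0001D11E",  # U+1D11E, 4 bytes (outside the BMP)
    ]
    for ch in samples:
        encoded = ch.encode("utf-8")
        bits = " ".join(f"{byte:08b}" for byte in encoded)
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()} -> {bits}")
    # prints:
    # U+0041 -> 41 -> 01000001
    # U+00E9 -> C3 A9 -> 11000011 10101001
    # U+20AC -> E2 82 AC -> 11100010 10000010 10101100
    # U+1D11E -> F0 9D 84 9E -> 11110000 10011101 10000100 10011110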
Character Encoding Schemes

Now that we know how the Unicode coded character set and the character encoding forms are constructed, let's examine the final element of the Unicode framework. A character encoding scheme provides a reversible transformation between sequences of code units and sequences of bytes. In other words, the encoding scheme serializes the code units into sequences of bytes that can be transmitted and stored as computer data. Whenever a data type is larger than a single byte, bytes must be combined, and how they are combined varies depending on a computer's architecture and the operating system running on top of it. There are two primary byte orders: big-endian (BE, most significant byte first) and little-endian (LE, least significant byte first). See Figure 2 for a synopsis of these byte orders. Intel microprocessor architecture generally provides native support for little-endian byte order (as do most Intel-compatible systems), while many RISC-based processors natively support big-endian byte ordering. Some architectures, such as PowerPC, are designed to support either method.

UTF-16 and UTF-32 provide the means to support either byte-sequencing method. The issue does not arise with UTF-8, because its code units are individual bytes and the sequencing information is carried in the bytes themselves (with bounded look-ahead). The byte order can be indicated with an internal file signature, the byte order mark (BOM), U+FEFF. A BOM is not only unnecessary in UTF-8, it also destroys ASCII transparency (yet some development tools automatically include a BOM when saving a file using UTF-8 encoding). Figure 3 illustrates the seven Unicode encoding schemes. To learn more about character encoding, see Unicode Technical Report #17 at http://www.unicode.org/reports/tr17/.

Choosing an Encoding Solution

In developing for the Web, most of your choices for Unicode encoding schemes will already have been made for you when you select a protocol or technology. Yet you may find instances in which you have the freedom to select the scheme for your application (when developing custom applications, APIs, and so on). When transferring Unicode data to a Web client, such as a browser, you will generally want to use UTF-8, because ASCII compatibility carries a high value in the multi-platform world of the Web. As such, HTML and current versions of Internet Explorer running on Windows 2000 or later use the UTF-8 encoding form. If you try to force UTF-16 encoding on IE, you will encounter an error or it will default to UTF-8 anyway.

Windows NT and later, as well as SQL Server 7 and 2000, XML, Java, COM, ODBC, OLE DB and the .NET Framework, are all built on UTF-16 Unicode encoding. For most applications, UTF-16 is the ideal solution: it is more efficient than UTF-32 while generally providing the same scope of character support. There are cases where UTF-32 is the preferred choice. If you are developing an application that must perform intense processing or complex manipulation of byte-level data, the fixed-width characteristic of UTF-32 can be a valuable asset. The extra code and processor bandwidth required to accommodate variable-width code units can outweigh the cost of using 32 bits to represent each code unit. In such cases, the internal processing can be done using UTF-32 and the results can be transmitted or stored in UTF-16, since Unicode provides lossless transformation between these encoding forms (although there are technical considerations that need to be understood before doing so, which are beyond the scope of this article). A short sketch at the end of this article illustrates these byte-order and size tradeoffs.

For a more detailed explanation of Unicode, see the Unicode Consortium's article The Unicode® Standard: A Technical Introduction (http://www.unicode.org/standard/principles.html) as well as Chapter 2 of the Unicode Consortium's The Unicode Standard, Version 4.0 (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178). If you have specific questions about Unicode, I recommend joining the Unicode Public email distribution list at http://www.unicode.org/consortium/distlist.html.
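Finally, here is the closing sketch mentioned above (Python, purely for illustration; the sample strings are arbitrary). It shows how a BOM lets a decoder detect byte order, how the storage cost of UTF-8, UTF-16 and UTF-32 depends on the text being encoded, and that the encoding forms round-trip losslessly:

    # Byte order: the same code point (U+20AC, the euro sign) serialized
    # under the big-endian and little-endian UTF-16 encoding schemes.
    print("\u20AC".encode("utf-16-be").hex(" "))    # prints: 20 ac
    print("\u20AC".encode("utf-16-le").hex(" "))    # prints: ac 20

    # A leading BOM (U+FEFF) tells the decoder which byte order was used.
    assert b"\xfe\xff\x20\xac".decode("utf-16") == "\u20AC"   # big-endian BOM
    assert b"\xff\xfe\xac\x20".decode("utf-16") == "\u20AC"   # little-endian BOM

    # Storage cost depends on the text: ASCII-heavy data favors UTF-8, while
    # text outside the Latin range narrows or reverses the gap.
    for label, text in [("ASCII", "Hello, world!"), ("Japanese", "こんにちは")]:
        sizes = {f: len(text.encode(f)) for f in ("utf-8", "utf-16-le", "utf-32-le")}
        print(label, sizes)
    # prints:
    # ASCII {'utf-8': 13, 'utf-16-le': 26, 'utf-32-le': 52}
    # Japanese {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}

    # The encoding forms are losslessly interchangeable: processing in UTF-32
    # and storing or transmitting in UTF-16 preserves every code point.
    original = "A\u20AC\U0001D11E"
    via_utf32 = original.encode("utf-32-le").decode("utf-32-le").encode("utf-16-le")
    assert via_utf32.decode("utf-16-le") == original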