MSDN Article, Second Draft

From: John Tisdale (jtisdale@ocean.org)
Date: Tue Aug 17 2004 - 22:27:59 CDT


    Thanks everyone for your helpful feedback on the first draft of the MSDN
    article. I couldn't fit in all of the suggestions as the Unicode portion is
    only a small piece of my article. The following is the second draft based on
    the corrections, additional information and resources provided.

    Also, I would like to get feedback on the most accurate/appropriate term(s)
    for describing the CCS, CEF and CES (layers, levels, components, etc.).

    I am under a tight deadline and need to collect any final feedback rather
    quickly before producing the final version.

    Special thanks to Asmus for investing a lot of his time to help.

    Thanks, John

    --
    Unicode Fundamentals
    A considerable amount of misinformation about Unicode has proliferated among
    developers, and the problem has only compounded over time. Deploying an
    effective Unicode solution begins with a solid understanding of the
    fundamentals. Unicode is far too complex a topic to cover in any depth here,
    so additional resources are provided to take the reader beyond the scope of
    this article.
    Early character sets were very limited in scope. ASCII required only 7 bits
    to represent its repertoire of 128 characters. The so-called ANSI code pages
    pushed this to 8 bits, representing 256 characters while remaining backward
    compatible with ASCII. Countless other character sets emerged to represent
    the characters needed by various languages and language groups. The growing
    complexity of managing numerous international character sets escalated the
    need for a much broader solution: a single character set representing the
    characters of virtually all written languages.
    Two standards emerged at about the same time to address this demand. The
    Unicode Consortium published the Unicode Standard, and the International
    Organization for Standardization (ISO) offered the ISO/IEC 10646 standard.
    Fortunately, these two standards bodies synchronized their character sets
    some years ago and continue to do so as new characters are added.
    Yet, although the character sets are mapped identically, the standards
    governing them vary in many ways (which are beyond the scope of this
    article). It should be noted that if you implement Unicode you have fully
    implemented ISO/IEC 10646, but the inverse isn't necessarily the case, as
    Unicode imposes additional requirements (e.g., character semantics,
    normalization and bi-directionality).
    When someone refers to Unicode, they are usually discussing the collective
    offerings of these two standards bodies (whether they realize it or not).
    Formally, these are distinct standards, but the differences are not relevant
    for the purposes of this article, so I will use the term Unicode in a
    generic manner to refer to these collective standards.
    The design constraints for Unicode were demanding. Consider that if all of
    the world's characters were placed into a single repertoire, it could have
    required 32 bits to encode each one. Yet, that requirement would have made
    the solution impractical for most computing applications. The solution had
    to provide broad character support while offering considerable flexibility
    for encoding its characters in different environments and applications.
    To meet this challenge, Unicode was designed with three distinct layers or
    components. Understanding the distinctions among these components is
    critical to leveraging Unicode and deploying effective solutions. They are
    the coded character set (CCS), the character encoding forms (CEF) and the
    character encoding schemes (CES).
    In brief, the coded character set contains all of the characters in Unicode
    along with a corresponding integer (code point) by which each is referenced.
    Unicode provides three character encoding forms for transforming those code
    points into computer-readable code units. A character encoding scheme
    establishes how the code units of an encoding form are serialized into bytes
    so that they can be transmitted and stored.
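    To make these three layers concrete, here is a minimal sketch (using Python
    purely for brevity; the letter A is an arbitrary example) that traces a
    single character from code point to code units to serialized bytes:

        ch = "A"                              # abstract character
        code_point = ord(ch)                  # CCS: character -> code point
        print(f"U+{code_point:04X}")          # U+0041

        # CEF + CES: encode to 16-bit code units, serialized in each byte order
        print(ch.encode("utf-16-be").hex())   # '0041' (big-endian scheme)
        print(ch.encode("utf-16-le").hex())   # '4100' (little-endian scheme)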
    Coded Character Sets
    The fact that so many developers believe that Unicode is a 16-bit character
    set illustrates how widely Unicode is misunderstood (and how often these
    three layers are not differentiated). The truth is that the Unicode
    character set can be encoded using 8-, 16- or 32-bit code units (if you get
    nothing else out of this article, at least get that point straight - and
    help put this misconception to rest by passing it on to others).
    A coded character set is a mapping from a set of abstract characters (the
    repertoire) to a set of nonnegative integers called code points (in Unicode,
    the range 0 to 1,114,111; not every value in that range is assigned). The
    Unicode Standard contains one and only one coded character set (which is
    precisely synchronized with the ISO/IEC 10646 character set). This character
    set contains most characters in use by the world's written languages
    (including some dead languages), along with special characters used for
    mathematical and other specialized applications.
    Each character in Unicode is represented by a code point. These integer
    values are conventionally written as "U+" followed by the code point in
    hexadecimal. For example, the uppercase Latin letter A is represented as
    U+0041.
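    As a quick illustration (Python again, with arbitrarily chosen sample
    characters), you can inspect the code point assigned to any character:

        for ch in ("A", "é", "€"):
            print(f"{ch!r} -> U+{ord(ch):04X}")
        # 'A' -> U+0041
        # 'é' -> U+00E9
        # '€' -> U+20AC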
    If you are using Windows 2000, XP or 2003, you can run the charmap utility
    to see how characters are mapped in Unicode on your system. These operating
    systems are built on UTF-16 encoded Unicode.
    Character Encoding Forms
    The second component in Unicode is the character encoding forms. Their
    purpose is to map each code point in the coded character set to a sequence
    of code units that can be represented in a computing environment (using
    fixed-width or variable-width code units).
    The Unicode Standard provides three forms for encoding its repertoire
    (UTF-8, UTF-16 and UTF-32). You will often find references to UCS-2 and
    UCS-4. These are encoding forms defined by ISO/IEC 10646 (UCS-4 is
    equivalent to UTF-32, while UCS-2 covers only the first 65,536 code points
    and, unlike UTF-16, has no surrogate mechanism). I will not discuss the
    distinctions and merits of each and will suggest simply that most
    implementations today use UTF-16 and UTF-32 (even though some are
    occasionally mislabeled as UCS-2 and UCS-4).
    As you might expect, UTF-32 is a 32-bit encoding form. That is, each
    character is encoded using 4 bytes and 4 bytes only. Although this method
    provides fixed-width encoding, the overhead in terms of wasted system
    resources (memory, disk space, transmission bandwidth) is significant enough
    to limit its adoption: at least 11 of the 32 bits are always zero, and for
    the most commonly used characters at least half of them are. Except in some
    UNIX operating systems and specialized applications with specific needs,
    UTF-32 is seldom implemented as an end-to-end solution (yet it does have its
    strengths in certain applications).
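    A small sketch of that overhead (Python; the sample string is arbitrary):
    the same ASCII text quadruples in size under UTF-32, and most of the added
    bytes are zero.

        s = "Hello"
        print(len(s.encode("utf-8")))       # 5 bytes
        print(len(s.encode("utf-32-be")))   # 20 bytes
        print(s.encode("utf-32-be").hex())  # 00000048 00000065 ... mostly zeros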
    UTF-16 is the default means of encoding the Unicode character repertoire
    (which has perhaps played a role in the misconception that Unicode is a
    16-bit character set). As you might expect, it is based on 16-bit code units
    (each code unit is represented by 2 bytes). But this isn't the whole story.
    Since 16 bits can address only 65,536 code points, you might guess that if
    you needed to access more characters than that you would be forced to use
    UTF-32. This isn't the case: UTF-16 can combine pairs of 16-bit code units
    when a single unit is inadequate. The code units used in such pairs are
    called surrogates.
    In most cases a single 16-bit code unit is adequate, because the most
    commonly used characters in the repertoire are placed in what is known as
    the Basic Multilingual Plane (BMP), which is entirely accessible with a
    single 16-bit code unit. To access characters outside the BMP, you combine
    two 16-bit code units into a surrogate pair consisting of a high surrogate
    and a low surrogate. Unicode reserves 1,024 high surrogate values and 1,024
    (non-overlapping) low surrogate values, so surrogate pairs address 1,024 x
    1,024 = 1,048,576 supplementary code points; 32 of these (two per
    supplementary plane) are reserved as noncharacters, leaving 1,048,544
    available for characters. So, UTF-16 is capable of representing the entire
    Unicode character set, given the extensibility that surrogates provide.
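    Here is a brief sketch of the arithmetic (Python; U+1D11E, the musical G
    clef symbol, is simply a convenient character outside the BMP):

        cp = 0x1D11E
        v = cp - 0x10000              # 20-bit offset into the supplementary range
        high = 0xD800 + (v >> 10)     # high surrogate: 0xD834
        low = 0xDC00 + (v & 0x3FF)    # low surrogate:  0xDD1E
        print(hex(high), hex(low))
        print(chr(cp).encode("utf-16-be").hex())   # 'd834dd1e' -- the same pair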
    UTF-8 encoding was designed for applications built around 8-bit text
    processing that need to support Unicode. Because the first 128 characters in
    the Unicode repertoire precisely match those in the ASCII character set,
    UTF-8 affords the opportunity to maintain ASCII compatibility while
    significantly extending its scope.
    UTF-8 is a variable-width encoding form based on byte-sized code units
    (using between 1 and 4 code units per character). You will occasionally run
    across the term octet in relation to Unicode. This is a term defined by the
    ISO/IEC 10646 standard; it is synonymous with the term byte as used in the
    Unicode Standard (an 8-bit byte).
    In UTF-8, the high bits of each byte are reserved to indicate where in the
    code unit sequence that byte belongs. Distinct ranges of 8-bit values are
    reserved for leading bytes and trailing bytes. By sequencing up to four
    bytes per character, UTF-8 is able to represent the entire Unicode character
    repertoire.
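    The pattern is easy to see in the bit layout of a few encoded characters
    (Python; the sample characters are arbitrary and cover 1- through 4-byte
    sequences):

        for ch in ("A", "é", "€", "\U00010348"):
            print(f"U+{ord(ch):04X}:", " ".join(f"{b:08b}" for b in ch.encode("utf-8")))
        # U+0041: 01000001
        # U+00E9: 11000011 10101001
        # U+20AC: 11100010 10000010 10101100
        # U+10348: 11110000 10010000 10001101 10001000
        # (leading bytes announce the length; trailing bytes all begin with 10)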
    You will occasionally see references to an encoding form labeled UTF-7. This
    is a specialized derivative designed to pass safely through systems, such as
    older email gateways, that can handle only 7-bit ASCII data (the 8th bit of
    every byte is always 0, so no data is lost in transit). As such, it is not
    part of the current definition of the Unicode Standard. See Figure 1 for an
    illustration of how these encoding forms represent data.
    Character Encoding Schemes
    Now that we know how the Unicode coded character set and the character
    encoding forms are constructed, let's evaluate the final element of the
    Unicode framework. A character encoding scheme provides reversible
    transformations between sequences of code units and sequences of bytes. In
    other words, the encoding scheme serializes the code units into sequences of
    bytes that can be transmitted and stored as computer data.
    Any data type larger than a byte must be stored as a sequence of bytes, and
    how those bytes are ordered varies depending on a computer's architecture
    and the operating system running on top of it. There are two primary byte
    orders: big-endian (BE, most significant byte first) and little-endian (LE,
    least significant byte first). See Figure 2 for a synopsis of these encoding
    schemes.
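    For example (a Python sketch; the euro sign is an arbitrary BMP character),
    the same 16-bit code unit serializes differently under the two byte orders:

        ch = "€"                             # U+20AC, a single 16-bit code unit
        print(ch.encode("utf-16-be").hex())  # '20ac' -- most significant byte first
        print(ch.encode("utf-16-le").hex())  # 'ac20' -- least significant byte first
        print(ch.encode("utf-32-le").hex())  # 'ac200000'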
    Intel microprocessor architecture generally provides native support for
    little-endian byte order (as do most Intel-compatible systems). Many
    RISC-based processors natively support big-endian byte ordering, and some
    (such as the PowerPC) can operate in either mode.
    UTF-16 and UTF-32 provide encoding schemes for either byte order. The issue
    does not arise with UTF-8, because its code units are individual bytes and
    the sequencing information is carried in the bytes themselves (with bounded
    look ahead). The byte order can be indicated with an internal file
    signature, the byte order mark (BOM) U+FEFF. A BOM is not only unnecessary
    in UTF-8, it also destroys ASCII transparency (yet some development tools
    automatically include a BOM when saving a file using UTF-8 encoding). Figure
    3 illustrates the seven Unicode encoding schemes.
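    A short sketch of the BOM's effect (Python; "utf-8-sig" is simply Python's
    name for UTF-8 with a signature):

        print("A".encode("utf-16").hex())     # 'fffe4100' on a little-endian system
        print("A".encode("utf-16-be").hex())  # '0041' -- explicit order, no BOM
        print("A".encode("utf-8").hex())      # '41'  -- plain ASCII byte
        print("A".encode("utf-8-sig").hex())  # 'efbbbf41' -- BOM breaks ASCII transparency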
    To learn more about character encoding, see the Unicode Technical Report #17
    at http://www.unicode.org/reports/tr17/.
    Choosing an Encoding Solution
    In developing for the Web, most of your choices for Unicode encoding schemes
    will have already been made for you when you select a protocol or
    technology. Yet, you may find instances in which you will have the freedom
    to select which scheme to use for your application (when developing
    customized application, API's, etc.).
    When transferring Unicode data to a Web client, such as a browser, you will
    generally want to use UTF-8, because ASCII compatibility carries a high
    value in the multi-platform world of the Web. Accordingly, HTML and current
    versions of Internet Explorer running on Windows 2000 or later use the UTF-8
    encoding form. If you try to force UTF-16 encoding on IE, you will encounter
    an error or it will fall back to UTF-8 anyway.
    Windows NT and later, as well as SQL Server 7 and 2000, Java, COM, ODBC,
    OLE DB and the .NET Framework, are all built on UTF-16 encoding (and XML
    parsers are required to support UTF-16 alongside UTF-8). For most
    applications, UTF-16 is the ideal solution. It is more efficient than UTF-32
    while generally providing the same scope of character support.
    There are cases where UTF-32 is the preferred choice. If you are developing
    an application that must perform intense processing or complex manipulation
    of byte-level data, the fixed-width characteristic of UTF-32 can be a
    valuable asset. The extra code and processor bandwidth required to
    accommodate a variable-width encoding can outweigh the cost of using 32 bits
    to represent each character.
    In such cases, the internal processing can be done using UTF-32 encoding and
    the results can be transmitted or stored in UTF-16 (since Unicode provides
    lossless transformation between these encoding forms - although there are
    technical considerations that need to be understood before doing so, which
    are beyond the scope of this article).
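    A minimal sketch of that pattern (Python; the sample text is arbitrary):
    process fixed-width UTF-32 internally, then hand off UTF-16 for storage or
    transmission.

        text = "G clef: \U0001D11E"
        utf32 = text.encode("utf-32-be")        # 4 bytes per character
        assert len(utf32) == 4 * len(text)      # fixed width simplifies indexing
        utf16 = utf32.decode("utf-32-be").encode("utf-16-be")
        assert utf16.decode("utf-16-be") == text   # the round trip is lossless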
    For a more detailed explanation of Unicode, see the Unicode Consortium's
    article The Unicode® Standard: A Technical Introduction
    (http://www.unicode.org/standard/principles.html) as well as Chapter 2 of
    the Unicode Consortium's The Unicode Standard, Version 4.0
    (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178).
    If you have specific questions about Unicode, I recommend joining the
    Unicode Public email distribution list at
    http://www.unicode.org/consortium/distlist.html.
    
    



