From: John Tisdale (jtisdale@ocean.org)
Date: Thu Aug 05 2004 - 15:52:07 CDT
I'm in the early stages of writing an article for Microsoft for publication
on Developing Multilingual Web Sites. I want to include a brief overview of
Unicode. I must say that I read a lot of contradictory information on the
topic online from various sources. I've done my best to differentiate fact
from fiction so that I can provide readers with an accurate introduction to
the topic.
I would really appreciate some review of the following segment of the
article (in first draft form) for accuracy. Any technical corrections or
general enhancements that anyone may wish to offer would be much
appreciated. Please be gentle in dispensing criticism as this is just a
starting point.
Feel free to respond directly to me at jtisdale@ocean.org (as this topic
probably doesn't warrant group discussion and bandwidth).
Thanks very much, John
Unicode Fundamentals
For our discussion, there are two fundamental terms with which you must be
familiar. First, a character repertoire is an organized collection of
characters, and a coded character set assigns a numeric value (a code point)
to each character in that repertoire (the looser term character set is in
common use for both). Second, an encoding scheme is a system for representing
those code points as sequences of bytes in a computing environment.
Distinguishing between these terms is crucial to understanding how to
leverage the benefits of Unicode.
Before Unicode, the majority of character sets contained only those
characters needed by a single language or a small group of associated
languages (such as ISO 8859-2, which contains characters used in various
Central and Eastern European languages). The popularization of the Internet
elevated the need
for a more universal character set.
In 1989, the International Organization for Standardization (ISO) published
the first draft of a character set standard that supported a broad range of
languages. It was called the ISO/IEC 10646 standard or the Universal
Multiple-Octet Coded Character Set (commonly referred to as the Universal
Character Set or UCS).
Around the same time, a group of manufacturers in the U.S. formed the
Unicode Consortium with a similar goal of creating a broad multilingual
character set standard. The result of their work was the Unicode Standard.
Since the early releases of these two standards, both
groups have worked together closely to ensure compatibility between their
standards. For details on the development of these standards, see
http://www.unicode.org/versions/Unicode4.0.0/appC.pdf.
In most cases, when someone refers to Unicode they are discussing the
collective offerings of these two standards bodies (whether they realize it
or not). Technically this isn't accurate, but it certainly simplifies the
discussion. In this article, I will sometimes use the term Unicode in a
generic manner to refer to these collective standards (with apologies to
those offended by this generalization) and when applicable I will make
distinctions between them (referring to the Unicode Standard as Unicode and
the ISO/IEC 10646 standard as UCS).
On Character Sets and Encoding Schemes
First, you should recognize that both of these standards separate the
character repertoire from the encoding scheme. Many people blur this
distinction and describe Unicode as a 16-bit character set, yet in neither
standard is this accurate. The number of bits used is a property not of the
character set but of the encoding scheme. Character sets are defined in terms
of code points (numeric values, conventionally written in hexadecimal), not
bits and bytes. So, to say that Unicode is represented by any particular
number of bits is not correct. If you want to talk about bits and bytes, you
need to talk about encoding schemes.
Each character in Unicode is represented by a code point. It is usually
written as U+ followed by a hexadecimal number identifying the character in
the Unicode character repertoire. For example, the English uppercase letter A
is represented as U+0041.
One way of encoding this character would be with UTF-8, which encodes it as
the single byte 0x41. Encoding this same character using UCS-2 produces the
two bytes 0x00, 0x41 (in big-endian order). You can run the Windows charmap
utility (if you are running Windows 2000, XP or 2003) to see how characters
are mapped in Unicode on your system.
Basically, the Unicode Standard and the UCS character repertoires are the
same (for practical purposes). Whenever one group publishes a new version of
their standard, the other eventually releases a corresponding one. For
example the Unicode Standard, Version 4.0 is the same as ISO/IEC 10646:2003.
Hence, the code points are synchronized between these two standards.
So, the differences between these two standards are not with the character
sets themselves, but with the standards they offer for encoding and
processing the characters contained therein. Both standards provide multiple
encoding schemes (each with its own characteristics). A term frequently
used in encoding scheme definitions is an octet. This term describes an
8-bit byte.
UCS provides two encoding schemes. UCS-2 uses two octets (or 16 bits) and
UCS-4 uses four octets (or 32 bits) to encode characters. Unicode has three
primary encoding schemes: UTF-8, UTF-16 and UTF-32. UTF stands for Unicode
(or UCS) Transformation Format. Although you will occasionally see references
to UTF-7, it is a specialized derivative that keeps its entire output within
the 7-bit ASCII range for environments, such as older email systems, that
cannot handle non-ASCII data. As such, it is not part of the current
definition of the Unicode Standard.
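As a quick illustration (the sample character is arbitrary, and Python is
used only because its standard codec library happens to include utf-7), the
euro sign comes out entirely as ASCII characters under UTF-7, whereas UTF-8
produces raw bytes above 0x7F:

    print("\u20ac".encode("utf-7"))   # b'+IKw-'         -> ASCII-only escape sequence
    print("\u20ac".encode("utf-8"))   # b'\xe2\x82\xac'  -> three octets outside the ASCII range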
One of the differences between the Unicode and UCS encoding schemes is that
the former includes variable-width schemes (UTF-8 and UTF-16) and the latter
does not: UCS-2 always uses 2 bytes per character and UCS-4 always uses 4.
Based on the naming convention, some people assume that UTF-8 is a
single-byte encoding scheme, but this isn't the case: UTF-8 uses variable
lengths of 1 to 4 octets per character. Similarly, UTF-16 encodes characters
in either 2-octet or 4-octet sequences, while UTF-32 always uses four octets.
Also, you should be aware that the byte order can differ with UCS-2, UTF-16,
UCS-4 and UTF-32. The two variations are known as big-endian (BE), in which
the most significant byte comes first, and little-endian (LE), in which the
least significant byte comes first. See Figure 1 for a synopsis of these
encoding schemes.
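For example, here is the euro sign (U+20AC) in both byte orders, again
sketched in Python; a byte order mark (BOM, U+FEFF) is often placed at the
start of a stream so the receiver can tell which order was used:

    euro = "\u20ac"
    print(euro.encode("utf-16-be").hex())      # 20ac -> most significant byte first
    print(euro.encode("utf-16-le").hex())      # ac20 -> least significant byte first
    print("\ufeff".encode("utf-16-be").hex())  # feff -> BOM as it appears in big-endian data
    print("\ufeff".encode("utf-16-le").hex())  # fffe -> BOM as it appears in little-endian data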
Choosing a Unicode Encoding Scheme
In developing for the Web, most of your choices for Unicode encoding schemes
will have already been made for you when you select a protocol or
technology. Yet, you may find instances in which you will have the freedom
to select which scheme to use for your application (especially in customized
desktop applications). In such cases, there are several dynamics that will
influence your decision.
First, it should be stated that there isn't necessarily a right or wrong
choice when it comes to a Unicode encoding scheme. In general, when choosing
between the UCS and Unicode standards, the former tends to define its
encoding forms in more general terms, whereas the Unicode Standard adds more
granular and precise conformance requirements. So, which you choose may
depend upon how much precise definition you want versus how much freedom you
want in tailoring the standard to your application.
The variable-length capability of UTF-8 may give you greater flexibility in
your application (to vary the number of octets used per character as needed).
Yet, if you are designing an application that needs to parse Unicode at the
byte level, the variable length of UTF-8 will require much more complex
algorithms than the fixed-length encoding schemes of UCS (granted, you could
use the fixed-length UTF-32, but if you don't need to encode characters
beyond the first 65,536 code points, you would be using twice as much space
as the fixed-length UCS-2 scheme).
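To give a feel for that extra work, here is a rough Python sketch of the
lead-byte inspection a UTF-8 byte-level parser must perform, whereas a
fixed-width scheme such as UCS-2 can simply step two octets at a time (error
handling is omitted; this is illustrative, not production code):

    def utf8_sequence_length(lead_byte):
        # How many octets a UTF-8 sequence occupies, judged from its first byte
        if lead_byte < 0x80:
            return 1      # 0xxxxxxx: ASCII, one octet
        if lead_byte >= 0xF0:
            return 4      # 11110xxx: four octets
        if lead_byte >= 0xE0:
            return 3      # 1110xxxx: three octets
        if lead_byte >= 0xC0:
            return 2      # 110xxxxx: two octets
        raise ValueError("continuation byte cannot start a sequence")

    data = "A\u00e9\u20ac".encode("utf-8")   # mixed 1-, 2- and 3-octet characters
    i = count = 0
    while i < len(data):
        i += utf8_sequence_length(data[i])
        count += 1
    print(count, "characters in", len(data), "bytes")   # 3 characters in 6 bytes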
Because UTF-8 encodes the first 128 Unicode code points, which match ASCII
exactly, as single bytes with the same values, UTF-8 affords you Unicode and
ASCII compatibility at the same time (talk about having your cake and eating
it too). So, for cases in which maintaining ASCII compatibility is highly
valued, UTF-8 is an obvious choice. This is one of the primary reasons that
Active Server Pages and Internet Explorer use UTF-8 encoding for Unicode.
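A one-line check in Python illustrates the point (the sample string is
arbitrary): text restricted to the first 128 characters produces exactly the
same bytes whether it is encoded as ASCII or as UTF-8.

    s = "Hello, world!"
    print(s.encode("ascii") == s.encode("utf-8"))   # True: identical byte sequences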
Yet, if you are working with an application that must parse and manipulate
text at the byte level, the cost of variable-length encoding will probably
outweigh the benefits of ASCII compatibility. In such a case a fixed-length
scheme such as UCS-2 will usually prove the better choice. This is why
Windows NT and subsequent Microsoft operating systems, SQL Server 7 and
later, Java, COM, ODBC, OLE DB and the .NET Framework all use a 16-bit
(UCS-2, and more recently UTF-16) encoding internally. The uniform length
provides a good foundation when it comes to complex data manipulation.
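The appeal of a fixed width is easy to demonstrate: with exactly two octets
per character (for text drawn from the first 65,536 code points), the Nth
character always begins at byte offset 2 * N, so random access needs no
scanning. A small Python sketch, illustrative only:

    s = "data"
    encoded = s.encode("utf-16-be")   # fixed two octets per character
    n = 2                             # index of the third character
    start = 2 * n                     # byte offset known without scanning
    print(encoded[start:start + 2].decode("utf-16-be"))   # t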
If, on the other hand, you are creating an application that needs characters
beyond the first 65,536 code points (the Basic Multilingual Plane), such as
the rarer CJK ideographs assigned in the supplementary planes, UCS-2 will not
suffice; you will need UTF-8, UTF-16, UTF-32 or UCS-4.
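For instance, the CJK ideograph at code point U+20000 (from CJK Unified
Ideographs Extension B) lies above U+FFFF; a quick Python sketch shows how
the different schemes accommodate it:

    ch = "\U00020000"                      # an ideograph beyond the Basic Multilingual Plane
    print(ch.encode("utf-8").hex())        # f0a08080 -> four octets
    print(ch.encode("utf-16-be").hex())    # d840dc00 -> a surrogate pair (two 16-bit units)
    print(ch.encode("utf-32-be").hex())    # 00020000 -> one fixed four-octet unit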
There are other technical differences between these standards that you may
want to consider that are beyond the scope of this article (such as how
UTF-16 supports surrogate pairs but UCS-2 does not). For a more detailed
explanation of Unicode, see the Unicode Consortium's article The Unicode
Standard: A Technical Introduction
(http://www.unicode.org/standard/principles.html) as well as Chapter 2 of
the Unicode Consortium's The Unicode Standard, Version 4.0
(http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178).
The separation that Unicode provides between the character set and the
encoding scheme allows you to choose the smallest and most appropriate
encoding scheme for referencing all of the characters you need for a given
application (thus providing considerable power and flexibility). Unicode is
an evolving standard that continues to be tweaked and elaborated upon.