From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Oct 26 2005 - 12:20:32 CST
On Wed, 26 Oct 2005, Velasquez, Carlos wrote:
> I am new to this list and somewhat new to the Unicode standard.
Welcome. You might find the FAQ at http://www.unicode.org useful, since it
addresses some of the questions you asked. Admittedly, parts of the FAQ are
rather hard reading. I hope you can find a suitable tutorial.
> I am hoping someone can help me understand the difference between ANSI
> and UTF-8 for characters in the domain of x00 and xFF.
First, the abbreviation "ANSI", when used to denote a character code,
is a misnomer. There was once a draft by the American National Standards
Institute. Microsoft created its own version of "Latin 1" and started
calling it "ANSI", but ANSI never approved it. The Microsoft code
commonly called "ANSI" is properly called "windows-1252" (the official
MIME encoding name) or "Windows Latin 1" (a common descriptive name).
Second, UTF-8 is one of the encodings you can use for Unicode (and often
the best choice). It might be confusing to compare it with windows-1252.
_Logically_, Unicode assigns a unique number to each character, and this
number can be physically represented in different ways - in UTF-8, you use
one to four bytes (octets) per character. For the first 256 code positions
(0 to FF in hexadecimal), these numbers are the same as in ISO-8859-1, also
known as ISO Latin 1.
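To make the distinction between a character's number and its bytes concrete,
here is a small sketch in Python (my choice for illustration only; nothing in
the question depends on it). The sample characters and the variable names are
arbitrary.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "A": "A",               # U+0041, an ASCII letter
    "e-acute": "\u00e9",    # U+00E9
    "euro": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
}

utf8_lengths = {}
for label, ch in samples.items():
    encoded = ch.encode("utf-8")
    utf8_lengths[label] = len(encoded)
    # ord() gives the Unicode number; encode() gives its physical bytes.
    print(f"{label}: U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# Every ISO-8859-1 byte value decodes to the code point with the same number.
latin1_matches = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(latin1_matches)
```

The number stays the same whatever encoding you pick; only the byte sequence
that carries it changes.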
The difference between ISO Latin 1 and Windows Latin 1 is that characters
in code positions 128 to 159 decimal (80 to 9F in hexadecimal) are
reserved for control characters in the former but assigned (in part) to
printable characters (mostly punctuation) in the latter.
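You can list the differing positions yourself. The following is a sketch
using Python's built-in "latin-1" and "cp1252" codecs, which I am assuming
are acceptable stand-ins for ISO Latin 1 and Windows Latin 1; which byte
values come out as unassigned depends on that codec's mapping table.

```python
import unicodedata

printable = {}   # byte value -> character assigned by windows-1252
unassigned = []  # byte values that Python's cp1252 codec leaves undefined

for b in range(0x80, 0xA0):
    raw = bytes([b])
    # ISO-8859-1 maps the byte to the control character with the same number.
    assert unicodedata.category(raw.decode("latin-1")) == "Cc"
    try:
        printable[b] = raw.decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(b)

for b, ch in printable.items():
    print(f"0x{b:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
print("unassigned:", [hex(b) for b in unassigned])

# Outside 0x80..0x9F the two codes agree byte for byte.
others = [b for b in range(256) if not 0x80 <= b <= 0x9F]
same_elsewhere = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1") for b in others
)
print(same_elsewhere)
```

The printed list is mostly quotation marks, dashes and similar punctuation,
plus a few letters and the euro sign.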
> Are the 7 bit ASCII characters a subset of the 8 bit ANSI character?
The ASCII characters have the same numbers in ASCII, ISO Latin 1, Windows
Latin 1 ("ANSI"), and Unicode.
> I understand that the 7 bit ASCII characters are definitely a subset of
> the UTF-8 set but am not sure if ANSI is a subset of UTF-8.
UTF-8 is an encoding, not a character repertoire, so it sits at a different
conceptual level and "subset" does not strictly apply. However, ASCII
characters are represented "as such" as 8-bit bytes (with first bit zero)
in UTF-8, whereas "ANSI" characters outside ASCII have a completely
different representation (each of them occupies two or three bytes).
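Both halves of that statement can be checked mechanically. A minimal sketch
in Python, again assuming its "cp1252" codec as the stand-in for "ANSI"; the
sample string is arbitrary.

```python
ascii_text = "Murilo"
ascii_utf8 = ascii_text.encode("utf-8")

# ASCII text yields identical bytes in ASCII and in UTF-8, all below 0x80.
same_as_ascii = ascii_utf8 == ascii_text.encode("ascii")
high_bit_clear = all(byte < 0x80 for byte in ascii_utf8)

# Collect the UTF-8 length of every non-ASCII windows-1252 character.
lengths = set()
for b in range(0x80, 0x100):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # byte value with no character assigned in this codec
    encoded = ch.encode("utf-8")
    # Every byte of a multi-byte UTF-8 sequence has its first bit set.
    assert all(byte >= 0x80 for byte in encoded)
    lengths.add(len(encoded))

print(same_as_ascii, high_bit_clear, sorted(lengths))
```

So a file containing only ASCII characters is byte-for-byte the same in
"ANSI" and UTF-8, and any other character makes the two files differ.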
> Here is why I ask:
> Our database contains name information for a Spanish population. As
> such, we store names such as "Sérgio Murilo" in our database which is
> set to Unicode UTF-8. However, when we generate files and specify the
> file encoding to be ANSI, we get the character "é" in double byte (xC3
> and xA9).
That's to be expected. The letter "é" (e with acute accent) is outside
the ASCII range, and it is represented as two octets in UTF-8. If you view
a UTF-8 encoded document in a program that interprets its input as "ANSI",
you will see two characters, "Ã©", in place of "é".
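The effect is easy to reproduce. The sketch below uses Python purely for
illustration; I do not know what software generates your files, so the
variable names and the final "repair" step are my own assumptions about
what you might want to check, not a description of your system.

```python
name = "S\u00e9rgio Murilo"  # the name from the question; \u00e9 is e-acute

# What a UTF-8 database hands out: e-acute becomes the two octets C3 A9.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes.hex())

# What an "ANSI" (windows-1252) viewer makes of those same octets.
misread = utf8_bytes.decode("cp1252")
print(ascii(misread))

# What a genuinely windows-1252 file would contain: one octet, E9.
ansi_bytes = name.encode("cp1252")
print(ansi_bytes.hex())

# The misreading is reversible as long as no octet was lost on the way.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == name)
```

If the generated file shows C3 A9 where you expected E9, the data was
written as UTF-8 and was never converted to windows-1252.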
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/