Re: Playing with Unicode (was: Re: UTF-17)

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Mon Jun 25 2001 - 10:28:06 EDT


* Marco Cimarosti
|
| 1) UTF-8, UTF-16 and UTF-32 are the only three real EXISTING Unicode
| Transformation Formats. They are official and part of the Unicode standard.

* Elliotte Rusty Harold
|
| What about ISO-10646-UCS-2 and ISO-10646-UCS-4 as used in XML? Where
| do they fit in? Are they only part of ISO-10646 and not Unicode? or
| are they identical to UTF-16 and UTF-32? or something else?

UCS-2 and UCS-4 are defined in ISO 10646 and so are not formally part
of Unicode, although the character set they encode, that of ISO 10646,
is identical to Unicode's.

This UTR has some more useful information on this:
  <URL: http://www.unicode.org/unicode/reports/tr19/#10646 >

UCS-4 and UTF-32 are basically the same, except that one is an ISO
10646 encoding and the other a Unicode encoding. This means that the
expectations about which characters may appear in a file differ
slightly between the two, but in practice you can regard them as
being the same encoding.
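
To make the "basically the same" point concrete, here is a minimal
Python sketch. It assumes the difference in expectations is mainly one
of range: UTF-32 is limited to U+0000..U+10FFFF, whereas UCS-4 was
defined over a larger 31-bit space. The function name encode_utf32_be
is made up for illustration, and big-endian order is an arbitrary
choice.

```python
import struct

UNICODE_MAX = 0x10FFFF  # highest code point UTF-32 may carry


def encode_utf32_be(cp: int) -> bytes:
    """Encode one code point as a single big-endian 32-bit unit.

    Hypothetical helper: the same four bytes would also be the UCS-4
    form, since both simply store the code point number.
    """
    if not 0 <= cp <= UNICODE_MAX:
        raise ValueError("outside the range UTF-32 allows")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    return struct.pack(">I", cp)


# The result agrees with Python's built-in codec.
sample = encode_utf32_be(0x10400)
```

Within the Unicode range the two encodings produce identical bytes,
which is why treating them as one encoding works in practice.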

UCS-2 and UTF-16 are not the same, however. UCS-2 is basically UTF-16
without surrogate support, so it can only encode the characters in the
BMP, and nothing above U+FFFF. Many systems which purport to support
UTF-16 actually only implement UCS-2.
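
The surrogate mechanism can be sketched in a few lines of Python. This
is only an illustration, not anybody's reference implementation; the
helper names to_surrogates, utf16_code_units and fits_in_ucs2 are
invented here.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000                 # 20-bit offset into the supplementary range
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low


def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units Python's UTF-16 codec emits for ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fits_in_ucs2(text: str) -> bool:
    """True if every character lies in the BMP, i.e. UCS-2 can hold it."""
    return all(ord(c) <= 0xFFFF for c in text)


# U+10400 needs two code units in UTF-16 and cannot be stored in UCS-2.
pair = to_surrogates(0x10400)
units = utf16_code_units("\U00010400")
```

A system that only implements UCS-2 would treat the two code units as
two separate (and meaningless) characters instead of one.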

--Lars M.


