RE: Unicode, UTF-8 and Extended 8-Bit Ascii - Help Needed

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Tue Jul 10 2001 - 11:36:06 EDT


Hi Stephen,

The short answer to your question is "no". The characters between U+0080 and
U+00FF *are* supported by UTF-8 (all Unicode characters are supported by
UTF-8), and they keep the same code point numbers they have in Latin-1, but
UTF-8 does not encode them as the same single byte values. If it did, UTF-8
would *be* Latin-1 and there would be no way to represent the other 1.1
million potential code points in Unicode ;-)

UTF-8 is (7-bit) ASCII compatible, so an ASCII character is itself in UTF-8.
However, all other characters in UTF-8 are represented by a two-, three-, or
four-byte sequence. So the Latin-1 characters in Unicode (U+0080 through
U+00FF) are all represented by two-byte sequences.
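
To make the byte-level difference concrete, here is a minimal Python sketch
(Python is only a choice for illustration; the helper name encoded_forms is
hypothetical). It prints the Latin-1 and UTF-8 byte sequences for an ASCII
character, two Latin-1 characters, and one character outside Latin-1.

```python
# Compare how the same character is stored in Latin-1 and in UTF-8.
# "encoded_forms" is a hypothetical helper name used only for this sketch.

def encoded_forms(ch):
    """Return (code point, Latin-1 bytes or None, UTF-8 bytes) for one character."""
    try:
        latin1 = ch.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None  # the character is outside the Latin-1 repertoire
    return ord(ch), latin1, ch.encode("utf-8")

# "A" (ASCII), e-acute and y-diaeresis (Latin-1), euro sign (not in Latin-1)
samples = ["A", "\u00e9", "\u00ff", "\u20ac"]

for ch in samples:
    cp, latin1, utf8 = encoded_forms(ch)
    print(f"U+{cp:04X}  latin-1={latin1!r}  utf-8={utf8!r}")
```

The code point number is the same in both cases; only the bytes on the wire
differ once you go past U+007F.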

Now, you might notice two things about your problem.

First: if you pass UTF-8 through a system that expects Latin-1 (and which
will tolerate characters in the C1 control range between 0x80 and 0x9F), you
can usually pass the UTF-8 through and recover it on the "far end". In fact,
this was one of the original design goals of UTF-8.
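
A rough sketch of that first case, assuming the intermediate system is 8-bit
clean and simply maps each byte to the Latin-1 character with the same value
(the function through_latin1_channel is a hypothetical stand-in for such a
system, not a model of any particular product):

```python
# Simulate passing UTF-8 bytes through a system that believes it is handling
# Latin-1. "through_latin1_channel" is a hypothetical stand-in for that system.

def through_latin1_channel(raw: bytes) -> bytes:
    """Treat every byte as one Latin-1 character, then write it back out."""
    as_text = raw.decode("latin-1")   # never fails: all 256 byte values are mapped
    return as_text.encode("latin-1")  # yields exactly the bytes that went in

original = "Gr\u00fc\u00dfe, \u20ac5"   # contains Latin-1 and non-Latin-1 characters
wire = original.encode("utf-8")          # what the sender actually transmits
received = through_latin1_channel(wire)  # what arrives at the "far end"
recovered = received.decode("utf-8")     # the receiver reinterprets it as UTF-8

print(received == wire, recovered == original)
```

This only works if the system in the middle leaves the bytes alone. Anything
that strips the high bit, drops C1 controls, or transcodes will break it.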

Second: the reverse is *not* true. It is extremely unlikely, due to the very
specific patterning in UTF-8, that a UTF-8 system will pass Latin-1 cleanly.
This is the situation that you describe below.
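
The reverse case can be sketched the same way. The helper is_valid_utf8 below
is a hypothetical name; it just asks Python's strict UTF-8 decoder whether a
byte string is well formed.

```python
# Check whether Latin-1 encoded text happens to be well-formed UTF-8.
# "is_valid_utf8" is a hypothetical helper name used only for this sketch.

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Ordinary Latin-1 text: 0xE9 looks like the start of a three-byte UTF-8
# sequence, but no continuation bytes follow it.
typical = "caf\u00e9".encode("latin-1")

# A contrived exception: the Latin-1 characters U+00C3 U+00A9 happen to form
# the valid UTF-8 sequence for U+00E9, so the text is silently misread.
accidental = "\u00c3\u00a9".encode("latin-1")

print(is_valid_utf8(typical))
print(is_valid_utf8(accidental), accidental.decode("utf-8") == "\u00e9")
```

So a UTF-8 system will normally either reject the Latin-1 bytes or, in the
rare accidental case, interpret them as different characters.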

Luckily, you can probably still pass your EDIFACT documents successfully
even though your mailer is being converted to use UTF-8 as a default. That's
because your EDI documents are likely to be file attachments and each one
can have its own Content-Type header. Your mailer will be applying a
Content-Transfer-Encoding (think "base64") to your document to make it 7-bit
clean anyway. As long as your code labels the content with its correct
encoding (by calling the API correctly) you can transfer the document
successfully using Latin-1 even though the message body of a message you
compose would normally be in UTF-8. See RFC 2045 and RFC 2046 (which
obsolete RFC 1341) for the details on how such stuff is labeled.
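
As an illustration only, here is how that labeling might look with Python's
standard email package. Everything specific in it is an assumption made for
the sketch: the function names, the file name, the use of the
application/EDIFACT media type with a charset parameter, and the sample
segment text (which is not a valid interchange). Your mailer's API will
differ.

```python
# Sketch: a UTF-8 message body carrying a Latin-1 encoded attachment.
# Function names, file name, and sample data are hypothetical.
from email import message_from_bytes, policy
from email.message import EmailMessage

def build_message(edi_text: str) -> EmailMessage:
    """Compose a message whose body is UTF-8 and whose attachment is Latin-1."""
    msg = EmailMessage()
    msg["Subject"] = "EDIFACT interchange"
    msg.set_content("Gr\u00fc\u00dfe, interchange attached.", charset="utf-8")
    msg.add_attachment(
        edi_text.encode("iso-8859-1"),       # the bytes the EDI partner expects
        maintype="application",
        subtype="EDIFACT",
        filename="interchange.edi",
        params={"charset": "iso-8859-1"},    # label the attachment's own encoding
    )  # bytes attachments are base64 encoded by default
    return msg

def extract_edi(raw: bytes) -> str:
    """Parse a serialized message and decode the attachment using its label."""
    parsed = message_from_bytes(raw, policy=policy.default)
    part = next(parsed.iter_attachments())
    return part.get_content().decode(part.get_content_charset())

# Illustrative fragment only, not a complete or valid interchange.
sample = "UNB+UNOC:3+SENDER+RECEIVER'NAD+BY+++Caf\u00e9 M\u00fcller'"
wire = build_message(sample).as_bytes()
print(extract_edi(wire) == sample)
```

The point is that the body and the attachment each carry their own label, so
the receiver never has to guess which encoding the EDI bytes are in.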

Best Regards,

Addison

===============================================================
Addison P. Phillips Manager, Globalization Engineering
webMethods, Inc. Globalization Architect
+1.408.962.5487 (tel.) mailto:aphillips@webmethods.com
+1 408.210.3569 (mobile) +1 408.962.5329 (fax)
===============================================================
"Internationalization is not a feature. It is an architecture."

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Stephen Cowe - Sun Scotland
Sent: Tuesday, July 10, 2001 3:53 AM
To: unicode@unicode.org
Subject: Unicode, UTF-8 and Extended 8-Bit Ascii - Help Needed

Hi Unicoders,

I am new to the list and would be really grateful if you could help me out
here.

I am trying to discover if the "extended latin" 8-bit ascii characters
(decimal values 128-255, hex 80-FF), i.e. ISO-8859-1, are supported by UTF-8,
and if so, whether the values are the same.

The reason I am asking is that our EDIFACT EDI system needs to send
extended latin European characters (using the UNOC version 3 syntax
identifier) and our global internal messaging system is being converted
to UTF-8.

I have had a good search of the Unicode web-site but do not seem to be able
to find the answer, yes or no, that I require.

I look forward to hearing from you, kind regards,

Stephen Cowe.

eCommerce Technologist
GSO IT EDI/EDE
+44 (0)1506 672541 (Tel)
+44 (0)1506 672893 (Fax)
stephen.cowe@sun.com


