Re: Encoding designation in Java Script sites

From: Addison Phillips [GSC] (addison@globalsight.com)
Date: Tue Apr 11 2000 - 17:52:04 EDT


Both XML and DOM are UTF-16-centric, for some pretty good implementation
reasons, which is why I pointed out the use of UTF-16 internally in the
parsers. I suggest UTF-8 as a storage medium for a variety of reasons,
mostly to do with differences in client and file-server processor
architecture (byte order), database support for Unicode, and other
file-centric reasons. The parser will decode UTF-8 (or any encoding that it
supports and that you declare) into its internal format, usually UTF-16 or
UCS-4.
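
To make the storage-versus-internal distinction concrete, here is a minimal
Python sketch. It is only an illustration, not how any particular XML parser
is written: the sample string and all variable names are made up for this
example. It encodes a string as UTF-8 (the "on disk" form), decodes it again
as a parser would, and then lists the UTF-16 code units an internal
representation would hold.

```python
# Illustrative sketch only: the sample text and variable names are assumptions.
# 'é' (U+00E9) is in the BMP; U+1D11E (musical G clef) lies outside it.
text = "caf\u00e9 \U0001D11E"

# UTF-8 bytes, as they would be stored in a file.
stored = text.encode("utf-8")

# What a parser does on input: decode the declared encoding.
decoded = stored.decode("utf-8")

# Re-express the decoded text as UTF-16 code units (big-endian, no BOM).
utf16 = decoded.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]

utf8_byte_count = len(stored)       # bytes on disk
utf16_unit_count = len(code_units)  # units an offset in UTF-16 terms counts
code_point_count = len(decoded)     # Python strings count code points

print(utf8_byte_count, utf16_unit_count, code_point_count)
print([f"{u:04X}" for u in code_units])
```

The three counts differ because the character outside the BMP takes four
bytes in UTF-8 and a surrogate pair (two code units) in UTF-16, which is why
it matters what unit text offsets are defined in.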

As for the BOM: yes, it is U+FEFF. It was my morning for typos.
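
For what it is worth, the mix-up is an easy one to make, because the two
values are byte swaps of each other. The sketch below shows the byte
sequences for U+FEFF in a few encodings. The helper
`sniff_utf16_byte_order` is a hypothetical name made up for this example,
and it deliberately ignores UTF-32, whose little-endian BOM begins with the
same two bytes.

```python
import codecs

BOM = "\ufeff"  # the byte order mark is the character U+FEFF

be_bytes = BOM.encode("utf-16-be")   # FE FF
le_bytes = BOM.encode("utf-16-le")   # FF FE
utf8_bytes = BOM.encode("utf-8")     # EF BB BF


def sniff_utf16_byte_order(data: bytes) -> str:
    """Hypothetical helper: guess UTF-16 byte order from a leading BOM.

    Assumes the data is UTF-16 if it has a BOM at all; UTF-32 LE data
    also starts with FF FE and is not distinguished here.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


print(be_bytes.hex(), le_bytes.hex(), utf8_bytes.hex())
print(sniff_utf16_byte_order(BOM.encode("utf-16-le") + "xml".encode("utf-16-le")))
```

A reader that sees FF FE at the start of a UTF-16 file is looking at U+FEFF
in little-endian order. U+FFFE itself is a noncharacter, which is what makes
the byte order detectable.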

Addison
----- Original Message -----
From: Markus Scherer <markus.scherer@jtcsv.com>
To: Unicode List <unicode@unicode.org>
Sent: Tuesday, April 11, 2000 2:03 PM
Subject: Re: Encoding designation in Java Script sites

> "Addison Phillips [GSC]" wrote:
> > what "XML is in Unicode" *means* in terms of actual disk file encoding or
> > internal parsing... it turns out that most parsers use UCS-4 or UTF-16 in
> > their rendering engine and smart implementers use UTF-8 when storing the
> > actual XML files on disk. Yes, you have to declare the encoding for UTF-8.
> > Byte Order Marks--0xFFFE--are the order of the day for UTF-16 files].
> >
>
> the byte order mark is U+feff.
>
> i believe that the xml (or dom?) specification also makes xml
> utf-16-centric: utf-8 is one of the two default encodings (utf-8 & utf-16),
> but text offsets are defined in terms of utf-16 code units, as far as i
> know. i would expect most parsers to use utf-16 internally.
>
> markus


