Re: Encoding designation in Java Script sites

From: Lars Marius Garshol (
Date: Wed Apr 12 2000 - 04:11:38 EDT

* Markus Scherer
| i believe that the xml (or dom?) specification also makes xml
| utf-16-centric: utf-8 is one of the two default encodings (utf-8 &
| utf-16), but text offsets are defined in terms of utf-16 code units,
| as far as i know. i would expect most parsers to use utf-16
| internally.

There is nothing inherently UTF-16-centric about XML, since there are
no text offsets or anything like them in XML itself. Parsers do have to

 - convert strings like '&#x4132;' to the actual character and

 - for each character in the document verify that it is within the
   allowed character ranges

However, I wouldn't really call this being UTF-16-centric.
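
To make those two duties concrete, here is a minimal sketch in Python.
It is not taken from any real parser: the names is_xml_char and
resolve_char_refs are made up for illustration, and the ranges are the
Char production of XML 1.0 as I read it. Note that everything is done
on code points, with no UTF-16 in sight.

```python
import re

def is_xml_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Matches hexadecimal (&#x4132;) and decimal (&#16690;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def resolve_char_refs(text: str) -> str:
    """Replace each character reference with the character it names.

    Hypothetical helper: raises ValueError if a reference names a code
    point outside the allowed ranges.
    """
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            cp = int(hex_digits, 16)
        else:
            cp = int(dec_digits)
        if not is_xml_char(cp):
            raise ValueError("illegal character reference: " + match.group(0))
        return chr(cp)

    return _CHAR_REF.sub(replace, text)
```

A real parser also has to deal with entity references such as '&amp;'
and with the encoding declaration; this sketch leaves both out.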

The DOM specification, OTOH, does explicitly specify that UTF-16
should be used internally. The CharacterData interface does use
offsets, so here there is a clear UTF-16 bias. DOM Level 1 doesn't
clearly specify how to interpret these offsets, but Level 2 adds text
to the effect that they refer to 16-bit units rather than
characters.
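
The difference only shows up for characters outside the BMP, which
take two 16-bit units in UTF-16. A rough sketch in Python, whose
strings are indexed by code point, so the mismatch is easy to see.
The function names are my own and are not part of the DOM; this
illustrates the counting only, not any DOM method.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit units needed to hold text in UTF-16."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)

def utf16_offset_to_index(text: str, offset: int) -> int:
    """Map an offset counted in 16-bit units to a code point index.

    Hypothetical helper: raises ValueError if the offset is out of
    range or falls in the middle of a surrogate pair.
    """
    if offset < 0:
        raise ValueError("offset out of range")
    units = 0
    for index, ch in enumerate(text):
        if units == offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > offset:
            raise ValueError("offset splits a surrogate pair")
    if units == offset:
        return len(text)
    raise ValueError("offset out of range")

# U+10400 lies outside the BMP, so it counts as one character
# but as two 16-bit units.
sample = "a\U00010400b"
print(len(sample), utf16_length(sample))
```

With this sample, the 'b' sits at character index 2 but at 16-bit
offset 3, which is the kind of discrepancy the Level 2 wording settles.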

--Lars M.

