From: Doug Ewell (dewell@roadrunner.com)
Date: Wed May 28 2008 - 22:06:09 CDT
Peter Johansson wrote:
> Is the Unicode-encoded character string self-descriptive? That is, do
> I need a priori knowledge that it is encoded as, for example, UTF-8
> rather than UTF-32? Or, by examining the first byte (or first few
> bytes) can I determine the encoding?
The approach taken in Appendix A of the XML specification
("Autodetection of Character Encodings") might be of interest:
http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing
An XML parser does have the distinct advantage in this case of knowing
what the first few "real" characters are supposed to be. The problem is
harder to solve for arbitrary text, but not unreasonably so, and in any
case most text isn't completely arbitrary.
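
As a rough illustration of the idea behind that appendix, here is a minimal Python sketch. It is not the normative algorithm: the function name guess_encoding and both lookup tables are made up for this example, and it covers only the Unicode encoding forms, not the EBCDIC or unusual-byte-order cases the appendix also lists. It checks for a byte order mark first, then falls back on how the opening "<?" of an XML declaration would look in each encoding.

```python
import codecs

# Byte order marks. The UTF-32 entries come first because the UTF-32 LE
# BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# How the start of "<?xml" looks without a BOM in each encoding form.
_XML_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    # ASCII-compatible: the encoding declaration may still name
    # something other than UTF-8 (for example ISO 8859-1).
    (b"<?xm", "utf-8"),
]


def guess_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from the first few bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _XML_PREFIXES:
        if data.startswith(prefix):
            return name
    # Nothing recognizable: fall back to the caller's assumption.
    return default
```

For example, guess_encoding("<?xml".encode("utf-16-le")) returns "utf-16-le". One known ambiguity: UTF-16 LE text that starts with a BOM followed by U+0000 is indistinguishable from UTF-32 LE with a BOM, and this sketch resolves it in favor of UTF-32.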
> I didn't see anything on this topic in the FAQ.
That does surprise me, considering the great deal of related information
on the "UTF-8, UTF-16, UTF-32 & BOM" page.
--
Doug Ewell * Arvada, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages