Re: Names for UTF-8 with and without BOM

From: John Cowan (jcowan@reutershealth.com)
Date: Sat Nov 02 2002 - 19:45:47 EST

Next message: Stefan Persson: "Re: Header Reply-To"

Previous message: Tex Texin: "Re: Names for UTF-8 with and without BOM"
In reply to: Tex Texin: "Re: Names for UTF-8 with and without BOM"
Next in thread: Tex Texin: "Re: Names for UTF-8 with and without BOM"
Reply: Tex Texin: "Re: Names for UTF-8 with and without BOM"

Tex Texin scripsit:

> So when the parser gets JOECODE, I can understand ignoring the signature
> and autodetection, but exactly how does it find the first "<"?

Well, if it begins with a 0x00 byte, it can't be UTF-8 or UTF-16 (it might
be UTF-32 big-endian, but we'll suppose the parser can't handle that).
JOECODE is what's left. At worst it is in some other encoding and/or
not well-formed, in which case you expect an error and you get one.
Of course the processor knows that "<" is encoded as 0xFF in JOECODE....

The point is that signatures don't decode to a character: processors in
general, not just XML processors, are expected to skip them.
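
To illustrate the point about skipping signatures, here is a minimal sketch
in Python. The function name strip_signature and the SIGNATURES table are
hypothetical names chosen for this example; only the BOM constants come from
Python's standard codecs module. It assumes the processor is handed the raw
bytes and wants the payload with any leading signature removed.

```python
import codecs

# Known Unicode signatures (BOMs). UTF-32 entries come first because the
# UTF-32 LE signature begins with the same two bytes as the UTF-16 LE one.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def strip_signature(data: bytes):
    """Return (encoding_hint, payload) with any leading signature removed.

    The hint is None when no known signature is present; the signature
    itself never becomes part of the decoded text.
    """
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

Data without a signature passes through unchanged, so a caller that has been
told the encoding out of band can still run its input through this step.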

> It must have to try all of the encodings known to it... ugh.

In such a bad case, that's all you can do.
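
A rough sketch of that brute-force fallback, in Python: for each encoding
the processor knows, check whether the input starts with that encoding's
bytes for "<". The CANDIDATES table and the function name are hypothetical;
JOECODE is the made-up encoding from this thread, with "<" as 0xFF as
supposed above, and cp037 (an EBCDIC code page) stands in for "some other
encoding known to the parser".

```python
# Each entry pairs an encoding name with the bytes that encode "<" in it.
CANDIDATES = [
    ("utf-8", "<".encode("utf-8")),
    ("cp037", "<".encode("cp037")),  # EBCDIC code page, as a stand-in
    ("JOECODE", b"\xff"),            # hypothetical; byte value from the thread
]


def candidates_by_first_char(data: bytes):
    """Return the names of all known encodings in which data starts with '<'.

    An empty list means no known encoding fits, which the caller should
    report as an error.
    """
    return [name for name, lt in CANDIDATES if data.startswith(lt)]
```

The result is a list rather than a single name because nothing guarantees
that only one known encoding matches; a real processor would need a further
rule to break ties or would have to report the ambiguity.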

-- 
John Cowan  jcowan@reutershealth.com  www.reutershealth.com  www.ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today. 
       --_Specht v. Netscape_

This archive was generated by hypermail 2.1.5 : Sat Nov 02 2002 - 20:21:52 EST