From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 14 2004 - 12:52:41 EST
On 14/01/2004 09:25, Mark Davis wrote:
>I'm not sure which "one suggested heuristic method" you are referring to, ...
>
Basically the one which relies on UTF-16 text being likely to contain many
zero bytes in either the odd or the even positions, depending on byte order.
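
To make that concrete, here is a minimal sketch of the zero-byte heuristic as I understand it. The function name and the simple "whichever side has more zeros wins" rule are my own assumptions for illustration, not taken from any particular implementation:

```python
from typing import Optional


def guess_utf16_order_by_zero_bytes(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order from where the zero bytes fall.

    Hypothetical helper: returns "utf-16-be", "utf-16-le", or None
    when the zero bytes give no evidence either way.
    """
    # Offsets are 0-based: even offsets hold the first byte of each
    # 16-bit code unit, odd offsets hold the second byte.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)

    if even_zeros > odd_zeros:
        # High byte comes first and is often zero (e.g. Latin text).
        return "utf-16-be"
    if odd_zeros > even_zeros:
        # High byte comes second and is often zero.
        return "utf-16-le"
    # No zero bytes at all, or an exact tie: the heuristic cannot decide.
    return None
```

For Latin-script text this decides easily, but for text whose characters all lie outside U+0000..U+00FF there may be no zero bytes at all, and the sketch returns None.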
>... but
>you are jumping to conclusions. For example, one of the heuristics is to judge
>which characters are more common when the bytes are interpreted as if they were in
>different encoding schemes. When picking between UTF-16BE and UTF-16LE, U+0020 is
>*still* much more common than U+2000, even in Thai.
>
Not necessarily. In certain texts neither might occur at all, so the
heuristic fails.
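
A minimal sketch of the U+0020 versus U+2000 comparison, showing the case where it has nothing to go on. The function name and the decision rule are hypothetical, chosen only to illustrate the point:

```python
from typing import Optional


def guess_utf16_order_by_space(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order by assuming U+0020 outnumbers U+2000.

    Hypothetical helper: returns "utf-16-be", "utf-16-le", or None
    when neither byte pattern occurs (or the counts are tied).
    """
    pairs_00_20 = 0  # U+0020 if big-endian, U+2000 if little-endian
    pairs_20_00 = 0  # U+0020 if little-endian, U+2000 if big-endian

    # Walk the data one 16-bit code unit at a time, ignoring a
    # trailing odd byte if there is one.
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit == b"\x00\x20":
            pairs_00_20 += 1
        elif unit == b"\x20\x00":
            pairs_20_00 += 1

    if pairs_00_20 > pairs_20_00:
        return "utf-16-be"
    if pairs_20_00 > pairs_00_20:
        return "utf-16-le"
    # Neither character occurs, or they are equally common: no answer.
    return None
```

Thai text with spaces is classified correctly, but a run of Thai with no spaces contains neither byte pattern, so the sketch returns None.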
I agree with Mark S and others that more sophisticated methods are
likely to be safer.
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/