From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Jan 14 2004 - 12:25:13 EST
I'm not sure which "one suggested heuristic method" you are referring to, but
you are jumping to conclusions. For example, one of the heuristics is to judge
which characters are more common when the bytes are interpreted as if they were
in different encoding schemes. When picking between UTF-16BE and UTF-16LE,
U+0020 is *still* much more common than U+2000, even in Thai.
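
To make the byte-order heuristic concrete, here is a minimal Python sketch. It
is an illustration only, not code from this thread: the function name
guess_utf16_byte_order is hypothetical, and the rule is reduced to a single
comparison, namely how often the code unit decodes as U+0020 under each byte
order. The byte pair 00 20 is U+0020 in UTF-16BE but U+2000 in UTF-16LE, and
20 00 is the reverse, so counting the two pairs at even offsets is enough.

```python
def guess_utf16_byte_order(data: bytes):
    """Guess 'UTF-16BE' or 'UTF-16LE' by counting SPACE code units.

    Returns None when the counts are equal (including text with no spaces),
    in which case some other heuristic would be needed.
    """
    be_spaces = 0  # pairs 00 20: U+0020 if big-endian, U+2000 if little-endian
    le_spaces = 0  # pairs 20 00: U+0020 if little-endian, U+2000 if big-endian
    # Walk the input one 16-bit code unit at a time; a trailing odd byte is ignored.
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit == b"\x00\x20":
            be_spaces += 1
        elif unit == b"\x20\x00":
            le_spaces += 1
    if be_spaces > le_spaces:
        return "UTF-16BE"
    if le_spaces > be_spaces:
        return "UTF-16LE"
    return None


# Thai text containing a single U+0020 between two words.
sample = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u0e04\u0e23\u0e31\u0e1a"
print(guess_utf16_byte_order(sample.encode("utf-16-be")))
print(guess_utf16_byte_order(sample.encode("utf-16-le")))
```

A real detector would presumably weigh many characters rather than just SPACE,
and would check for a byte order mark first; this sketch only shows why one
U+0020 in a Thai text is already enough to tip the comparison.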
Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄
----- Original Message -----
From: "Peter Kirk" <peterkirk@qaya.org>
To: "John Burger" <john@mitre.org>
Cc: <unicode@unicode.org>
Sent: Wed, 2004 Jan 14 08:12
Subject: Re: Detecting encoding in Plain text
> On 14/01/2004 07:16, John Burger wrote:
>
> > ...
> > By the way, I still don't quite understand what's special about Thai.
> > Could someone elaborate?
> >
> I mentioned Thai because it is the only language I know of which does
> not use SPACE, U+0020. It also has at least some of its own
> punctuation. So a Thai text need not include any characters U+00xx -
> which rules out one suggested heuristic method.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 12:57:56 EST