From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Wed Jan 14 2004 - 18:34:17 EST
Consider CR (U+000D) and LF (U+000A) too.
Mark Davis wrote on 1/14/2004, 9:25 AM:
> I'm not sure which "one suggested heuristic method" you are referring to,
> but you are jumping to conclusions. For example, one of the heuristics is
> to judge which characters are more common when the bytes are interpreted
> as if they were in different encoding schemes. When picking between
> UTF-16BE and UTF-16LE, U+0020 is *still* much more common than U+2000,
> even in Thai.
>
> Mark
> __________________________________
> http://www.macchiato.com
> ► शिष्यादिच्छेत्पराजयम् ◄
>
> ----- Original Message -----
> From: "Peter Kirk" <peterkirk@qaya.org>
> To: "John Burger" <john@mitre.org>
> Cc: <unicode@unicode.org>
> Sent: Wed, 2004 Jan 14 08:12
> Subject: Re: Detecting encoding in Plain text
>
>
> > On 14/01/2004 07:16, John Burger wrote:
> >
> > > ...
> > > By the way, I still don't quite understand what's special about Thai.
> > > Could someone elaborate?
> > >
> > I mentioned Thai because it is the only language I know of which does
> > not use SPACE, U+0020. It also has at least some of its own
> > punctuation. So a Thai text need not include any characters U+00xx -
> > which rules out one suggested heuristic method.
> >
> > --
> > Peter Kirk
> > peter@qaya.org (personal)
> > peterkirk@qaya.org (work)
> > http://www.qaya.org/
> >
> >
> >
> >
>
>
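A minimal sketch of the heuristic Mark describes above, extended with CR and LF as suggested at the top of this message: read the bytes as 16-bit units in each byte order and count how many land on U+0020, U+000A, or U+000D. The function names and the choice of Python are assumptions for illustration, not anything proposed in the thread.

```python
# Hypothetical helpers, not from the thread: guess UTF-16 byte order by
# counting common control/space characters under each interpretation.
COMMON = {0x0020, 0x000A, 0x000D}  # SPACE, LF, CR


def count_common(data: bytes, byteorder: str) -> int:
    """Count 16-bit code units in `data` that fall in COMMON.

    `byteorder` is "big" or "little". A trailing odd byte is ignored.
    """
    total = 0
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        if unit in COMMON:
            total += 1
    return total


def guess_utf16_order(data: bytes) -> str:
    """Return "utf-16-be", "utf-16-le", or "unknown" when the counts tie."""
    be = count_common(data, "big")
    le = count_common(data, "little")
    if be > le:
        return "utf-16-be"
    if le > be:
        return "utf-16-le"
    return "unknown"
```

For text with neither spaces nor line breaks, both counts are zero and the sketch returns "unknown", so a real detector would need a fallback for that case.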