Re: Detecting encoding in Plain text

From: D. Starner (shalesller@writeme.com)
Date: Tue Jan 13 2004 - 20:55:57 EST

    ----- Original Message -----
    From: Peter Kirk <peterkirk@qaya.org>
    Date: Tue, 13 Jan 2004 09:03:48 -0800
    To: Doug Ewell <dewell@adelphia.net>
    Subject: Re: Detecting encoding in Plain text

    > On 13/01/2004 08:34, Doug Ewell wrote:
    >
    > >Peter Kirk <peterkirk at qaya dot org> wrote:
    > >
    > >
    > >
    > >>>If a certain Unicode plain text file uses ASCII punctuation OR spaces
    > >>>OR end-of-line characters, AND the file is neither too short nor
    > >>>very oddly formatted, then the algorithm should work.
    > >>>
    > >>>
    > >>True. But there may be certain languages (perhaps Thai?) for which all
    > >>of these circumstances regularly occur together. It would be very
    > >>inconvenient for users of these languages if programs regularly
    > >>attribute the wrong encoding to their text.
    > >>
    > >>
    > >
    > >Whether this is specifically true for Thai or not -- and I doubt that
    > >the "short file or odd formatting" condition could ever be considered
    > >language-dependent -- I would say an otherwise-good heuristic that
    > >performs badly for Thai ought to have special cases built in for Thai,
    > >rather than being discarded.
    > >
    > >
    > >
    > >
    > I may have confused you with what I wrote, but my "all of these
    > circumstances" referred not to the "short file or odd formatting"
    > condition, but to Marco's "*all* these circumstances", which you
    > snipped and which originally read:
    >
    > >Some scripts include their own digits and punctuation; not all scripts
    > >use spaces; and controls are not necessarily used, if U+2028 LINE
    > >SEPARATOR is used for new lines.
    > >
    >
    > I agree that heuristics should be adjusted for Thai. But problems may
    > arise if they have to be adjusted individually, and without regression
    > errors, for all 6000+ world languages.
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >
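
The thread does not spell out the algorithm itself, so the following is only a minimal sketch of the kind of heuristic being debated, under the assumption that it guesses UTF-16 byte order by counting code units that look like ASCII spaces, punctuation, or end-of-line characters. The names guess_utf16_byte_order, ASCII_MARKERS, and min_hits are hypothetical and exist only for illustration.

```python
# Hypothetical sketch, not the algorithm from the thread: guess UTF-16 byte
# order by counting 16-bit units that look like ASCII space, punctuation,
# or end-of-line characters in each candidate byte order.

# Byte values treated as evidence (an assumed, illustrative set).
ASCII_MARKERS = set(b" \t\r\n.,;:!?\"'()-")


def guess_utf16_byte_order(data: bytes, min_hits: int = 8) -> str:
    """Return 'utf-16-le', 'utf-16-be', or 'unknown'."""
    le_hits = 0
    be_hits = 0
    # Walk the data two bytes at a time, ignoring a trailing odd byte.
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        # Little-endian: the ASCII byte comes first, the zero byte second.
        if second == 0 and first in ASCII_MARKERS:
            le_hits += 1
        # Big-endian: the zero byte comes first, the ASCII byte second.
        if first == 0 and second in ASCII_MARKERS:
            be_hits += 1
    # Too little evidence: the file is too short or has no ASCII markers.
    if max(le_hits, be_hits) < min_hits:
        return "unknown"
    if le_hits > be_hits:
        return "utf-16-le"
    if be_hits > le_hits:
        return "utf-16-be"
    return "unknown"


if __name__ == "__main__":
    english = "Hello, world. This is a test.\n" * 3
    print(guess_utf16_byte_order(english.encode("utf-16-le")))

    # Thai text with no spaces, separated by U+2028 LINE SEPARATOR.
    thai = "\u2028".join(["\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"] * 20)
    print(guess_utf16_byte_order(thai.encode("utf-16-le")))
```

In this sketch, text that has no ASCII spaces or punctuation and uses U+2028 for line breaks yields no evidence at all, which is the case Marco's list of circumstances describes. The sketch reports "unknown" for such input instead of guessing, so a caller could fall back to some other method.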


