Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 14 2004 - 07:33:34 EST

    On 13/01/2004 18:05, D. Starner wrote:

    >Peter Kirk writes:
    >
    >>I agree that heuristics should be adjusted for Thai. But problems may
    >>arise if they have to be adjusted individually, and without regression
    >>errors, for all 6000+ world languages.
    >>
    >
    >Thai is hard because of the writing system. But most writing systems weren't
    >encoded pre-Unicode, so if they were typed into a computer, it was with
    >a Latin (or Cyrillic?) transliteration that probably used spaces and new lines,
    >and in fact was probably ASCII.
    >
    >More cynically, those who use obscure character sets or font encodings have
    >trouble viewing them; that is one of the reasons for Unicode. That this tool
    >may to some extent be an example of that problem is a simple fact of life,
    >and doesn't call for it to be thrown out.
    >
    >

    Either you are confused or I am. I was not referring to pre-Unicode
    legacy encodings. I was referring to Unicode plain text data which may
    (when Unicode includes all the necessary characters) be in any one of
    6000+ languages, some of which have a variety of scripts and spelling
    conventions. The problem is not that people are using obscure legacy
    encodings, but that they are not adequately identifying which UTF they use.
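
    To make the point above concrete, here is a minimal sketch of the kind of
    check that is available when a text does not say which UTF it is in. It is
    only an illustration, not a tool discussed in this thread; the function
    name guess_utf is hypothetical. It looks for a byte-order mark first and
    otherwise falls back on whether the bytes happen to be valid UTF-8.

```python
import codecs

# Byte-order marks, with UTF-32 listed before UTF-16 because the UTF-32 LE
# mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_utf(data: bytes):
    """Guess the UTF of `data` (hypothetical helper, illustration only).

    Returns an encoding name, or None when nothing can be concluded.
    """
    # 1. An explicit byte-order mark settles the question.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No mark: the only language-independent test left here is whether
    #    the bytes form valid UTF-8. This is a heuristic, not a proof.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

    Note how weak the fallback is. Thai letters lie in U+0E00..U+0E7F, so Thai
    text in UTF-16 without a byte-order mark consists entirely of bytes below
    0x80 and passes the UTF-8 test; this sketch would report it as UTF-8.
    Anything better has to start looking at the content of the text.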

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

