Re: Perhaps OT: Mysterious escape sequences in UN data

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Mar 31 2009 - 15:05:46 CST


    It does look like most of your examples represent two-byte escapes with
    each byte associated with a unique character.
    5e = é
    e5 = s
    66 = m
    e7 = p
    74 = í (i with accent)
    b2 = g

    I have no suggestion that would explain the values, but they seem to be
    consistent, so it should be possible to find a proper context for each
    byte, and to deal with combinations as derived from combinations of byte
    values (i.e. as code sequences) rather than treating them as ligatures.
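
    A minimal sketch of that per-byte approach, in Python. The table holds
    only the six byte guesses listed above; the names BYTE_GUESSES,
    decode_escape and decode_line are made up for illustration, and the
    mapping itself is a guess from context, not a known WordPerfect table.
    Escapes containing a byte that is not in the table are left untouched.

```python
import re

# Per-byte guesses taken from the examples in this thread.
# These are inferred from context, not from any documented code table.
BYTE_GUESSES = {
    0x5E: "é",
    0xE5: "s",
    0x66: "m",
    0xE7: "p",
    0x74: "í",
    0xB2: "g",
}

# Matches the LDC-style escape \x{hhhh} with exactly four hex digits.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]{4})\}")


def decode_escape(match):
    """Split the 16-bit value into two bytes and look each one up."""
    value = int(match.group(1), 16)
    high, low = value >> 8, value & 0xFF
    if high in BYTE_GUESSES and low in BYTE_GUESSES:
        return BYTE_GUESSES[high] + BYTE_GUESSES[low]
    # At least one byte is unknown: keep the escape as it was.
    return match.group(0)


def decode_line(text):
    """Replace every escape whose two bytes are both in the table."""
    return ESCAPE.sub(decode_escape, text)


if __name__ == "__main__":
    for sample in (r"misleading clich\x{5ee5} that",
                   r"highlighted by Mr. Rodr\x{74b2}uez",
                   r"an aide-m\x{5e66}oire issued"):
        print(decode_line(sample))
```

    An escape such as \x{e76f} passes through unchanged, because 6f has no
    entry in the table; extending the table one byte at a time is the point
    of treating these as code sequences.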

    A./

    On 3/31/2009 12:58 PM, John Burger wrote:
    > Hi -
    >
    > I have some parallel Chinese-English UN proceedings scraped from the
    > UN website some years ago, and further processed by the Linguistic
    > Data Consortium. I think the data were originally in one of the GB
    > variants, in MS Word or WordPerfect.
    >
    > The data is littered with some odd escape sequences, in both
    > languages, like this:
    >
    > ... Permanent Representatives and Charg\x{5ee5} daffaires of Kuwait,
    > Burundi ...
    > -\x{e76f}现?常任?事国 ...
    >
    > According to the LDC README, the "\x{}" is their way of escaping
    > WordPerfect encodings that they could not convert.
    >
    > I can guess what some of these are - e76f seems to occur in
    > contexts that indicate it's some kind of spacing character, perhaps a
    > tab. Oddly, most of the rest seem to represent =two= characters.
    > For instance 5ee5 seems to be "és":
    >
    > misleading clich\x{5ee5} that
    > Mr. Andr\x{5ee5} Pastrana Arango
    >
    > Here's some others:
    >
    > highlighted by Mr. Rodr\x{74b2}uez
    > issued by the Espace r\x{5ee7}ublicain
    > transmitting an aide-m\x{5e66}oire issued
    >
    > These seem like odd choices for ligatures. I can correct some of
    > these, but there are hundreds of different ones. Sorry if I'm
    > providing insufficient information, but can anyone shed any light on
    > this?
    >
    > Thanks!
    >
    > - John D. Burger
    > MITRE



    This archive was generated by hypermail 2.1.5 : Tue Mar 31 2009 - 15:33:40 CST