Perhaps OT: Mysterious escape sequences in UN data

From: John Burger (john@mitre.org)
Date: Tue Mar 31 2009 - 13:58:37 CST

    Hi -

    I have some parallel Chinese-English UN proceedings scraped from the
    UN website some years ago, and further processed by the Linguistic
    Data Consortium. I think the data were originally in one of the GB
    variants, in MS Word or WordPerfect.

    The data are littered with odd escape sequences, in both
    languages, like this:

       ... Permanent Representatives and Charg\x{5ee5} daffaires of
    Kuwait, Burundi ...
       -\x{e76f}现?常任?事国 ...

    According to the LDC README, the "\x{}" is their way of escaping
    WordPerfect encodings that they could not convert.

    I can guess what some of these are - e76f seems to occur in
    contexts that indicate it's some kind of spacing character, perhaps a
    tab. Oddly, most of the rest seem to represent =two= characters.
    For instance, 5ee5 seems to be "és":

       misleading clich\x{5ee5} that
       Mr. Andr\x{5ee5} Pastrana Arango

    Here are some others:

       highlighted by Mr. Rodr\x{74b2}uez
       issued by the Espace r\x{5ee7}ublicain
       transmitting an aide-m\x{5e66}oire issued

    These seem like odd choices for ligatures. I can correct some of
    these, but there are hundreds of different ones. Sorry if I'm
    providing insufficient information, but can anyone shed any light on
    this?
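
    Correcting a known subset by hand could look roughly like the
    Python sketch below. The CORRECTIONS table and the apply_corrections
    function are hypothetical names; the single entry is only the
    unverified guess made above for 5ee5, and any code not in the table
    is left untouched and reported back so it can be looked at later.

```python
import re

# Assumed escape shape: backslash, "x", then hex digits inside braces.
ESCAPE_RE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")

# Hypothetical hand-built table. The one entry is a guess, not a verified mapping.
CORRECTIONS = {"5ee5": "és"}


def apply_corrections(text, table=CORRECTIONS):
    """Replace escapes found in `table`; return the new text and unknown codes."""
    unresolved = set()

    def fix(m):
        code = m.group(1).lower()
        if code in table:
            return table[code]
        # Unknown code: leave the escape exactly as it was and record it.
        unresolved.add(code)
        return m.group(0)

    return ESCAPE_RE.sub(fix, text), unresolved


fixed, unresolved = apply_corrections(r"clich\x{5ee5} that aide-m\x{5e66}oire")
print(fixed)
print(sorted(unresolved))
```

    Leaving unknown escapes in place keeps the pass repeatable: the
    table can grow over time and the same text can be run through again.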

    Thanks!

    - John D. Burger
       MITRE


