From: John Burger (john@mitre.org)
Date: Tue Mar 31 2009 - 13:58:37 CST
Hi -
I have some parallel Chinese-English UN proceedings scraped from the  
UN website some years ago, and further processed by the Linguistic  
Data Consortium.  I think the data were originally in one of the GB  
variants, in MS Word or WordPerfect.
The data are littered with odd escape sequences, in both  
languages, like this:
   ... Permanent Representatives and Charg\x{5ee5} daffaires of Kuwait, Burundi ...
   -\x{e76f}现?常任?事国 ...
According to the LDC README, the "\x{}" is their way of escaping  
WordPerfect encodings that they could not convert.
I can guess what some of these are - e76f seems to occur in  
contexts that indicate it's some kind of spacing character, perhaps a  
tab.  Oddly, most of the rest seem to represent =two= characters.  
For instance, 5ee5 seems to be "és":
   misleading clich\x{5ee5} that
   Mr. Andr\x{5ee5} Pastrana Arango
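To make that guess concrete, here is a minimal Python sketch of substituting it back into the text. The names GUESSES and unescape are just mine for illustration, and the single table entry is only my guess from the contexts above, not anything the LDC README confirms.

```python
import re

# Hypothetical table of guessed expansions, keyed by the hex digits
# inside \x{...}. The one entry is a guess from context, nothing more.
GUESSES = {"5ee5": "és"}

# Matches the \x{....} escapes and captures the hex digits.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")

def unescape(text, table=GUESSES):
    """Replace escapes found in the table; leave unknown ones untouched."""
    def repl(match):
        return table.get(match.group(1).lower(), match.group(0))
    return ESCAPE.sub(repl, text)

print(unescape(r"misleading clich\x{5ee5} that"))
print(unescape(r"Mr. Andr\x{5ee5} Pastrana Arango"))
```

Escapes that are not in the table pass through unchanged, so nothing is lost while the table is still incomplete.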
Here are some others:
   highlighted by Mr. Rodr\x{74b2}uez
   issued by the Espace r\x{5ee7}ublicain
   transmitting an aide-m\x{5e66}oire issued
These seem like odd choices for ligatures.  I can correct some of  
these, but there are hundreds of different ones. Sorry if I'm  
providing insufficient information, but can anyone shed any light on  
this?
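In case it helps anyone look at the problem, this is roughly how the distinct escapes could be tallied along with a little surrounding context. It is only a sketch: the function name tally_escapes and its parameters are made up for illustration, and the sample lines are the ones quoted above rather than the real corpus.

```python
import re
from collections import Counter, defaultdict

# Matches the \x{....} escapes and captures the hex digits.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")

def tally_escapes(lines, width=10, max_examples=3):
    """Count each escape code and keep a few context snippets per code."""
    counts = Counter()
    examples = defaultdict(list)
    for line in lines:
        for match in ESCAPE.finditer(line):
            code = match.group(1).lower()
            counts[code] += 1
            if len(examples[code]) < max_examples:
                start = max(0, match.start() - width)
                snippet = line[start:match.end() + width].strip()
                examples[code].append(snippet)
    return counts, examples

# Sample lines copied from the examples in this message.
sample = [
    r"misleading clich\x{5ee5} that",
    r"Mr. Andr\x{5ee5} Pastrana Arango",
    r"highlighted by Mr. Rodr\x{74b2}uez",
    r"issued by the Espace r\x{5ee7}ublicain",
    r"transmitting an aide-m\x{5e66}oire issued",
]

counts, examples = tally_escapes(sample)
for code, n in counts.most_common():
    print(code, n, examples[code])
```

Sorting by frequency would at least show which of the hundreds of codes matter most.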
Thanks!
- John D. Burger
   MITRE