From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Mar 31 2009 - 15:05:46 CST
It does look like most of your examples are two-byte escapes, with 
each byte standing for a single character, independent of its neighbor.
5e = é
e5 = s
66 = m
e7 = p
74 = í  (i with accent)
b2 = g
I have no suggestion that would explain the values, but they seem to be 
consistent, so it should be possible to find a proper context for each 
byte, and to deal with combinations as derived from combinations of byte 
values (i.e. as code sequences) rather than treating them as ligatures.
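As a rough sketch of that per-byte approach, the following Python splits each four-digit escape into its two bytes and looks each one up separately. The table holds only the six values listed above; the names BYTE_MAP and decode_escapes are made up for illustration, and leaving an escape untouched when either byte is unknown is an assumption, not something the LDC README specifies.

```python
import re

# Byte-to-character table: only the six values inferred from the examples.
# Every other byte value is unknown and is deliberately not guessed at.
BYTE_MAP = {
    0x5E: "é",
    0xE5: "s",
    0x66: "m",
    0xE7: "p",
    0x74: "í",
    0xB2: "g",
}

# Matches the escape form \x{hhhh} with exactly four hex digits.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]{4})\}")


def decode_escapes(text, byte_map=BYTE_MAP):
    """Replace each \\x{hhhh} escape whose two bytes are both in byte_map."""

    def replace(match):
        value = int(match.group(1), 16)
        high, low = value >> 8, value & 0xFF
        if high in byte_map and low in byte_map:
            return byte_map[high] + byte_map[low]
        # At least one byte is unknown: keep the escape as it was.
        return match.group(0)

    return ESCAPE.sub(replace, text)
```

With this table, decode_escapes(r"Rodr\x{74b2}uez") returns "Rodríguez", while \x{e76f} is left as is because 6f has no entry yet. Extending the table one byte at a time should cover the remaining combinations without listing every two-byte pair.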
A./
On 3/31/2009 12:58 PM, John Burger wrote:
> Hi -
>
> I have some parallel Chinese-English UN proceedings scraped from the 
> UN website some years ago, and further processed by the Linguistic 
> Data Consortium.  I think the data were originally in one of the GB 
> variants, in MS Word or WordPerfect.
>
> The data is littered with some odd escape sequences, in both 
> languages, like this:
>
>   ... Permanent Representatives and Charg\x{5ee5} daffaires of Kuwait, 
> Burundi ...
>   -\x{e76f}现?常任?事国 ...
>
> According to the LDC README, the "\x{}" is their way of escaping 
> WordPerfect encodings that they could not convert.
>
> I can guess what some of these are - e76f seems to occur in 
> contexts that indicate it's some kind of spacing character, perhaps a 
> tab.  Oddly, most of the rest seem to represent =two= characters.
> For instance 5ee5 seems to be "és":
>
>   misleading clich\x{5ee5} that
>   Mr. Andr\x{5ee5} Pastrana Arango
>
> Here's some others:
>
>   highlighted by Mr. Rodr\x{74b2}uez
>   issued by the Espace r\x{5ee7}ublicain
>   transmitting an aide-m\x{5e66}oire issued
>
> These seem like odd choices for ligatures.  I can correct some of 
> these, but there are hundreds of different ones. Sorry if I'm 
> providing insufficient information, but can anyone shed any light on 
> this?
>
> Thanks!
>
> - John D. Burger
>   MITRE