RE: Basic question or maybe not

From: Paul Dempsey (Exchange) (paulde@exchange.microsoft.com)
Date: Thu Apr 29 1999 - 13:15:57 EDT


> -----Original Message-----
> From: Alfinito, Charles [mailto:AlfinitoC@cadmus.com]
> Sent: Thursday, April 29, 1999 6:15 AM
> To: Unicode List
> Subject: Basic question or maybe not
>
>
> The more I see Unicode the less I understand.
> My problem is this: I'm looking at a Rich Text Format file
> that has a
> Greek alpha in it. This is represented by \u-3999\'61.
>
> Well the \'61 is a Greek alpha. But convert the Unicode
> 3999 (0F9F) and I
> come up with a Tibetan TA.

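The number after the '\u' is the decimal representation of a
signed short (16-bit) integer. The number is -3999, not 3999.
Read as an unsigned 16-bit value, that is 65536 - 3999 = 61537,
or U+F061, which is in the Private Use Area, not the Tibetan block.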
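
As a minimal sketch of that conversion (Python is used here only for
illustration, and the function name rtf_u_to_code_point is hypothetical),
masking the signed value to 16 bits recovers the code point:

```python
def rtf_u_to_code_point(n: int) -> int:
    """Map the decimal number that follows the RTF u control word
    to a 16-bit Unicode code point."""
    # Accept both the signed form (-32768..32767) and the unsigned
    # form (0..65535), since either can be read back the same way.
    if not -32768 <= n <= 65535:
        raise ValueError(f"value out of 16-bit range: {n}")
    # Masking with 0xFFFF reinterprets a negative signed short
    # as the equivalent unsigned 16-bit value.
    return n & 0xFFFF


code_point = rtf_u_to_code_point(-3999)
print(f"U+{code_point:04X}")  # U+F061
```

Dropping the minus sign instead gives 3999 = U+0F9F, which is how the
Tibetan letter turned up.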

> Can anyone shed light on this and can anyone explain (simply)
> how a Unicode
> representation appears in an RTF file? What puts it there in
> the first
> place?

When the document contains a character that isn't representable
in the codepage for the RTF file (\cpg), Word writes the character
as the \u#### (signed decimal) Unicode value paired with the \'## (hex)
value of the closest approximation of the Unicode character in the
codepage.
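
The pairing described above can be sketched roughly as follows. This is
not Word's actual algorithm: the function name rtf_escape_char, the
default codepage, and the '?' fallback are assumptions made for
illustration, and a real writer would choose a closer approximation
than '?' where one exists.

```python
def rtf_escape_char(ch: str, codepage: str = "cp1252",
                    fallback: bytes = b"?") -> str:
    """Return an RTF representation of a single character (sketch only)."""
    # The three characters that are special in RTF get a backslash escape.
    if ch in "\\{}":
        return "\\" + ch
    try:
        raw = ch.encode(codepage)
    except UnicodeEncodeError:
        # Not representable in the codepage: write the signed decimal
        # Unicode value followed by a one-byte hex fallback.
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("this sketch handles 16-bit code points only")
        signed = cp - 0x10000 if cp > 0x7FFF else cp
        return f"\\u{signed}\\'{fallback[0]:02x}"
    # Representable: plain ASCII goes out as-is, other bytes as \'hh.
    return "".join(chr(b) if b < 0x80 else f"\\'{b:02x}" for b in raw)


# U+F061 is not in code page 1252, so it is written as a \u / \' pair.
print(rtf_escape_char("\uf061", fallback=b"a"))  # \u-3999\'61
```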

> My job consists of stripping out RTF code (through program
> scripts that I
> write) from a file and creating a plain ASCII text file.
> Since Unicode was
> introduced it has caused nothing but problems with interpreting what a
> character should be.

If you want ASCII, then ignore the \u# codes and
use only the \'## codes.
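
A rough sketch of that approach, assuming a single-byte codepage and
handling only these two escapes (the name rtf_fragment_to_text and the
cp1252 default are illustrative, and a real stripper still has to deal
with groups, other control words, and escaped backslashes):

```python
import re

# \u followed by an optionally negative decimal number; a single
# trailing space, if present, is the control word delimiter.
_U_CODE = re.compile(r"\\u-?\d+ ?")
# \' followed by exactly two hex digits.
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def rtf_fragment_to_text(fragment: str, codepage: str = "cp1252") -> str:
    """Drop the \\u codes and decode the \\'hh codes (sketch only)."""
    without_unicode = _U_CODE.sub("", fragment)

    def decode_byte(match: re.Match) -> str:
        byte = bytes([int(match.group(1), 16)])
        # Bytes the codepage does not define become U+FFFD.
        return byte.decode(codepage, errors="replace")

    return _HEX_ESCAPE.sub(decode_byte, without_unicode)


print(rtf_fragment_to_text(r"x = \u-3999\'61 + 1"))  # x = a + 1
```

Note that the example yields the letter "a", because \'61 is only an
alpha when the text is displayed in a symbol font.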

I recommend that you download the latest RTF specification from
the Microsoft web site. It details how Unicode and codepages
are handled in RTF.

--- Paul


