Another way to put it is that the Greek text Charles has is in fact NOT
Greek, just a collection of symbols that happens to look like Greek. As
such, the file is essentially corrupt. Word doesn't consider symbols to be
part of human languages, so the Greek proofing tools don't run on them, etc.
The "61" is the hexadecimal index into the symbol font for the chosen symbol, which
happens to be an alpha. It is a "lucky" coincidence (actually by design)
that the index values into the symbol font for the Greek symbols happen to
match the codepoints in the Greek code page.
If you want to convert some text from "symbolic" Greek to actual Greek (that
is, in the Unicode Greek range), there is a macro downloadable from the
Greek Microsoft web site that will do this. You can then use fonts like
Arial and Times New Roman on the text since it is now "real" Greek. The
macro exists because documents sometimes end up encoded this way when
converted to Unicode. Someone using a pre-Unicode version of Word
(e.g. Word6) may have unknowingly used a symbol font (like Symbol) rather
than a WGL TrueType font (like Arial), or may have used a hacked
third-party font that was incorrectly encoded as a symbol font; in either
case the conversion to Unicode failed. Once in Word97, Greek text typed via a Greek
keyboard is unequivocally Greek in Unicode, getting rid of this confusion
for the future. This is one of the benefits of Unicode - disambiguating text
from symbols, and keeping everything straight and uniquely encoded
internally.
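To make the idea of such a conversion concrete, here is a minimal Python sketch. It is not the Microsoft macro mentioned above, only an illustration. The function name and the four-entry table are assumptions for the example: the table follows the layout of the Symbol font (where x61 is alpha, as noted in this thread), and a real conversion would need the full table for whichever font was actually used.

```python
# Partial, illustrative table: 8-bit Symbol-font index -> Unicode Greek letter.
# Only four entries are listed; this is an assumption-laden sample, not a
# complete or authoritative mapping.
SYMBOL_TO_GREEK = {
    0x61: "\u03B1",  # alpha
    0x62: "\u03B2",  # beta
    0x67: "\u03B3",  # gamma
    0x64: "\u03B4",  # delta
}


def symbol_to_greek(text):
    """Replace symbol-font PUA characters (U+F020..U+F0FF) with real Greek.

    Characters outside the PUA sub-range, and PUA characters that are not in
    the table, are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xF020 <= cp <= 0xF0FF:
            out.append(SYMBOL_TO_GREEK.get(cp - 0xF000, ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    converted = symbol_to_greek("\uF061\uF062 text")
    print([hex(ord(c)) for c in converted[:2]])
```

After such a conversion the text is in the Unicode Greek range, so ordinary fonts and the Greek proofing tools can treat it as Greek.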
Another poster mentioned that the "codepage" value following the Unicode
representation is in the base code page of the RTF. This isn't quite true -
it is by default, but can be overridden by the charset of the font applied
to the text, which can vary within a single file. Definitely, Charles should
download the RTF spec from Microsoft and become familiar with this.
Finally, if the goal is to create plain text out of a Word document, then
Charles should simply save as plain text (ANSI), or as Unicode text from
Word97, and skip RTF altogether.
Chris Pratley
Microsoft Office Program Manager
-----Original Message-----
From: peter_constable@sil.org [mailto:peter_constable@sil.org]
Sent: Thursday, April 29, 1999 11:49 AM
To: Unicode List
Subject: Re: Basic question or maybe not
Charles:
There's a little more to your query that the responses so far
haven't touched on: when you see \u-3999 in the RTF file, the
number after the \u is a decimal representation of a *signed*
16-bit integer. In other words, if you see the minus sign, add
65536 to the number. In this case, -3999 + 65536 = 61537 = xF061.
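The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not a full RTF parser: the function names are made up for the example, the sample fragment is hypothetical, and the regular expression looks only at the decimal number after \u, ignoring the fallback character that follows it.

```python
import re


def rtf_u_to_code_unit(n):
    """Turn the signed 16-bit decimal after \\u into an unsigned value."""
    if not -32768 <= n <= 65535:
        raise ValueError("value does not fit in 16 bits: %d" % n)
    return n + 65536 if n < 0 else n


def unicode_escapes(rtf):
    """Return the unsigned values of all \\uN escapes found in an RTF string."""
    return [rtf_u_to_code_unit(int(num)) for num in re.findall(r"\\u(-?\d+)", rtf)]


if __name__ == "__main__":
    sample = r"\u-3999\'61"  # hypothetical fragment, for illustration only
    for value in unicode_escapes(sample):
        print("U+%04X" % value)
```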
Now, if you look up U+F061, you'll find that it's in the middle
of the Private Use Area. Why? Well, back when Microsoft
introduced TrueType fonts, they decided that they wanted to use
Unicode internal to the font, but they needed some way to deal
with things like symbols and dingbats - such a character might
not ever be assigned a Unicode value. The solution they adopted
was to allow two flavours of TrueType fonts: symbol-encoded,
and "WGL" or "UGL". The latter use only standard Unicode
character allocations, and can contain large numbers of glyphs.
The former have all the glyphs accessed in the font using the
PUA sub-range U+F020 - U+F0FF. If Windows detects that a font
is a "symbol" font (charset 2), then it will map 8-bit codes
x20 - xFF to xF020 - xF0FF by adding xF000. So, what you're
looking at is a character that's formatted with a symbol font.
In fact, the Symbol font that ships with Windows has a Greek
alpha at xF061.
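As a small illustration of the symbol-font mapping just described, the following Python sketch adds or removes the xF000 offset. The names are invented for the example, and it models only the arithmetic, not how Windows detects a symbol font.

```python
SYMBOL_PUA_BASE = 0xF000  # offset described above for symbol-encoded fonts


def symbol_byte_to_pua(code):
    """Map an 8-bit symbol-font code (x20..xFF) into the PUA sub-range."""
    if not 0x20 <= code <= 0xFF:
        raise ValueError("outside x20..xFF: %#x" % code)
    return SYMBOL_PUA_BASE + code


def pua_to_symbol_byte(code_point):
    """Recover the 8-bit code from a PUA value in xF020..xF0FF."""
    if not 0xF020 <= code_point <= 0xF0FF:
        raise ValueError("outside xF020..xF0FF: %#x" % code_point)
    return code_point - SYMBOL_PUA_BASE


if __name__ == "__main__":
    print("%#x" % symbol_byte_to_pua(0x61))
    print("%#x" % pua_to_symbol_byte(0xF061))
```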
Peter Constable
Non-Roman Script Initiative, SIL