RE: Wordprocessors in Korean

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Jul 16 2001 - 21:51:40 EDT


On Mon, 16 Jul 2001, Seuk Soo Sung wrote:

  Thank you for your detailed reply.

> First of all, sorry for the long mail. If you are not interested in
> Korean word processors, please stop here.

   Here are a few more questions (related to Unicode).

SS> not true at all. MS Word2K support full modern Hangul character set
SS> (11,172 chars) and almost 1.6M Old Hangul. Table feature was

  I also read some reports available at the 'Non-standard character
registration center' (I found the page at http://www.sejong.or.kr and
http://www.korean.go.kr), and one of them (dated Dec. 2000) gives more
or less the same information about MS Word 2000 as you did. However,
that differs from what Chris Pratley wrote about MS Word 2000:
according to him, MS Word 2000 supports only ~5000 'frequently used'
(or found in old literature) syllables and some 200 Gugyeol characters
using the PUA.

CP> Word2000 does it using an add-in that uses the Unicode PUA to support
CP> about 5000 Old Hangul pre-composed glyphs.

 You also wrote the following. Is the PUA large enough to encode all
~1.6M syllables?

SS> A. MS Word2000
SS> MS Word2000 used PUA area in Unicode to implement Old Hangul and
SS> GooGyeul (actually code assignment for these characters) and MS

 Moreover, according to the aforementioned report, an HTML file
exported by MS Word 2000 (with Middle Korean) can be rendered perfectly
well with Netscape 6.0 under MS-Windows 98SE. I believe this is another
indication that the PUA is being used to represent Middle Korean in MS
Word docs and in the HTML files exported from them, because otherwise
Netscape 6.0 (which, as far as I know, didn't use Uniscribe in December
2000 when the report was written) would not have been able to render
Middle Korean correctly.

 I also looked into a font (New Gulim) I downloaded from
<http://www.officeupdate.com/korea/2000/articles/weboldhg.htm> (I found
the link to this page at <http://www.korean.go.kr/user_env.html>), and
indeed it looks like it uses the PUA to represent about 5000 Old Hangul
syllables and ~250 Gugyeol characters. On the other hand, the other font
I downloaded from the same page (Old Gulim) appears to have OpenType
data for composing glyphs for syllables out of glyphs for Jamos.

   Which is the case? Does MS Word 2000 support all 1.6M syllables or
just a subset (about 5000)? Perhaps about 5000 of them are supported
with pre-composed glyphs in the PUA and the rest are supported by
composing glyphs with OpenType data. Am I right?
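
  As a rough sanity check on the PUA question, the sketch below adds up
the three Private Use Areas defined by Unicode and compares the total
with the theoretical figure of 1,686,250 syllables quoted in this
thread. The names PUA_RANGES, pua_capacity and fits_in_pua are
illustrative only, and nothing here reflects how MS Word actually
assigns code points.

```python
# Inclusive code point ranges of the Unicode Private Use Areas.
PUA_RANGES = {
    "BMP PUA": (0xE000, 0xF8FF),
    "Plane 15 (PUA-A)": (0xF0000, 0xFFFFD),
    "Plane 16 (PUA-B)": (0x100000, 0x10FFFD),
}

# Theoretical Old Hangul syllable count quoted in this thread.
OLD_HANGUL_SYLLABLES = 1_686_250


def pua_capacity():
    """Return the number of code points in each Private Use Area."""
    return {name: hi - lo + 1 for name, (lo, hi) in PUA_RANGES.items()}


def fits_in_pua(count, bmp_only=False):
    """Check whether `count` characters fit in the PUA (optionally BMP only)."""
    sizes = pua_capacity()
    available = sizes["BMP PUA"] if bmp_only else sum(sizes.values())
    return count <= available


if __name__ == "__main__":
    for name, size in pua_capacity().items():
        print(f"{name}: {size}")
    print("all ~1.6M fit:", fits_in_pua(OLD_HANGUL_SYLLABLES))
    print("5000 + 250 fit in BMP PUA:", fits_in_pua(5000 + 250, bmp_only=True))
```

  Under these assumptions the three areas together hold far fewer than
1.6M code points, while a subset of about 5000 syllables plus ~250
Gugyeol characters fits in the BMP PUA alone.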

SS> 1. Supported characters
SS> and CJK Hanja characters (U+3400 ~ U+9FA5). Second is Old Hangul which
SS> was not used now but exist in old written documents since 1446 (Some
SS> Korean are using the Old Hangul for special purpose yet, even though it
SS> is not a standard right now in Korea). The number of Jamo in Old Hangul
SS> are much more than modern Hangul Jamo. According to our research result
SS> with National Language Research Institute, we defined 125 leading
SS> consonants, 95 vowels, and 141 trailing consonants for Old Hangul (of
SS> course, including modern Hangul because Old Hangul is proper superset of
SS> modern Hangul). Theoretically, we can make 1,686,250 (LV type =125*95,
SS> LVT type = 125*95*141) characters with these Jamo combination. MS Word
SS> support all of them at all.

   The last sentence is a bit confusing. Did you mean MS Word 2000
supports all ~1.6M syllables? How about 'incomplete syllables'
(*both* in modern Korean and Middle Korean) like 'LCF + V', 'LCF + V +
T', 'L + MVF + T', 'LCF + MVF + T', where LCF and MVF denote the leading
consonant filler and the medial vowel filler? If you're using Uniscribe
and U+1100 Hangul Jamos to compose Hangul syllables with the OpenType
data present in fonts, I don't see why these cannot be supported in MS
Word 2002.
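
  To make the 'incomplete syllable' cases concrete, here is a minimal
sketch that builds such sequences with the two fillers Unicode provides
in the U+1100 block (U+115F and U+1160). The helper names jamo_sequence
and describe are made up for illustration; whether a given renderer or
font displays these sequences properly is a separate question.

```python
import unicodedata

CHOSEONG_FILLER = "\u115F"   # leading consonant filler (LCF)
JUNGSEONG_FILLER = "\u1160"  # medial vowel filler (MVF)


def jamo_sequence(lead=None, vowel=None, trail=None):
    """Build an L V [T] sequence, using a filler for a missing L or V."""
    seq = (lead or CHOSEONG_FILLER) + (vowel or JUNGSEONG_FILLER)
    if trail:
        seq += trail
    return seq


def describe(seq):
    """List each code point of `seq` with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


if __name__ == "__main__":
    # 'LCF + V': vowel A with no leading consonant.
    print(describe(jamo_sequence(vowel="\u1161")))
    # 'L + MVF + T': leading KIYEOK, no vowel, trailing KIYEOK.
    print(describe(jamo_sequence(lead="\u1100", trail="\u11A8")))
    # 'LCF + MVF + T': trailing consonant only.
    print(describe(jamo_sequence(trail="\u11A8")))
```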

  As for the 125 leading consonants, 95 medial vowels, and 141 trailing
consonants, how do you represent them in Unicode? As of Unicode
3.1, the U+1100 Jamo block defines 90/66/82 Jamos (this is where my
~500k syllable count came from, as opposed to ~1.6M), and the block
doesn't even have room for all 125/95/141. Let's take the example
of a medial vowel, 'U + YEO' (U+116E + U+1167) or 'U +
I + EO' (U+116E + U+1175 + U+1165). It's not in the U+1100 block,
but I'm sure it's available in MS Word 2000 (or 2002) because that
particular vowel is used every day by modern Korean speakers and is
also among the 5 new vowels requested for addition to ISO 10646 by DPRK
(see <http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2243.pdf>). Internally,
I guess you can do whatever you want to do. What I'm curious about is
how it's represented in HTML (or XML) when you export an MS Word
document to an HTML (or XML) file. Do you use 'U + YEO' or 'U + I + EO'?
If that's the case, does it mean Uniscribe (and thus MS IE 6.0 if not
MS IE 5.5) can render 'U + YEO' as 'WYEO' given a font with the
necessary OpenType data?
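
  The counts above are easy to reproduce. The sketch below computes the
LV + LVT totals for the three Jamo inventories mentioned in this thread
and spells out the two candidate sequences for the vowel in question.
The function names are illustrative; the character names come from
Python's unicodedata module.

```python
import unicodedata


def syllable_count(leads, vowels, trails):
    """Number of LV plus LVT combinations for a given Jamo inventory."""
    return leads * vowels + leads * vowels * trails


# The two candidate spellings of the medial vowel discussed above.
U_YEO = "\u116E\u1167"
U_I_EO = "\u116E\u1175\u1165"


def jamo_names(seq):
    """Return the Unicode character name of each code point in `seq`."""
    return [unicodedata.name(ch) for ch in seq]


if __name__ == "__main__":
    print("modern Hangul (19/21/27):", syllable_count(19, 21, 27))
    print("U+1100 block (90/66/82): ", syllable_count(90, 66, 82))
    print("quoted set (125/95/141): ", syllable_count(125, 95, 141))
    print(jamo_names(U_YEO))
    print(jamo_names(U_I_EO))
```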

> The third one is GooGyeul which was used in
> old written documents. GooGyeul glyphs look similar to Hanja
> characters but they have been used in Korea in a much different way
> for a very long time. MS Word also implemented 255 GooGyeul characters
> in cooperation with National Language Research Institute and The
> Academy of Korean Studies.

  Again, how do you represent the 255 Gugyeol characters in both MS Word
2000 and 2002? Do you use the PUA here again? Or do you just 'recycle'
CJK ideographs, because most, if not all, Gugyeol characters are already
included in the four CJK ideograph blocks and the CJK radical block in
Unicode? (If their shapes are the same as existing CJK ideographs, I
guess it's hard to make a case for encoding them separately.) Actually,
the internal representation is less of a concern to me than the way
they're represented when they're exported into other formats such as
HTML.
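
  One way to answer this from the outside, given an exported file, is
to tally which Unicode ranges its characters fall into. The sketch below
is a hypothetical helper, not anything MS Word provides: the BLOCKS
table lists only a few ranges relevant to this thread, and the function
only resolves character references; it does not strip HTML tags.

```python
import html
from collections import Counter

# A few inclusive ranges relevant here; the list is far from complete.
BLOCKS = [
    ("Hangul Jamo", 0x1100, 0x11FF),
    ("CJK Radicals Supplement", 0x2E80, 0x2EFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Hangul Syllables", 0xAC00, 0xD7A3),
    ("Private Use Area (BMP)", 0xE000, 0xF8FF),
]


def block_of(ch):
    """Return the name of the listed range containing `ch`, else 'other'."""
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def block_histogram(fragment):
    """Count non-space characters per range after resolving &#x...; refs."""
    text = html.unescape(fragment)
    return Counter(block_of(ch) for ch in text if not ch.isspace())


if __name__ == "__main__":
    sample = "&#xE000; &#x1100;&#x1161; \uAC00 \u4E00"
    for name, count in block_histogram(sample).items():
        print(f"{name}: {count}")
```

  A high count in the Private Use Area for text that should be Old
Hangul or Gugyeol would suggest PUA-based encoding, whereas counts in
the Hangul Jamo or CJK ranges would suggest the other approaches.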

  For others on the list, I put up a screenshot of the glyphs for
Gugyeol in 'New Gulim' at <http://jshin.net/~jungshik/i18n/gugyeol.png>.
Most of them look like Chinese characters and radicals encoded in
Unicode 3.x and don't need to be encoded (I think Unihan-3.1.txt has
some cross-references to Gugyeol characters), but a few of them might
need to be encoded separately. Perhaps I should begin a new thread on
this issue.

> 2. How implemented in MS Word

> B. MS Word2002
> MS Word2002 implemented Old Hangul using Uniscribe engine. MS Word2002
> also shipped Cicero input tool for Old Hangul input. Word2002 provided
> new OpenType font files for Old Hangul glyph composition and used the
> Jamos in U+1100 to compose Old Hangul characters directly using
> Uniscribe engine.

> 3. Compatibility with IE
> Yes, we tested it on MS Windows2000 and WindowsXP with IE6.0. IE6.0 has
> a capability to display OldHangul characters which was made by Word2000
> and Word2002.

  Have you tried html docs with Old Hangul represented in U+1100 Jamos
other than those produced by MS Word 2000/2002? I have a very simple
test page at <http://jshin.net/~jungshik/i18n/middle.html>. Could you
try that?
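
  For anyone who wants to produce a similar test page, here is a minimal
sketch that writes Jamo sequences as numeric character references so the
file survives any transport encoding. The function names and the page
layout are my own assumptions and are not the contents of the page
linked above; the sample is one of the sequences asked about in this
mail.

```python
def to_ncr(text):
    """Encode every character of `text` as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):04X};" for ch in text)


def build_test_page(samples, title="Old Hangul Jamo test"):
    """Return a small HTML page with one paragraph per Jamo sequence."""
    rows = "\n".join(f"<p>{to_ncr(s)}</p>" for s in samples)
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        f"<title>{title}</title></head>\n"
        f"<body>\n{rows}\n</body></html>\n"
    )


if __name__ == "__main__":
    # SSANGKIYEOK + U + YEO + SSANGSIOS, spelled with U+1100 block Jamos.
    print(build_test_page(["\u1101\u116E\u1167\u11BB"]))
```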

  Can I say that both MS IE (6.0) and MS Word 2002 use the same
underlying rendering engine (Uniscribe 1.0)? Or is there any other
way to install Uniscribe (which I guess is included in MS Windows 2000
and XP by default)? If that's the case, would installing MS IE 6.0
on a computer running Windows 9x/ME/NT 4 make the related Uniscribe APIs
available to programs other than MS IE and MS Word 2002? For instance,
can I use functions like 'ScriptShape' (in Uniscribe) to get a list
of glyphs for a Unicode string with Hangul Jamos (as long as fonts with
OpenType data for combining Jamos are available) in my program (say,
a simple text editor or terminal emulator)? Does 'ScriptShape' work
for combinations like 'U+1101 U+116E U+1167 U+11BB' or 'U+1101 U+116E
U+1175 U+1165 U+11BB'? Maybe it depends on the OpenType data included
in the fonts. Then the question would be whether this kind of
information is included in the OpenType data of the Old Hangul fonts
that come with MS Word 2002.

> If you are using English Word2002 and Cicero input
> tool, and want to use Old Hangul, then you just need to select
> "Microsoft Korean Old Hangul input" in Cicero toolbar and install some
> additional fonts that have OpenType data to handle a proper Jamo
> combination with Uniscribe engine from Korean version of MS Word2002
> PlusPack CD. Please contact me if you want to have these fonts (mail
> to me directly).
> GooGyeul input tool is still ActiveX add-in and can be installed from
> Korean PlusPack CD.

 As I wrote above, I'm wondering how 'Old Hangul' and 'Gugyeol' are
represented when an MS Word file is exported into an HTML file. Can you
send a sample MS Word file (e.g. Too-shi Eon-hae, either the 1st
edition - Choganbon - or the 2nd edition - Chungganbon) and the HTML
files (produced by MS Word 2000 and 2002) off-line?

   Once again, thank you for your kind reply and great work to support
Middle Korean,

    Jungshik Shin


