RE: Wordprocessors in Korean

From: Seuk Soo Sung (seuksoos@microsoft.com)
Date: Tue Jul 24 2001 - 01:37:51 EDT


First of all, I am sorry that I couldn't respond quickly because of a
personal matter last week.

On Tue, 17 Jul 2001, Jungshik Shin wrote:

JS> Which is the case? Does MS Word 2000 support all 1.6M syllables or
JS> just a subset (about 5000)? Perhaps, about 5000 of them are
JS> supported with pre-composed glyphs in PUA and the rest are supported
JS> by composing glyphs with opentype data. Am I right?

Yes. MS Word2000 supported Old Hangul in two categories of characters,
for reasons of performance and efficiency. The first (level 1, on by
default in the input tool) is the set of known characters found in
existing old written documents. That is about 5000 characters, which
may be frequently used by Korean language researchers. They are in the
PUA as pre-composed glyphs (U+E0BC ~ U+EFFF and U+F100 ~ U+F66E) in the
New Gulim, New GungSuh, New Batang, and New Dotum font files (you can
download these fonts from http://www.korean.go.kr). The second (level 2,
a user option that is off by default in the input tool) is glyph
composition. As you know, someone may find new Old Hangul characters
sooner or later and need to use them. Therefore, MS also added Old
Hangul Jamo glyphs to the PUA (leading consonant: U+F785 ~ U+F800,
vowel: U+F807 ~ U+F864, trailing consonant: U+F86B ~ U+F8F7) and makes
a character by composing those glyphs. In summary, MS Word2000 supports
about 5000 pre-composed Old Hangul characters and a total of 1.6M Old
Hangul characters through composition of pre-defined glyphs.
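
As a rough illustration of the two levels, the PUA ranges listed above
can be put into a small lookup table. This is only a sketch and not MS
code: the function names and the category labels are made up here, and
only the range boundaries come from this message.

```python
# Ranges copied from the message above; labels are illustrative only.
WORD2000_PUA_RANGES = [
    (0xE0BC, 0xEFFF, "precomposed syllable"),
    (0xF100, 0xF66E, "precomposed syllable"),
    (0xF785, 0xF800, "leading consonant glyph"),
    (0xF807, 0xF864, "vowel glyph"),
    (0xF86B, 0xF8F7, "trailing consonant glyph"),
]


def classify_word2000_pua(ch):
    """Return the category of a single character, or None if it lies
    outside all of the ranges listed above."""
    cp = ord(ch)
    for lo, hi, label in WORD2000_PUA_RANGES:
        if lo <= cp <= hi:
            return label
    return None


def precomposed_slots():
    """Count the code points available in the two precomposed ranges."""
    return sum(
        hi - lo + 1
        for lo, hi, label in WORD2000_PUA_RANGES
        if label == "precomposed syllable"
    )


print(classify_word2000_pua("\uE0BC"))
print(precomposed_slots())
```

The two precomposed ranges together hold 5299 code points, which is
consistent with the "about 5000" figure.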

JS> The last sentence is a bit confusing. Did you mean MS Word 2000
JS> supports them all of ~ 1.6M syllables? How about 'incomplete
JS> syllables' (*both* in modern Korean and Middle Korean) like
JS> 'LCF + V', 'LCF + V + T', 'L + MVF + T', 'LCF + MVF + T' where LCF
JS> and MVF denote the leading consonant filler and the medial vowel
JS> filler? If you're using Uniscribe and U+1100 Hangul Jamos to compose
JS> Hangul Jamos with opentype data present in fonts, I don't see why
JS> these cannot be supported in MS Word 2002.

See above for the total number of supported characters and the method
used by MS Word2000.

Incomplete syllables are not supported for modern Hangul, but they are
already supported for Old Hangul in MS Word2000 and 2002. MS Word2000
uses pre-defined Old Hangul jamos in the PUA, and MS Word2002 uses
U+1100 Hangul Jamos to compose Old Hangul, including incomplete
syllables. You can make incomplete syllables in either MS Word2000 or
MS Word2002.
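
For the U+1100 representation, an incomplete syllable is a jamo
sequence in which a filler stands in for the missing part. The sketch
below only builds such sequences as plain Unicode strings; it says
nothing about how Word or Uniscribe handle them internally. The helper
names are hypothetical. The two fillers are the standard ones in the
Hangul Jamo block (U+115F and U+1160).

```python
import unicodedata

LCF = "\u115F"  # HANGUL CHOSEONG FILLER (leading consonant filler)
MVF = "\u1160"  # HANGUL JUNGSEONG FILLER (medial vowel filler)


def incomplete_syllable(leading=None, vowel=None, trailing=None):
    """Build a jamo sequence, using a filler for a missing L or V."""
    seq = (leading or LCF) + (vowel or MVF)
    if trailing:
        seq += trailing
    return seq


def describe(seq):
    """List each code point with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq]


# 'LCF + V': a vowel with no leading consonant
print(describe(incomplete_syllable(vowel="\u1161")))
# 'L + MVF + T': leading and trailing consonants with no vowel
print(describe(incomplete_syllable(leading="\u1100", trailing="\u11AB")))
```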

JS> As for 125 leading consonants, 95 medial vowels, and 141 trailing
JS> consonants, how do you represent them in Unicode? As of Unicode 3.1,
JS> U+1100 Jamo block defines 90/66/82 Jamos (this is where my ~ 500k
JS> syllable count came from as opposed to ~ 1.6M ) and it doesn't even
JS> have room for all 125/95/141 in U+1100 block.

As I mentioned above, these Jamos were defined in the PUA in MS
Word2000. MS Word2002, however, uses just the Hangul Jamo in Unicode
(U+1100) instead of the PUA. The Uniscribe engine shipped with MS
Word2002 can combine multiple Jamos to make a character. For example,
if you want to make 'U + YEO' (U+116E + U+1167), you just input U+116E
and U+1167. The Uniscribe engine then generates 'U + YEO' from the two
jamos (U+116E and U+1167). In this case, if you generate an HTML file,
'U + YEO' will be represented with 4 bytes (U+116E and U+1167). Because
of this, the browser must have a character-combining capability. IE6.0
has this.
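
To make the byte count concrete, here is a small sketch that measures
the 'U + YEO' sequence in two common encodings. The 4-byte figure
matches UTF-16 (two 16-bit code units); the same two code points take
6 bytes in UTF-8. Which encoding an exported HTML file actually uses is
not stated here, so treat the encodings as assumptions. The function
names are made up for illustration.

```python
U_YEO = "\u116E\u1167"  # HANGUL JUNGSEONG U + HANGUL JUNGSEONG YEO


def encoded_sizes(text):
    """Return the size of text in code points and in two encodings."""
    return {
        "code points": len(text),
        "utf-16 bytes": len(text.encode("utf-16-be")),  # no BOM
        "utf-8 bytes": len(text.encode("utf-8")),
    }


def as_html_ncr(text):
    """Spell the text as HTML numeric character references."""
    return "".join(f"&#x{ord(c):04X};" for c in text)


print(encoded_sizes(U_YEO))
print(as_html_ncr(U_YEO))
```

Either way the file carries two separate jamos, so the renderer has to
combine them when drawing the vowel.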

JS> Again, how do you represent 255 Gugyeol characters in both MS Word
JS> 2000 and 2002? Do you use PUA here again?

Yes. GooGyeul is assigned to the PUA in both MS Word2000 and MS Word2002
because it is not in Unicode yet. If you want to use GooGyeul in your
document and publish it on the Web, you need the proper fonts, too.
MS Word implemented GooGyeul and its fonts, and published them. Of
course, you can also get these font files from http://www.korean.go.kr.

JS> Have you tried html docs with Old Hangul represented in U+1100 Jamos
JS> other than those produced by MS Word 2000/2002? I have a very simple
JS> test page at <http://jshin.net/~jungshik/i18n/middle.html>. Could
JS> you try that?

Yes, I tried it. If you make an HTML document using MS Word2002 with Old
Hangul represented in U+1100 Jamos, you should use IE6.0 or later.
IE5.x doesn't support the character composition feature.

JS> Can I say that both MS IE (6.0) and MS Word 2002 use the same
JS> underlying rendering engine (Uniscribe 1.0)? Or, is there any other
JS> way to install Uniscribe (which I guess is included in MS Windows
JS> 2000 and XP by default)? If that's the case, would installing MS IE
JS> 6.0 on a computer running Windows 9x/ME/NT 4 make related Uniscribe
JS> APIs available to programs other than MS IE and MS Word 2002?

Yes. Both MS Word2002 and WindowsXP ship the Uniscribe engine (Windows
XP will have a more recent version). If you install MS Word2002 on
Windows98 or later, or on NT4.0 or later, Word2002 installs it (MS
Windows95 is not supported). Standalone IE6.0 may not include Uniscribe.

JS> As I wrote above, I'm wondering how 'Old Hangul' and 'Gugyeol' are
JS> represented when a MS Word file is exported into a html file? Can
JS> you send a sample MS Word file (e.g. Too-shi Eon-hae, either 1st
JS> edition -Choganbon- or 2nd edition - Chungganbon) and html file
JS> (produced by MS Word 2000 and 2002) off-line?

Again, if you create Old Hangul and GooGyeul text using MS Word and
generate an HTML file, you should distribute the font files that you
used so that Old Hangul and GooGyeul display properly in the browser.
If you want some Old Hangul samples, please install the PlusPack CD from
MS Word2000 or MS Word2002. MS Word ships 13 Old Hangul sample docs. I
will send you Too-Shi Eon-hae in a separate mail.

Thanks
SeuksooS

-----Original Message-----
From: Jungshik Shin [mailto:jshin@mailaps.org]
Sent: Tuesday, July 17, 2001 10:52 AM
To: Seuk Soo Sung
Cc: Chris Pratley; Unicode Mailing List
Subject: RE: Wordprocessors in Korean

On Mon, 16 Jul 2001, Seuk Soo Sung wrote:

  Thank you for your reply with details.

> First of all, sorry for the long mail. If you are not interested in
> Korean wordprocessors, please stop here.

   Here are a few more questions (related with Unicode).

SS> not true at all. MS Word2K support full modern Hangul character set
SS> (11,172 chars) and almost 1.6M Old Hangul. Table feature was

  I also read some reports available at 'Non-standard character
registration center' (I found the page at http://www.sejong.or.kr and
http://www.korean.go.kr) and one of them (dated Dec. 2000) has more or
less the same information about MS Word 2000 as you gave. However,
that's different from what Chris Pratley wrote about MS Word 2000,
according to which MS Word 2000 supports only ~ 5000 'frequently used'
(or found in old literature) syllables and some 200 Gugyeol characters
using PUA.

CP> Word2000 does it using an add-in that uses the Unicode PUA to
CP> support about 5000 Old Hangul pre-composed glyphs.

 You also wrote the following. I'm wondering if PUA is large enough to
encode all ~1.6M syllables?

SS> A. MS Word2000
SS> MS Word2000 used PUA area in Unicode to implement Old Hangul and
SS> GooGyeul (actually code assignment for these characters) and MS

 Moreover, according to the report aforementioned, a html file exported
by MS Word 2000 (with Middle Korean ) can be rendered perfectly well
with Netscape 6.0 under MS-Windows 98SE. I believe this is another
indication that PUA is being used to represent Middle Korean in MS-Word
doc and html file exported from it because otherwise Netscape 6.0 (which
didn't use Uniscribe in December 2000 when the report was written as far
as I know) would not have been able to render Middle Korean correctly.

 I also looked into a font (New Gulim) I downloaded from
<http://www.officeupdate.com/korea/2000/articles/weboldhg.htm> (I found
the link to this page at <http://www.korean.go.kr/user_env.html>) and
indeed it looks like it's using PUA to represent about 5000 Old Hangul
syllables and ~250 gugyeol characters. On the other hand, the other font
I downloaded from the same page (Old Gulim) appears to have opentype
data for composing glyphs for syllables out of glyphs for Jamos.

   Which is the case? Does MS Word 2000 support all 1.6M syllables or
just a subset (about 5000)? Perhaps, about 5000 of them are supported
with pre-composed glyphs in PUA and the rest are supported by composing
glyphs with opentype data. Am I right?

SS> 1. Supported characters
SS> and CJK Hanja characters (U+3400 ~ U+9FA5). Second is Old Hangul
SS> which is not used now but exists in old written documents since 1446
SS> (Some Koreans still use Old Hangul for special purposes, even
SS> though it is not a standard right now in Korea). The number of Jamo
SS> in Old Hangul is much larger than in modern Hangul. According to
SS> our research with the National Language Research Institute, we
SS> defined 125 leading consonants, 95 vowels, and 141 trailing
SS> consonants for Old Hangul (of course, including modern Hangul
SS> because Old Hangul is a proper superset of modern Hangul).
SS> Theoretically, we can make 1,686,250 (LV type = 125*95, LVT type =
SS> 125*95*141) characters with these Jamo combinations. MS Word
SS> supports all of them.
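
The arithmetic behind these figures can be checked with a few lines.
This is a plain calculation sketch: the function name is made up, the
125/95/141 and 90/66/82 inventories are the ones quoted in this
thread, and 19/21/27 is the modern Hangul inventory behind the 11,172
precomposed syllables.

```python
def syllable_count(leading, vowels, trailing):
    """Count LV plus LVT combinations for a given jamo inventory."""
    lv = leading * vowels
    lvt = leading * vowels * trailing
    return lv + lvt


old_hangul = syllable_count(125, 95, 141)  # inventory defined with NLRI
unicode_31 = syllable_count(90, 66, 82)    # jamos in the U+1100 block
modern = syllable_count(19, 21, 27)        # modern Hangul

print(old_hangul)  # 1686250
print(unicode_31)  # 493020
print(modern)      # 11172
```

The second figure is where an estimate of roughly 500k syllables for
the Unicode 3.1 jamo set comes from, as opposed to roughly 1.6M.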

   The last sentence is a bit confusing. Did you mean MS Word 2000
supports them all of ~ 1.6M syllables? How about 'incomplete syllables'
(*both* in modern Korean and Middle Korean) like 'LCF + V', 'LCF + V +
T', 'L + MVF + T', 'LCF + MVF + T' where LCF and MVF denote the leading
consonant filler and the medial vowel filler? If you're using Uniscribe
and U+1100 Hangul Jamos to compose Hangul Jamos with opentype data
present in fonts, I don't see why these cannot be supported in MS Word
2002.

  As for 125 leading consonants, 95 medial vowels, and 141 trailing
consonants, how do you represent them in Unicode? As of Unicode 3.1,
U+1100 Jamo block defines 90/66/82 Jamos (this is where my ~ 500k
syllable count came from as opposed to ~ 1.6M ) and it doesn't even have
room for all 125/95/141 in U+1100 block. Let's take an example of a
medial vowel, 'U + YEO' (U+116E + U+1167) or 'U + I + EO' ( U+116E +
U+1175 + U+1165 ). It's not in U+1100 block, but I'm sure it's available
in MS Word 2000 (or 2002) because that particular vowel is used everyday
by modern Korean speakers and is also among 5 new vowels requested for
addition to ISO 10646 by DPRK (see
<http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2243.pdf>). Internally, I guess
you can do whatever you want to do. What I'm curious about is how it's
represented in html (or xml) when you export a MS Word document to a
html (or xml) file. Do you use 'U + YEO' or 'U + I + EO'? If that's the
case, does it mean Uniscribe (and thus MS IE 6.0 if not MS IE 5.5) can
render 'U + YEO' as 'WYEO' given a font with necessary opentype data?

> The third one is GooGyeul, which was used in
> old written documents. GooGyeul glyphs look similar to Hanja
> characters, but they have been used in Korea in a very different way
> for a very long time. MS Word also implemented 255 GooGyeul characters
> in cooperation with the National Language Research Institute and The
> Academy of Korean Studies.

  Again, how do you represent 255 Gugyeol characters in both MS Word
2000 and 2002? Do you use PUA here again? Or, do you just 'recycle' CJK
ideographs because most, if not all, Gugyeol characters are already
included in four CJK ideograph blocks and CJK radical block in Unicode
(if their shapes are the same as existing CJK ideographs, I guess it's
hard to make a case for encoding them separately.)? Actually, internal
representation is less of my concern than the way they're represented
when they're exported into other formats such as html.

  For others on the list, I put up a screenshot of glyphs for Gugyeol in
'New Gulim' at <http://jshin.net/~jungshik/i18n/gugyeol.png>. Most of
them look like Chinese characters and radicals encoded in Unicode 3.x
and don't need to be encoded (I think Unihan-3.1.txt has some
cross-references to Gugyeol characters), but a few of them might need to
be encoded separately. Perhaps, I have to begin a new thread on this
issue.

> 2. How implemented in MS Word

> B. MS Word2002
> MS Word2002 implemented Old Hangul using Uniscribe engine. MS Word2002
> also shipped Cicero input tool for Old Hangul input. Word2002 provided
> new OpenType font files for Old Hangul glyph composition and used the
> Jamos in U+1100 to compose Old Hangul characters directly using
> Uniscribe engine.

> 3. Compatibility with IE
> Yes, we tested it on MS Windows2000 and WindowsXP with IE6.0. IE6.0
> has a capability to display Old Hangul characters which were made by
> Word2000 and Word2002.

  Have you tried html docs with Old Hangul represented in U+1100 Jamos
other than those produced by MS Word 2000/2002? I have a very simple
test page at <http://jshin.net/~jungshik/i18n/middle.html>. Could you
try that?

  Can I say that both MS IE (6.0) and MS Word 2002 use the same
underlying rendering engine (Uniscribe 1.0)? Or, is there any other way
to install Uniscribe (which I guess is included in MS Windows 2000 and
XP by default)? If that's the case, would installing MS IE 6.0 on a
computer running Windows 9x/ME/NT 4 make related Uniscribe APIs
available to programs other than MS IE and MS Word 2002? For instance,
can I use functions like 'ScriptShape' (in Uniscribe) to get a list of
glyphs for Unicode string with Hangul Jamos (as long as fonts with
opentype data for combining Jamos are available) in my program (say, a
simple text editor or terminal emulator) ? Does 'ScriptShape' work for
combinations like 'U+1101 U+116E U+1167 U+11BB' or 'U+1101 U+116E +
U+1175 + U+1165 U+11BB'? Maybe, it depends on opentype data included in
fonts. Then, the question would be if this kind of info. is included in
opentype data in fonts for Old Hangul that come with MS Word 2002.
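
Independently of what 'ScriptShape' and a given font do, one can at
least check whether such a sequence has the shape of a single syllable
block, i.e. one or more leading consonants, one or more vowels, then
zero or more trailing consonants. The sketch below does only that
check. It is not Uniscribe and makes no claim about rendering; the
function names are made up, and the ranges are the leading, vowel, and
trailing sub-ranges of the U+1100 Hangul Jamo block, including
positions that are unassigned in Unicode 3.1.

```python
import re

# L+ V+ T* over the three sub-ranges of the Hangul Jamo block. The
# fillers U+115F and U+1160 fall inside the L and V ranges.
SYLLABLE = re.compile(r"[\u1100-\u115F]+[\u1160-\u11A7]+[\u11A8-\u11FF]*")


def jamo_class(ch):
    """Return 'L', 'V', or 'T' for a conjoining jamo, else None."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    return None


def is_single_syllable(text):
    """True if text is exactly one L+ V+ T* jamo sequence."""
    return SYLLABLE.fullmatch(text) is not None


# The two sequences asked about above
for seq in ("\u1101\u116E\u1167\u11BB", "\u1101\u116E\u1175\u1165\u11BB"):
    classes = "".join(jamo_class(c) or "?" for c in seq)
    print(classes, is_single_syllable(seq))
```

Both sequences pass this structural check, so whether they render as
one syllable is left to the shaping engine and the font's opentype
data.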

> If you are using English Word2002 and Cicero input
> tool, and want to use Old Hangul, then you just need to select
> "Microsoft Korean Old Hangul input" in Cicero toolbar and install some
> additional fonts that have OpenType data to handle a proper Jamo
> combination with Uniscribe engine from Korean version of MS Word2002
> PlusPack CD. Please contact me, if you want to have these fonts
> (mail to me directly). GooGyeul input tool is still ActiveX add-in and
> can be installed from Korean PlusPack CD.

 As I wrote above, I'm wondering how 'Old Hangul' and 'Gugyeol' are
represented when a MS Word file is exported into a html file? Can you
send a sample MS Word file (e.g. Too-shi Eon-hae, either 1st edition -
Choganbon- or 2nd edition - Chungganbon) and html file (produced by MS
Word 2000 and 2002) off-line?

   Once again, thank you for your kind reply and great work to support
Middle Korean,

    Jungshik Shin



This archive was generated by hypermail 2.1.2 : Thu Jul 26 2001 - 23:26:24 EDT