Re: Understanding the Hangul mapping tables

From: Jungshik Shin (jshin@pantheon.yale.edu)
Date: Sun Dec 07 1997 - 10:59:29 EST


On Wed, 3 Dec 1997, Seong-Woong Kim wrote:

> Tim Greenwood wrote
>
> > Column 3 from the Hangul file matches column 1 from the Ksc5601 file -
> > this is reasonable since they are both labeled 'Unified Hangul'. How does
> > this relate to the Column 4 Johab, which is labeled as KSC5601-1992 ?

  UHC is an encoding that is upward-compatible with EUC-KR (Korean EUC),
but Johab is a totally different encoding from both EUC-KR and UHC.
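
  As a rough illustration of that relationship, here is a minimal sketch
using Python's bundled codecs. It assumes that the codec names "euc_kr",
"cp949" and "johab" correspond to EUC-KR, UHC and Johab as discussed here;
the variable names are only for illustration.

```python
# One syllable that is among the 2,350 in KS C 5601, written as an escape.
ga = "\uac00"  # HANGUL SYLLABLE GA

euc = ga.encode("euc_kr")
uhc = ga.encode("cp949")    # assumed to be Python's name for UHC
johab = ga.encode("johab")

# Print the three byte sequences side by side.
print(euc.hex(), uhc.hex(), johab.hex())

# Upward compatibility: bytes that are valid EUC-KR should decode to the
# same text when read as UHC.
sample = "\ud55c\uae00 ABC".encode("euc_kr")
same_under_uhc = sample.decode("cp949") == sample.decode("euc_kr")
print(same_under_uhc)
```

  The EUC-KR and UHC bytes should agree for this syllable, while the Johab
bytes should not.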

  It's misleading to label Column 4 as KS C 5601-1992 in
Hangul.txt. To avoid this kind of confusion,
it should have been labeled simply as Johab
encoding (or KS C 5601-1992:Annex 3).
                            ^^^^^^^

   The following lines in KSC5601.txt on the Unicode CD-ROM and at the
Unicode ftp site are also misleading. The table is NOT for KS C 5601-1992
to Unicode BUT for UHC to Unicode. (UHC, Unified Hangul Code, is NOT
specified in any Korean national standard document; it is just a
proprietary encoding by Microsoft.) Accordingly, the file containing the
UHC to Unicode mapping should be renamed UHC.txt (or MSCP949.txt, as
MS calls it Code Page 949). Moreover, there should be a separate mapping
table to Unicode 2.0 for KS C 5601-1992 (designated as G1 and invoked
on GR) with only 94x94 characters, and/or for EUC-KR (with 94
US-ASCII/ISO-646/KS C 5636 characters in the 1-byte range and 94x94
KS C 5601-1992 characters in the 2-byte range). One can easily make such
a table, and I can supply one if necessary.

---------
The Ksc5601 file starts -

# Name: Unified Hangeul(KSC5601-1992) to Unicode table
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Unicode version: 2.0
# Table version: 1.0
# Table format: Format A
# Date: 07/24/95
# Authors: Lori Hoerth <lorih@microsoft.com>
# K.D.Chang <a-kchang@microsoft.com>
------------------------

> > Ken Lunde's book describes the byte range for Ksc5601-1992 as A1-FE for
> > both bytes.

 Well, I think he wrote more than that. He clearly distinguishes a
character set (coded character set) from an encoding. KS C 5601-1987 or
KS C 5601-1992 is not so much a definition of an encoding as a definition
of a coded character set (conforming to ISO-2022) of 94x94 2-byte
characters. There can be many different encodings for KS C
5601-(1987|1992) and, optionally, other coded character set(s). The most
widely used one in Korea on all three major platforms (MS-DOS/Windows,
Mac and Unix), with some extensions on Mac and MS-DOS/Windows in the C1
range, is EUC-KR (Korean EUC). It designates the 1-byte 94-character set
ISO-646/KS C 5636/US-ASCII as G0 and the 2-byte 94x94 character set
KS C 5601-1987 as G1, and invokes G0 into GL and G1 into GR. Another
encoding for KS C 5601-1987 along with ISO-646/KS C 5636/US-ASCII, used
exclusively in Internet mail exchange, is ISO-2022-KR (see RFC 1557).

  Thus, you can't say the byte range for KS C 5601-1992 is A1-FE. Its
byte range is that of GR (Graphic Right: A1-FE) only in encodings where
it is designated as Gx (x=1,2,3) and Gx is invoked onto GR. One of those
encodings is EUC-KR (Korean EUC). In other encodings (e.g. ISO-2022-KR),
KS C 5601-1987 is designated as G1 (with the designator 01/11 02/04 02/09
04/03), but G1 is invoked not onto GR (A1-FE) but onto GL (21-7E) with
the locking shift SO (00/14).
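
  To make the difference concrete, here is a small sketch that builds the
bytes for one and the same row/cell position of the 94x94 table under both
invocations. The function names are hypothetical. The trailing SI (00/15)
that shifts back to G0 is not mentioned above and is my assumption from
RFC 1557, as is the mapping of Python's "euc_kr" and "iso2022_kr" codecs
to these two encodings.

```python
def euc_kr_bytes(row, cell):
    """G1 invoked into GR: each byte is 0xA0 plus the row/cell index."""
    return bytes([0xA0 + row, 0xA0 + cell])


def iso2022_kr_bytes(row, cell):
    """G1 invoked into GL by SO: each byte is 0x20 plus the row/cell index."""
    designator = b"\x1b$)C"   # 01/11 02/04 02/09 04/03
    shift_out = b"\x0e"       # SO, 00/14
    shift_in = b"\x0f"        # SI, 00/15 (assumed, per RFC 1557)
    return designator + shift_out + bytes([0x20 + row, 0x20 + cell]) + shift_in


row, cell = 16, 1             # first Hangul syllable position in the table
as_euc = euc_kr_bytes(row, cell)
as_2022 = iso2022_kr_bytes(row, cell)

# The same character comes out of both byte sequences.
same_char = as_euc.decode("euc_kr") == as_2022.decode("iso2022_kr")
print(as_euc.hex(), as_2022.hex(), same_char)
```

  The character set position is identical in both cases; only the byte
values that carry it differ.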

> > The range in the Unified Hangul tables is 81-FD for byte 1 and 41-FF
> > for byte 2.
> > How does it all fit together? What are the actual codes that a Korean
> > browser will emit ?
>
> Korean Standard KS C 5601 - 1987 only included Wansung Hangul.
> In 1992 the Korean government revised it to KS C 5601 - 1992.
> KS C 5601 - 1992 newly included Johab Hangul as Annex 3.
>
> So, Ken's information is partly right.
> The byte range for KS C 5601 - 1992 - Wansung is A1-FE for both bytes.
> But the byte range for KS C 5601 - 1992 Annex 3 - Johab is
>
>                     First byte range   Second byte range
> Hangul:             84h-D3h            41h-7Eh, 81h-FEh
> User-defined area:  D8h                31h-7Eh, 91h-FEh
> Etc.:               D9h-DEh            31h-7Eh, 91h-FEh
> Hanja:              E0h-F9h            31h-7Eh, 91h-FEh
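
  The ranges quoted above can be turned into a small lookup. This is only
a sketch of the table as listed, not a validator for Johab itself; the
names JOHAB_AREAS and classify_johab are hypothetical.

```python
def in_ranges(value, ranges):
    """True if value falls inside any inclusive (low, high) pair."""
    return any(low <= value <= high for low, high in ranges)


# (area name, first-byte ranges, second-byte ranges), as quoted above.
JOHAB_AREAS = [
    ("Hangul",       [(0x84, 0xD3)], [(0x41, 0x7E), (0x81, 0xFE)]),
    ("User-defined", [(0xD8, 0xD8)], [(0x31, 0x7E), (0x91, 0xFE)]),
    ("Etc.",         [(0xD9, 0xDE)], [(0x31, 0x7E), (0x91, 0xFE)]),
    ("Hanja",        [(0xE0, 0xF9)], [(0x31, 0x7E), (0x91, 0xFE)]),
]


def classify_johab(first, second):
    """Return the area name for a two-byte code, or None if outside the table."""
    for name, first_ranges, second_ranges in JOHAB_AREAS:
        if in_ranges(first, first_ranges) and in_ranges(second, second_ranges):
            return name
    return None


print(classify_johab(0x88, 0x61))
print(classify_johab(0xE0, 0x31))
```

  Note that the first-byte range for Hangul starts at 84h, i.e. below A0h,
which matters for the ISO-2022 discussion.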

  IMHO, an 'annex' is just an 'annex' and nothing more. Accordingly, Johab
should NOT be considered as included in KS C 5601-1992, let alone in KS C
5601-1987. In this light, Ken is correct, and both KS C 5601-1987 and KS
C 5601-1992 are compliant with ISO-2022 (and identical to each other).
If Johab were a part of KS C 5601-1992, KS C 5601-1992 would not be
compliant with ISO-2022, because Johab uses the C1 range for graphic
characters.

> As we know, Unified Hangul was from Microsoft. It is MS's own
> extension including Wansung Hangul that enables us to use
> 11,172 Hangul characters.

  UHC is not a Korean nat'l standard, but just a proprietary encoding
scheme devised by Microsoft. As mentioned above, UHC should not be
confused with KS C 5601-1992, nor with any specific encoding that combines
it with KS C 5636-1992/ISO 646/US-ASCII, such as EUC-KR, although it's
fully upward-compatible with EUC-KR (it's an extension to EUC-KR, but it's
not compliant with ISO-2022). See section 3.3.17 of Ken's CJK.inf.
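
  One visible symptom of the ISO-2022 problem is graphic characters that
use bytes from the C1 range (80-9F). The sketch below looks for such bytes
in a few encoded samples. It assumes Python's "cp949" codec stands for
UHC, and the helper name is hypothetical. It only checks the C1 range, so
it is not a full ISO-2022 conformance test.

```python
def c1_bytes(data):
    """Return the bytes of data that fall in the C1 range 0x80-0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]


ga = "\uac00"    # a syllable that is in KS C 5601
ttom = "\ub620"  # a syllable that is not among the 2,350 in KS C 5601

euc_c1 = c1_bytes(ga.encode("euc_kr"))    # KS C 5601 invoked into GR
johab_c1 = c1_bytes(ga.encode("johab"))   # Johab
uhc_c1 = c1_bytes(ttom.encode("cp949"))   # UHC extension area

print(len(euc_c1), len(johab_c1), len(uhc_c1))
```

  EUC-KR should show no C1 bytes here, while the Johab sample and the UHC
extension sample should each show at least one.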

> Wansung Hangul and Johab Hangul are different.
> They have same characters(Hangul, Hanja, Etc.) on different
> code points.

  The number of characters covered is different, as you implied above.
KS C 5601-1987 has only 2,350 Hangul syllables, while Johab has 11,172
of them (all possible combinations in modern Korean), as does Unicode
2.0/KS C 5700. Ken's CJK.inf (section 3.3.5) explains the Johab encoding
in detail.
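
  The two counts can be checked with a short sketch. It relies on facts
not stated above: that the 11,172 combinations are 19 initial consonants
x 21 vowels x 28 finals (27 plus "none"), and that Unicode 2.0 places them
at U+AC00 through U+D7A3. It also assumes that a syllable belongs to KS C
5601 exactly when Python's "cp949" codec encodes it with both bytes in
A1-FE.

```python
# All precomposed Hangul syllables in Unicode 2.0.
SYLLABLES = [chr(cp) for cp in range(0xAC00, 0xD7A4)]


def is_gr_pair(data):
    """True if data is two bytes, both in the GR range 0xA1-0xFE."""
    return len(data) == 2 and all(0xA1 <= b <= 0xFE for b in data)


combinations = 19 * 21 * 28   # initials x vowels x finals (including none)
total = len(SYLLABLES)

# Syllables that sit inside the 94x94 table of KS C 5601.
in_ksc5601 = sum(is_gr_pair(s.encode("cp949")) for s in SYLLABLES)

# Syllables that Johab can encode as a two-byte code.
in_johab = sum(len(s.encode("johab")) == 2 for s in SYLLABLES)

print(combinations, total, in_ksc5601, in_johab)
```

  If the codec tables match the figures above, the third number should be
2,350 and the others 11,172.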

                                    

   Hope this clarifies things,

     Jungshik Shin


