Re: Korean language support and other Far Eastern Questions

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Tue Apr 25 2000 - 23:33:19 EDT


At 11:43 AM -0800 4/25/2000, Sarasvati wrote:
>Unicoders...
>
>I am forwarding on behalf of Mary Ink, a post that got side-tracked
>this morning.
>
> -- Sarasvati

Thanks.

> >From: "mary ink" <maryink@hotmail.com>
>To: unicode@unicode.org
>Subject: Korean language support and other Far Eastern Questions
>Date: Tue, 25 Apr 2000 19:26:04 GMT
>
>Doing research, as you might infer, on how Unicode handles Korean in the
>technical sense and how Unicode has handled the Far Eastern languages in a
>political sense. Any facts or views welcome.

The whole question of Korean is Unicode Far Eastern FAQ #2. (Question
#1 is about unification.) It has been hashed out many times on this
list.

>Why are some 11,171 places allocated to Hangul Syllables when the language
>system is made up of only 19 consonants (ja-um) and 21 single and combined
>vowels (mo-um)?

An official Korean standard mandates them: 19 leading consonants x 21
vowels x 28 optional trailing consonants (including "no trailing
consonant") gives 11,172 precomposed syllables. You can't blame it on
Unicode, Inc., or anyone here. We were against it. :)
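The count falls straight out of the composition arithmetic used by the
Unicode Hangul syllable algorithm; a minimal Python sketch:

```python
# Hangul syllable composition, per the Unicode standard's formula:
#   syllable = 0xAC00 + (lead * 21 + vowel) * 28 + tail
# 19 leads x 21 vowels x 28 tails (27 real + "no tail") = 11,172 syllables.
LEADS, VOWELS, TAILS = 19, 21, 28

def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Map jamo indices to the precomposed syllable character."""
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

print(LEADS * VOWELS * TAILS)   # 11172 code points, U+AC00..U+D7A3
print(compose(18, 0, 4))        # lead HIEUH, vowel A, tail NIEUN -> '한'
```

The indices here (18, 0, 4) are the standard algorithm's jamo indices,
not the letters' positions in dictionary order.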

>If the syllables could be made up from their constituent
>parts they wouldn't require double bytes, no?

Eh? You mean we could use 8-bit codes for them? That's what we're
trying to get away from.

It is true that Korean can be written alphabetically, using jamo
only. Then it has to be transformed into syllables for rendering. It
turns out that several thousand syllabic glyphs are needed. Unicode
tries to encode characters rather than glyphs, but has had to add a
substantial number of precomposed syllable characters for
compatibility with existing standards.
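The alphabetic-to-syllabic transformation is what Unicode
normalization does with conjoining jamo; a quick Python check:

```python
import unicodedata

# A conjoining-jamo sequence: U+1112 CHOSEONG HIEUH,
# U+1161 JUNGSEONG A, U+11AB JONGSEONG NIEUN.
jamo = "\u1112\u1161\u11AB"

# Canonical composition (NFC) folds it into one precomposed syllable.
syllable = unicodedata.normalize("NFC", jamo)
print(syllable, hex(ord(syllable)))   # 한 0xd55c
```

The reverse (NFD) decomposes the syllable back into the same jamo
sequence, so either spelling is equivalent plain text.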

>Hangul letters are arranged in combinations of left to right and top to
>bottom depending on the shape and orientation of the vowel. Each arrangement
>of letters composes a syllable. So that syllables remain in proportion to
>each other, the shape and size of the letters within the syllable are
>modified. How do character display systems and coding standards such as
>Unicode handle these non-linear letter combinations and relative changes in
>letter shape?

Usually through Korean fonts containing precomposed syllable glyphs.
Automatic shaping rules have been tried, but are widely agreed to
give unsatisfactory results.

>How do the Hangul Compatibility Jamo characters 12592-12687 needed for
>compatibility with KSC 5601 encoding relate to the Hangul Syllables
>44032-55203?

Hex, please. Nobody here uses decimal for Unicode code points. Let's
see, the ranges for Korean are

U+3130-318F Hangul Compatibility Jamo
U+AC00-D7A3 Hangul Syllables

The range U+3131-3163 is for modern Korean, and includes all of the
jamo that occur in the syllabic characters. U+3165-318E are
historical.
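The two blocks are linked through compatibility decompositions: each
compatibility jamo maps to a conjoining jamo from the U+1100 block.
A small Python illustration:

```python
import unicodedata

# U+3131 HANGUL LETTER KIYEOK, from the compatibility block that
# mirrors the KS C 5601 jamo repertoire.
compat = "\u3131"

# Its compatibility decomposition (NFKD) is the conjoining jamo
# U+1100 HANGUL CHOSEONG KIYEOK.
conjoining = unicodedata.normalize("NFKD", compat)
print(hex(ord(conjoining)))   # 0x1100
```

Unlike the conjoining jamo, the compatibility letters never combine
into syllables; they are standalone spelling characters.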
 

>Olle Jarnefors explained in "A short overview of ISO/IEC 10646 and Unicode"
>http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html (1996) that
>"ISO/IEC 10646 and Unicode remove some assumptions made about plain
>text, which simplify implementations but are untenable in multilingual
>text and in monolingual text in some languages: Characters cannot be
>identified with glyphs. Different graphic forms to be used in different
>situations are
>needed for some characters, e.g. Arabic letters". I find these statements
>impenetrable. Does he mean that Unicode considers the character
>independently of its appearance

Yes.

>and therefore is capable of handling text
>elements that change in appearance relative to position as they do in some
>languages including Korean?

Exactly. In principle this could be applied to Korean.

>Or does he mean the opposite?
>
>I understand that Unicode supports multidirectional text and overlapping or
>composite characters. Can it then handle the special multidirectional and
>composite character of the Hangul writing system?

Unicode does not cover rendering issues directly, but the answer is
Yes, text in Unicode encoding can be rendered according to the
requirements of the writing system.

>How international is the Unicode consortium?

Completely. All relevant national and international standards bodies
have taken part, along with universities and libraries around the
world. Specialists in each script are consulted, and working groups
formed from those most affected and those most knowledgeable.

>Lists of member companies I've
>seen are predominantly American.

Those are the implementors, since the software industry is
overwhelmingly U.S.-based.

>Has this had any bearing on how character
>codes have been standardized?

No. Unicode is not based on corporate character sets and encodings
(that includes IBM EBCDIC, Microsoft code pages, Adobe font
encodings, and others from Apple, Xerox, various UNIX vendors, or any
of the typesetter manufacturers or font houses) beyond making sure
that the characters in those encodings have a place in Unicode.

>How have national encoding standards been incorporated into Unicode and ISO
>UCS standards?

Fairly directly. Every code point in the national standards has a
corresponding code point in Unicode, so that translation from that
character code to Unicode and back (round-tripping) is guaranteed to
return the original character sequence. However, characters that are
encoded in more than one standard are unified at the same Unicode
code point wherever practical.
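Round-tripping is easy to demonstrate with any legacy codec; here is a
sketch using Python's EUC-KR codec (one encoding of the KS C 5601
repertoire) as the example national standard:

```python
# Round-trip guarantee: legacy encoding -> Unicode -> legacy encoding
# must return the original byte sequence.
legacy_bytes = "한글".encode("euc-kr")   # bytes in the national encoding
text = legacy_bytes.decode("euc-kr")     # translated into Unicode
assert text.encode("euc-kr") == legacy_bytes
print(text)   # 한글
```

The same check works with Shift-JIS, GB 2312, Big5, and the other
legacy East Asian encodings Python ships codecs for.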

>Why and how was Unicode developed separately from ISO UCS standards and then
>agreed upon later? How do the 2 standards differ?

There is a whole appendix in The Unicode Standard on this history.
The projects started separately and then agreed to coordinate. They
encode the same characters at matching code points, but Unicode
specifies much more about character properties.

>There seems to be a privileging of ASCII characters in UTF-8 in that they
>require fewer bits at the expense of "less common" characters. Has there
>been any discussion about what seems to be an inequitable compromise?

Tons.

It was the only way to grandfather in the tens of millions of ASCII
texts on the Net and the billions more in files not on the Net. It
has been suggested that the extra space required for text in many
languages is not a serious inequity, now that hard drives are running
at about one cent per megabyte, and CD-Rs cost 8 cents each in packs
of 50.
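The trade-off is easy to measure directly from UTF-8's variable-length
encoding; a quick Python check:

```python
# UTF-8 lengths: ASCII stays one byte for backward compatibility;
# most other scripts take two or three bytes per character.
for ch in ("A", "é", "한", "中"):
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# A -> 1, é -> 2, 한 -> 3, 中 -> 3
```

So pure-ASCII files are byte-for-byte identical in UTF-8, while
Korean or Chinese text grows relative to its legacy two-byte
encodings.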

>As a letter-based system with relatively few discrete symbols, Hangul is
>very easy to input using a keyboard.

Yes, the layout with the consonants on the left and the vowels on the
right is amazingly easy to learn.

>I understand, however, that the Han
>ideographs depend on clunky composition methods using the keyboard to make
>sound approximations of the character in question, which in turn display a
>choice of characters to select from (source: Elliotte Rusty Harold's XML
>Bible). Are there alternative ways to compose these characters with a
>keyboard based on root forms and strokes, the way they are listed in
>ideograph dictionaries? I realise this is complicated but there must be a
>way around the problem.

There are more than 200 methods for entering Han characters in
Chinese, Korean, Japanese, and Vietnamese. Here are some examples.

Phonetic (roma-kanji, kana-kanji, hangeul-hanja, pinyin-hanzi, zhuyin-hanzi...)
Code table (JIS, KSC, Big5, GB, TC, telegraph...)
Shape based (cangjie, wubi, four corners, radical/stroke...)

Most software for typing in any of these languages provides a
selection of entry methods. A common combination is one each of
 
a native script phonetic system
a code table based on a national standard

and one or more each of

a Latin-alphabet phonetic system (e.g. Wade-Giles + Pinyin)
a shape-based system (e.g. cangjie + four corners)

One-character-at-a-time phonetic conversion is indeed somewhat
clunky. Phrase-based and grammar-based conversions are much faster.
It is generally agreed that phonetic conversion is easiest for
beginners, and shape-based input is fastest for experienced users.
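The candidate-selection step of phonetic conversion can be sketched as
a simple lookup. The table below is a tiny hypothetical sample, not a
real IME dictionary, and real systems rank and filter candidates by
frequency and context:

```python
# Toy model of one-character-at-a-time phonetic conversion:
# type a pinyin syllable, see a numbered candidate list, pick one.
CANDIDATES = {
    "han": ["汉", "韩", "含", "寒"],   # hypothetical sample entries
    "zi":  ["字", "子", "自"],
}

def convert(syllable: str, choice: int) -> str:
    """Return the chosen character for a typed phonetic syllable."""
    return CANDIDATES[syllable][choice]

print(convert("han", 0) + convert("zi", 0))   # 汉字
```

Phrase-based converters replace the per-character pick with a lookup
over whole phrases, which is why they need far fewer keystrokes.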

>Does the Unicode consortium concern itself with
>input technology?

Not as part of the standard. We discuss it here occasionally.

>I have read that Japanese programmers initially criticized UCS
>standardization. I also recall a great deal of furor in Korea over their
>homegrown word processing software, HWP, being buried by MS Word. What if
>any resistance has there been to Unicode and ISO UCS standardization,
>particularly among the language groups it is intended to better serve, and
>how has this been resolved? Has there been national resistance to the
>concept of "unified Han characters"?

No. The cultural view in each country is that the characters came
from China and are still Chinese. This is reflected in the name used
in all of these countries--Han characters, referring to the Chinese
dynasty in which brush calligraphy became the standard form of the
characters. (Previous forms were scratched on oracle bones, cast in
bronze, cut into seals, or written with pens.)

>The character differences seem subtle
>when considered scientifically but surely unifying the language codes of
>such antagonistic groups as Japan, North Korea, South Korea, China and
>Taiwan has been politically volatile.

Politically no, culturally somewhat. The furor has been based on
various misunderstandings of Unicode and on various false
assumptions. For example, people have asserted without checking that
Unicode would force people in one country to use fonts or glyphs from
another. Actually, text in one language can be rendered according to
the standards of that language, just as before. Multilingual
documents can be tagged so that each language is rendered correctly.

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT