Re: General category of Hangul Conjoining Jamos (U+1100 block)

From: Mark Davis (mark@macchiato.com)
Date: Sun May 05 2002 - 11:10:06 EDT


In the early days of Unicode development, there were two models. As an
example, I'll give a complex (non-existent) old Hangul syllable:
TIKEUT + PIEUP + A + EU + TIKEUT + PIEUP.

A. Non-spacing mark model.
With this method, there are base jamo and non-spacing jamo. Base jamo
are all consonants, while non-spacing jamo are both consonants and
vowels. So the structure of a Hangul syllable would have been B N*.
The B values would have been the existing independent jamo; the N
values would have been an additional set of characters. For example:

U+3137 ( ㄷ ) {HANGUL LETTER TIKEUT}
U+1107 ( ᄇ ) {HANGUL NON-SPACING* PIEUP}
U+1161 ( ᅡ ) {HANGUL NON-SPACING* A}
U+1173 ( ᅳ ) {HANGUL NON-SPACING* EU}
U+1103 ( ᄃ ) {HANGUL NON-SPACING* TIKEUT}
U+1107 ( ᄇ ) {HANGUL NON-SPACING* PIEUP}

Notice that the leading PIEUP and the trailing PIEUP are represented
by the same code; however, the first TIKEUT -- being the first
character in the syllable -- is different than the trailing TIKEUT.

B. Conjoining jamo.
This is the current mechanism
(http://www.unicode.org/unicode/reports/tr28/#3_11_conjoining_jamo_behavior).
Syllables are L+ V+ T*. For example:

U+1103 ( ᄃ ) {HANGUL CHOSEONG TIKEUT}
U+1107 ( ᄇ ) {HANGUL CHOSEONG PIEUP}
U+1161 ( ᅡ ) {HANGUL JUNGSEONG A}
U+1173 ( ᅳ ) {HANGUL JUNGSEONG EU}
U+11AE ( ᆮ ) {HANGUL JONGSEONG TIKEUT}
U+11B8 ( ᆸ ) {HANGUL JONGSEONG PIEUP}
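
As a rough illustration of the L+ V+ T* structure, here is a minimal
Python sketch that checks whether a string is a single conjoining-jamo
syllable. The three ranges are an assumption: they are simply the
sub-ranges of the U+1100 block, which are a little wider than the set
of code points actually assigned, and the function name is made up for
this example.

```python
import re

# Assumed class ranges (sub-ranges of the U+1100 block, not the exact
# set of assigned code points):
L = "\u1100-\u115F"  # choseong: leading consonants, incl. filler U+115F
V = "\u1160-\u11A7"  # jungseong: vowels, incl. filler U+1160
T = "\u11A8-\u11FF"  # jongseong: trailing consonants

# One syllable: one or more L, one or more V, zero or more T.
SYLLABLE = re.compile(f"[{L}]+[{V}]+[{T}]*")


def is_single_syllable(text: str) -> bool:
    """Return True if the whole string is exactly one L+ V+ T* syllable."""
    return SYLLABLE.fullmatch(text) is not None


# The six-jamo example above: TIKEUT PIEUP / A EU / TIKEUT PIEUP.
example = "\u1103\u1107\u1161\u1173\u11AE\u11B8"
print(is_single_syllable(example))
```

A sequence such as U+1103 U+1161 U+1103 U+1161 is rejected as a single
syllable by this check, because a leading consonant cannot follow a
vowel inside one syllable.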

There are pluses (and corresponding minuses) to both:

- Structurally, (A) is more aligned with the way that other Unicode
characters work, like A + UMLAUT + ACUTE, since all but the first jamo
do not contribute to the width of the character -- i.e. are
non-spacing.

- Option (B) disallows certain combinations that are nonsense, such as
TIKEUT + A + TIKEUT + A + TIKEUT (although both allow combinations
like TIKEUT + TIKEUT + TIKEUT + TIKEUT + A).

However, both take essentially the same number of allocated
characters, and both have essentially the same representational
capabilities.
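
The structural difference between the two models can be sketched in a
few lines of Python. This is only an illustration: jamo are given by
name rather than by code point (the model A code points were never
encoded), the vowel set covers just the names used in this message, and
both function names are invented for the example.

```python
import re
from itertools import product

VOWELS = {"A", "EU"}  # only the vowel names used in the examples here


def model_a_accepts(jamo):
    """Model A (B N*): a base consonant followed by any non-spacing jamo."""
    return bool(jamo) and jamo[0] not in VOWELS


def model_b_accepts(jamo):
    """Model B (L+ V+ T*): True if some choice of leading (L) or trailing
    (T) code for each consonant makes the sequence one syllable."""
    # Each vowel must be V; each consonant may be encoded as L or as T.
    slots = ["V" if j in VOWELS else "LT" for j in jamo]
    return any(
        re.fullmatch(r"L+V+T*", "".join(choice)) is not None
        for choice in product(*slots)
    )


nonsense = ["TIKEUT", "A", "TIKEUT", "A", "TIKEUT"]
stacked = ["TIKEUT", "TIKEUT", "TIKEUT", "TIKEUT", "A"]

for name, seq in (("nonsense", nonsense), ("stacked", stacked)):
    print(name, model_a_accepts(seq), model_b_accepts(seq))
```

Under these assumptions, the first sequence is one syllable only in
model A, while the second is accepted by both models.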

There were long discussions on the best model to use, but we finally
ended up with model B to accommodate the requests of the Korean
national body in the merger between ISO 10646 and Unicode. They
explicitly did not want the characters classified as 'combining' or
'non-spacing', so the new term 'conjoining' was coined instead.

Note: the classification of Mc (combining, yet spacing) was added into
Unicode also because of the merger. In my opinion, it is conceptually
very ill-defined, and not at all consistently applied across the range
of Unicode characters. Luckily, it has very little practical import;
since the Mc's are all of canonical combining class zero, there are
very few if any actual cases where implementations would treat them as
any different than Lo (uncased letters).
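
For anyone who wants to look at the Mc assignments directly, here is a
small Python sketch that lists the Mc characters whose canonical
combining class is not zero. It reads the Unicode data bundled with the
Python interpreter, which is an assumption worth stating: that data is
newer than the version under discussion here, so the result may not
match the statement above. The function name is made up for the example.

```python
import sys
import unicodedata


def mc_with_nonzero_ccc():
    """Return all characters of general category Mc whose canonical
    combining class is non-zero, per the bundled Unicode data."""
    return [
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(ch) == "Mc" and unicodedata.combining(ch) != 0
    ]


# Report which Unicode version the answer is based on.
print("Unicode data version:", unicodedata.unidata_version)
for ch in mc_with_nonzero_ccc():
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"),
          "ccc =", unicodedata.combining(ch))

# The jamo discussed in this thread, for comparison.
for cp in (0x1103, 0x1161, 0x11AE):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)))
```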

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Jungshik Shin" <jshin@mailaps.org>
To: "Unicode Mailing List" <unicode@unicode.org>
Sent: Saturday, May 04, 2002 22:20
Subject: General category of Hangul Conjoining Jamos (U+1100 block)

>
> Hello,
>
> I've always wondered why Hangul Conjoining medial vowels
> (U+1160-U+11A2) and trailing consonants (U+11A8-U+11F9) are
> classified as Lo (letter, other) instead of Mn (combining,
> non-spacing). A couple of days ago, a string of events
> happened and I finally decided to raise this issue in this
> forum. (For some background information, see
> <http://jshin.net/i18n/korean/hunmin.html>.) Here I'm trying to
> present my case for why that change is necessary. Hopefully,
> I'll make it convincing enough for the UTC to make the necessary
> changes.
>
>
> TUS 3.0 (p. 53; it doesn't use regular-expression notation) defines
> a Hangul syllable as
>
> S := L+V+T*
>
> where L,V, and T denote Hangul leading consonants, Hangul
> medial vowels and Hangul trailing consonants, respectively and
> '+' and '*' have their usual RE meanings. An optional Hangul tone
> mark M (U+302E and U+302F) may be added and we have
>
> S := L+V+T*M?
>
> U+302E and U+302F are classified as Mn. I find it hard to
> understand why V and T are put into the Lo category instead of Mn
> while vowels/vowel marks and 'subjoined' consonants in South and
> Southeast Asian scripts are put into Mn (or Mc in some cases).
>
> It seems to me that the 'rendering behavior' of V and T with L('s)
> acting as a base character is similar to that of vowels/vowel marks
> and 'subjoined' consonants with 'head' consonant(s) acting as a base
> character in South and Southeast Asian scripts. Just as vowels/vowel
> marks and 'subjoined' consonants should be kept together with what
> they follow (head consonants) (e.g. they should not be broken across
> two lines), V and T (and M) have to be kept together with L.
> Moreover, applications like terminal emulators should treat V and T
> as taking zero screen width and allocate a sequence of L, V, T the
> same screen width as L (for Hangul Jamos, it's 'double screen
> width'). That requirement is very similar to what's required of a
> sequence of a head consonant, (a) subjoined consonant(s) (as found
> in Tibetan), and (a) vowel/vowel mark(s) in South and Southeast
> Asian scripts.
>
> An implementation like Markus Kuhn's wcwidth.c automatically
> generates the table out of UnicodeData.txt, but it has to make an
> exception for Hangul medial vowels and trailing consonants
> (see http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c). The ICU function
> u_charCellWidth() also returns U_ZERO_WIDTH for Hangul vowels and
> trailing consonants whereas it returns U_FULL_WIDTH for Hangul
> leading consonants
> (http://oss.software.ibm.com/icu/apiref/uchar_8h.html#a360). It's
> to their credit that they noticed the need to make an 'exception'
> for Hangul Jamos, but I'm afraid some implementations may blindly
> rely on UnicodeData.txt. Some developers may feel uncomfortable
> deviating from (what they perceive as) the Unicode standard even
> when contacted (by me or others) about giving special treatment to
> Hangul Conjoining Jamos. To avoid a potential problem arising from
> this possibility, in my opinion it's necessary to make the changes
> I'm suggesting.
>
>
> Although assigning Mn to Hangul vowels and trailing consonants
> appears to pose little problem, Hangul leading consonants
> don't seem to fit the definition of any existing category. When
> used at the beginning of the sequence for a syllable (represented
> with 'LVT?'), a leading consonant can be 'Lo'.
> However, if multiple L's are used in the Jamo sequence for a given
> syllable (that is, 'L{2,}V+T*'), all but the first one are
> combining/non-spacing. I think the same problem exists for
> consonants in some South and Southeast Asian scripts for which
> consonants are only encoded once (i.e. subjoined consonants
> are not encoded separately, as they are in Tibetan). For instance,
> Devanagari consonants (U+0915 - U+0939) are Lo although they can
> be combining when they're not the first consonant in a syllable.
> Given this, I believe assigning Lo to Hangul Conjoining leading
> consonants can be justified unless UTC decides to adopt
> a more fine-grained category scheme than the current one.
>
> In summary, I propose that the general category of Hangul
> Conjoining medial vowels and trailing consonants (U+1160 - U+11FF)
> be changed from Lo (letter, other) to Mn (non-spacing, combining)
> to be in line with and meet the rendering and other requirements of
> Hangul Conjoining Jamos.
>
> Thank you in advance for considering my suggestion,
>
> Jungshik Shin
>
> P.S. The following image may be quite suggestive of what I wrote
> above.
>
> http://chem.skku.ac.kr/~wkpark/trash/xuhpulm.png
>
> P.S.2:
> The fact that exactly the same technique as described in the section
> 'Thai rendering behavior' in the following summary of Thai by
> GNU/Linux/X has been used for Hangul rendering is another indication
> that Hangul Jamos have to be treated similarly to the way
> South and Southeast Asian scripts are treated.
>
> ftp://ftp.nectec.or.th/pub/thailinux/cvs/docs/thaisupp/thaisupp.html
>
>
>


