General category of Hangul Conjoining Jamos (U+1100 block)

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun May 05 2002 - 01:20:52 EDT

Previous message: Mark Davis: "Re: evertype.com"
Next in thread: Mark Davis: "Re: General category of Hangul Conjoining Jamos (U+1100 block)"
Reply: Mark Davis: "Re: General category of Hangul Conjoining Jamos (U+1100 block)"

Hello,

I've always wondered why Hangul Conjoining medial vowels
(U+1160-U+11A2) and trailing consonants (U+11A8-U+11F9) are
classified as Lo (letter, other) instead of Mn (combining,
non-spacing). A couple of days ago, a string of events
happened and I finally decided to raise this issue in this
forum. (For some background information, see
<http://jshin.net/i18n/korean/hunmin.html>.) Here I'm trying to
present my case for why that change is necessary. Hopefully,
I'll make it convincing enough for the UTC to make the necessary changes.

TUS 3.0 (p. 53; it doesn't use regular expression notation) defines a
Hangul syllable as

S := L+V+T*

where L, V, and T denote Hangul leading consonants, Hangul
medial vowels, and Hangul trailing consonants, respectively, and
'+' and '*' have their usual RE meanings. An optional Hangul tone
mark M (U+302E or U+302F) may be added, and we have

S := L+V+T*M?
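
As a rough illustration (not part of TUS), the pattern above can be written
as a Python regular expression. The V, T, and M ranges are the ones quoted
in this message; the L range U+1100-U+115F is my assumption, and the names
SYLLABLE and split_syllables are made up for this sketch.

```python
import re

# Code point ranges. V, T and M are the ranges quoted in the message;
# the L range (U+1100-U+115F) is an assumption for illustration.
L = "\u1100-\u115F"   # leading consonants (assumed range)
V = "\u1160-\u11A2"   # medial vowels
T = "\u11A8-\u11F9"   # trailing consonants
M = "\u302E\u302F"    # optional Hangul tone marks

# S := L+V+T*M?
SYLLABLE = re.compile(f"[{L}]+[{V}]+[{T}]*[{M}]?")


def split_syllables(text):
    """Return every substring of text that matches L+V+T*M?."""
    return SYLLABLE.findall(text)


if __name__ == "__main__":
    # Two decomposed syllables: L V T followed by L V T.
    sample = "\u1112\u1161\u11AB\u1100\u1173\u11AF"
    for syllable in split_syllables(sample):
        print([f"U+{ord(ch):04X}" for ch in syllable])
```

Each match is a unit that a line breaker or cursor-movement routine would
want to keep together.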

U+302E and U+302F are classified as Mn. I find it hard to understand
why V and T are put into the Lo category instead of Mn, while vowels/vowel
marks and 'subjoined' consonants in South and Southeast Asian
scripts are put into Mn (or Mc in some cases).

It seems to me that the 'rendering behavior' of V and T, with L (or L's)
acting as a base character, is similar to that of vowels/vowel marks and
'subjoined' consonants with 'head' consonant(s) acting as a base character
in South and Southeast Asian scripts. Just as vowels/vowel marks and
'subjoined' consonants should be kept together with what they follow
(head consonants) (e.g. they should not be broken across two lines),
V and T (and M) have to be kept together with L. Moreover, applications
like terminal emulators should treat V and T as taking zero screen width
and allocate a sequence of L, V, T the same screen width as L (for Hangul
Jamos, it's 'double screen width'). That requirement is very similar
to what's required of a sequence of a head consonant, (a) subjoined
consonant(s) (as found in Tibetan), and (a) vowel/vowel mark(s)
in South and Southeast Asian scripts.
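
To make the width requirement concrete, here is a minimal Python sketch of
the column count a terminal emulator might compute. It is not taken from any
existing implementation. The L range U+1100-U+115F and the width of 1 for
everything outside the Jamo ranges are simplifying assumptions, the function
names are hypothetical, and a run of several L's is not handled specially
(each L is counted as two columns).

```python
def jamo_cell_width(ch):
    """Screen columns for one character, following the rule in the text."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return 2  # leading consonant: double width (range is an assumption)
    if 0x1160 <= cp <= 0x11A2 or 0x11A8 <= cp <= 0x11F9:
        return 0  # medial vowel or trailing consonant: zero width
    return 1      # simplification: everything else takes one column


def sequence_width(text):
    """Total screen columns for a string of conjoining Jamos."""
    return sum(jamo_cell_width(ch) for ch in text)


if __name__ == "__main__":
    one_syllable = "\u1112\u1161\u11AB"  # L V T
    print(sequence_width(one_syllable))  # same width as a single L
```

With these rules an L V T sequence occupies exactly the two columns that its
L would occupy alone.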

An implementation like Markus Kuhn's wcwidth.c automatically
generates its table from UnicodeData.txt, but it has to make an
exception for Hangul medial vowels and trailing consonants
(see http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c). The ICU function
u_charCellWidth() also returns U_ZERO_WIDTH for Hangul vowels and trailing
consonants, whereas it returns U_FULL_WIDTH for Hangul leading consonants
(http://oss.software.ibm.com/icu/apiref/uchar_8h.html#a360). It's to
their credit that they noticed the need to make an 'exception' for
Hangul Jamos, but I'm afraid some implementations may blindly rely on
UnicodeData.txt. Some developers may feel uncomfortable deviating from
(what they perceive as) the Unicode standard even when asked (by
me or others) to give Hangul Conjoining Jamos special treatment.
To avoid the potential problems arising from this, in my opinion
it's necessary to make the changes I'm suggesting.
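
The difference between relying on the general category alone and adding the
exception can be sketched in Python with the standard unicodedata module.
This is only an illustration of the idea, not a port of wcwidth.c or ICU:
the set of zero-width categories and the function names are assumptions of
mine, and the exception range U+1160-U+11FF is the one used in my summary
below.

```python
import unicodedata

# Assumption: categories a width table might treat as zero width.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def zero_width_by_category(ch):
    """Decision made purely from the general category."""
    return unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES


def zero_width_with_jamo_exception(ch):
    """Same decision, plus a hand-written exception for Hangul V and T."""
    if 0x1160 <= ord(ch) <= 0x11FF:
        return True
    return zero_width_by_category(ch)


if __name__ == "__main__":
    vowel = "\u1161"  # a Hangul medial vowel, general category Lo
    print(unicodedata.category(vowel))
    print(zero_width_by_category(vowel))
    print(zero_width_with_jamo_exception(vowel))
```

An implementation that uses only the first function treats a medial vowel
like any other letter; the second one needs knowledge that the category
data does not carry.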

Although assigning Mn to Hangul vowels and trailing consonants
appears to pose little problem, Hangul leading consonants
don't seem to fit the definition of any existing category. When used
at the beginning of the sequence for a syllable (represented with
'LVT?'), a leading consonant can be 'Lo'.
However, if multiple L's are used in the Jamo sequence for a given
syllable (that is, 'L{2,}V+T*'), all but the first one are
combining/non-spacing. I think the same problem exists for
consonants in some South and Southeast Asian scripts in which
consonants are encoded only once (i.e. subjoined consonants
are not encoded separately, as they are in Tibetan). For instance,
Devanagari consonants (U+0915 - U+0939) are Lo although they can
be combining when they're not the first consonant in a syllable.
Given this, I believe assigning Lo to Hangul Conjoining leading
consonants can be justified unless the UTC decides to adopt
a more fine-grained category scheme than the current one.

In summary, I propose that the general category of Hangul
Conjoining medial vowels and trailing consonants (U+1160 - U+11FF)
be changed from Lo (letter, other) to Mn (non-spacing, combining)
so that it is in line with, and meets, the rendering and other
requirements of Hangul Conjoining Jamos.

Thank you in advance for considering my suggestion,

Jungshik Shin

P.S. The following image may be quite suggestive of what I wrote
above.

http://chem.skku.ac.kr/~wkpark/trash/xuhpulm.png

P.S.2:
The fact that exactly the same technique as described in the section
'Thai rendering behavior' in the following summary of Thai support in
GNU/Linux/X has been used for Hangul rendering is another indication
that Hangul Jamos have to be treated similarly to the way
South and Southeast Asian scripts are treated.

ftp://ftp.nectec.or.th/pub/thailinux/cvs/docs/thaisupp/thaisupp.html


This archive was generated by hypermail 2.1.2 : Sun May 05 2002 - 02:23:04 EDT