Hangul syllable boundary and opentype fonts/rendering

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Apr 07 2003 - 01:07:39 EDT

Next message: Abdij Bhat: "UNICODE-non-clashing ASCII character needed"

Previous message: Tex Texin: "Opera 7 supports UTF-32"

Note to those on the Unicode list: we've been discussing opentype support
of Korean script on the opentype list, but there are some issues that
can be better answered on the Unicode list, so I'm copying this
to the Unicode list as well. Because the opentype list archive is not
available to the public, I put up two of my previous messages (which
quote Paul's reply) at

http://jshin.net/i18n/korean/ot.msg.1.txt
http://jshin.net/i18n/korean/ot.msg.2.txt

It'd be great if you could share your insights on the issue.

Regards,

Jungshik

Paul Nelson (TYPOGRAPHY) wrote:

Dear Paul,

Thank you for your interest in the discussion.
I wish I had contacted you much earlier than I did.

>First, I need to begin by stating that Unicode is not a linguistic
>encoding, nor is it supposed to be.

It is certainly not for Hangul as it is now, which I regard as
very unfortunate, but I'm not sure if it's not supposed to be. Numerous
threads on the encoding of Indic (and related) scripts in South Asia
and Southeast Asia on the Unicode list appear to indicate that UTC
and ISO JTC1/SC2/WG2 have been trying to make the encoding model
for them in Unicode/ISO 10646 reflect the (linguistic) principles
of those scripts as faithfully as possible (or whenever
it makes sense).

> With that in mind, it is important
>to identify the fact that we are constrained to work within the bound of
>the Unicode/ISO character encoding specifications.
>
>Here is data from the Unicode site:
>
>1100;HANGUL CHOSEONG KIYEOK;Lo;0;L;;;;;N;;g *;;;
>1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;;;;;N;;gg *;;;
>

I'm well aware of this, as I at least implied in my two previous messages,
and I think requesting to make *all* Jamos in the U+1100 block atomic
(regardless of whether they're clusters or not) was one of several
blunders made by the South Korean standards body.

JS> Not supporting composition of cluster/complex Jamos made up of
JS> simple/basic Jamos (for instance, U+1101 is nothing but a 'presentation
JS> form' of 'U+1100 U+1100') just because they're given separate codepoints
JS> as presentation forms in Unicode is squarely against the principles of
JS> Korean script as envisioned by its creators in the 15th century [1].
JS> Those complex/cluster Jamos got encoded (e.g. U+1133 =
JS> U+1109 U+1107 U+1100) not because they're in any way superior to or more
JS> fundamental than those NOT separately encoded (e.g. U+1105 U+1107
JS> U+1107). They were just *lucky* to be spotted by Korean linguists
JS> when the list was compiled and submitted to ISO/IEC JTC1/SC2/WG2 in
JS> the early 1990s [2].

I wish UTC had not been so eager to honor that request, especially
considering that it's now impossible to mend this problem because
UTC has committed itself NOT to modify the canonical composition/
decomposition of any existing characters. However, this issue can
be partly resolved/worked around by introducing tailored (canonical)
(de)composition, which is on the table for UTC if I understand correctly.
BTW, Kent Karlsson wrote a paper on the issue.
(Kent, have you put your paper (draft) somewhere on the net? Could
you give us the URL if you did?)

>Because of this data, I cannot state that the form of U+1100 U+1100 will
>result in U+1101. As we see above, the U+1101 has no decomposition form
>specified. Thus, there is no need to make an engine to support this.
>Additionally, I would argue that I cannot make an engine to support this
>form as you suggest as I would have to violate Unicode properties to do
>so.
>

Well, I'm afraid UTC got Unicode sort of 'in conflict with' (not
exactly a conflict but a point to be made clearer) Unicode
itself by NOT making Jamo clusters canonically equivalent to
sequences of basic/simple Jamos.
In section 3.11 of Unicode 3.2, a Hangul syllable is defined as

S := (L+ V+ T* | L* S1 V* T* | L* S2 T*)
where S1 is an LV type syllable and S2 is an LVT type syllable.
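
To make the definition concrete, here is a minimal sketch that transcribes
it into a regular expression. The code point ranges for L, V and T are my
own assumption (roughly the Unicode 3.2 conjoining Jamo ranges, fillers
included), and the names SYLLABLE and syllables are only illustrative.

```python
import re

# Assumed character classes (not spelled out in the definition itself):
#   L: choseong  U+1100..U+115F (U+115F is the choseong filler)
#   V: jungseong U+1160..U+11A2 (U+1160 is the jungseong filler)
#   T: jongseong U+11A8..U+11F9
L = "[\u1100-\u115F]"
V = "[\u1160-\u11A2]"
T = "[\u11A8-\u11F9]"

# S1: precomposed LV syllables are every 28th code point from U+AC00.
LV = "[" + "".join(chr(0xAC00 + 28 * i) for i in range(19 * 21)) + "]"
# S2: any other precomposed syllable is of LVT type.
LVT = "(?:(?!" + LV + ")[\uAC00-\uD7A3])"

SYLLABLE = re.compile(
    f"{L}+{V}+{T}*"        # L+ V+ T*
    f"|{L}*{LV}{V}*{T}*"   # L* S1 V* T*
    f"|{L}*{LVT}{T}*"      # L* S2 T*
)

def syllables(text):
    """Return the substrings of text that match the syllable definition."""
    return SYLLABLE.findall(text)

# 'U+1109 U+AC01' is matched as ONE syllable by the third alternative.
print(len(syllables("\u1109\uAC01")))
```

Under these assumptions a leading consonant followed by a precomposed
syllable comes out as a single match, not two.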

Now, it's rather silent as to how sequences like 'U+1109 U+AC01'
( = U+1109 U+1100 U+1161 U+11A8) are supposed to be rendered. If
'U+1109 U+1100' = U+112D, there'd be no issue at all. Unfortunately,
U+1109 U+1100 is not canonically equivalent to U+112D and it never
will be because NFC/NFD were frozen. However, I think rendering
engines/layout libraries like Uniscribe and OT fonts can take some
liberty to interpret and best match what users intend when they come
across 'U+1109 U+AC01'. I also believe that this is more in the
spirit of Unicode 3.2 section 3.11 and UTR #29, according to which
'U+1109 U+AC01' is regarded as forming a *single* grapheme (syllable in
this case) instead of two graphemes. In other words, 'U+1109 U+AC01'
has to be treated and rendered as a unit. So, if it's followed by
'U+302E' (Hangul Single Dot Tone Mark), U+302E has to be put to the
left of the cluster 'U+1109 U+AC01' (=> U+1109 U+1100 U+1161 U+11A8 =>
U+112D U+1161 U+11A8) instead of between U+1109 and U+AC01 (to
the left of U+AC01).
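
The normalization facts in the paragraph above can be checked
mechanically. This is a minimal sketch using Python's standard
unicodedata module, which is my choice of tool for illustration; it
reflects whatever Unicode version the Python build ships with, not
Unicode 3.2 specifically.

```python
import unicodedata

# Canonical decomposition of the precomposed syllable U+AC01.
nfd = unicodedata.normalize("NFD", "\uAC01")
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+1100', 'U+1161', 'U+11A8']

# NFC leaves the basic Jamo sequence 'U+1109 U+1100' untouched, so it is
# not canonically equivalent to the cluster Jamo U+112D.
seq = "\u1109\u1100"
unchanged = unicodedata.normalize("NFC", seq) == seq
print(unchanged)  # True

# The cluster Jamos themselves carry no decomposition mapping at all.
no_mapping = (unicodedata.decomposition("\u112D") == ""
              and unicodedata.decomposition("\u1101") == "")
print(no_mapping)  # True
```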

>As you have pointed out, this makes Unicode not handle the Korean script
>as it was envisioned by its creators in the 15th century. Unicode is not
>specifically designed for the purpose of handling scripts in the way in
>which they were designed to begin with, but to be able to correctly
>represent text in a Uniform manner that allows for unambiguous exchange
>of data.

The way I think Hangul should have been encoded (closely
matching the intent of its inventors) would have paved a much cleaner
way for a uniform representation of Korean script than the current
Unicode does. Most, if not all, of the blame for this problem has
to be taken not by UTC but by my government, whose incompetence and
short-sightedness stand in stark contrast with the foresight and
competence of the Indian government, which came up with infinitely
better encoding models for Indic scripts (in ISCII and Unicode/ISO 10646),
which are similar to Korean script in a number of aspects. Anyway,
we have to live with the reality and, IMHO, a possible way to
work around it is introducing tailored composition/decomposition
that is optional on paper but is implicitly semi-official/required.
[1]

>In your example, one person might type U+1100 U+1100 while another types
>U+1101. This would lead to confusion in the "correct" manner to
>represent the encoding of the shape that looks like U+1101. By following
>Unicode as it exists (with its imperfections) we have the ability to
>support the open exchange of text and the digital recording of text that
>we can preserve into the future.

As you know very well, there are multiple "correct" ways to represent
identical characters/letters in Unicode, and the way to solve
problems arising from multiple representations is canonical
composition/decomposition. Unfortunately, for Hangul, canonical
composition of complex/cluster Jamos out of basic/simple Jamo
sequences is missing, but I hope that the issue will be partly solved by
introducing tailoring of composition/decomposition as mentioned above. In
the meantime, what I suggested is NOT to make MS products (Uniscribe
in particular) generate text that is not compliant with Unicode as it is
now, BUT to make them generously accept 'decomposed cluster/complex Jamos'
and treat them as their corresponding 'precomposed' forms when they
come from outside. This would not hamper, in any way, the open exchange of
pre-1933 orthography Korean text that all of us are pursuing. Moreover,
putting this additional 'composition' into the OT layout table (along
with some other places along the stream if necessary) of OT fonts would
not decrease but increase the chance of getting identical rendering
results across platforms where OT fonts are used.
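
To illustrate what 'generously accepting' decomposed clusters could look
like on the input side of a layout engine, here is a rough sketch. The
table holds only the three clusters mentioned in this message, not a
complete list, and CLUSTERS and compose_clusters are hypothetical names
of mine, not part of Uniscribe or of any OT specification. The result is
meant for glyph selection only; the stored text is left as it came in.

```python
import unicodedata

# Illustrative, deliberately incomplete table:
# basic Jamo sequence -> separately encoded cluster Jamo.
CLUSTERS = {
    "\u1100\u1100": "\u1101",        # KIYEOK KIYEOK -> SSANGKIYEOK
    "\u1109\u1100": "\u112D",        # SIOS KIYEOK -> SIOS-KIYEOK
    "\u1109\u1107\u1100": "\u1133",  # SIOS PIEUP KIYEOK -> SIOS-PIEUP-KIYEOK
}
# Try longer sequences first so U+1133 wins over any shorter prefix.
KEYS = sorted(CLUSTERS, key=len, reverse=True)

def compose_clusters(text):
    """Decompose precomposed syllables, then fold known Jamo sequences."""
    text = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.append(CLUSTERS[key])
                i += len(key)
                break
        else:
            # No cluster starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

# 'U+1109 U+AC01' -> U+112D U+1161 U+11A8
print([f"U+{ord(c):04X}" for c in compose_clusters("\u1109\uAC01")])
```

A real table would have to cover leading consonants, vowels and trailing
consonants alike, and that is the part a tailored (de)composition would
need to standardize.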

>UTR #29 is a subject that I will not address at this point. I have not
>studied it with regards to Korean, but would not be surprised if there
>are some errors present.

Are you saying that there are some errors in UTR #29? Well,
I'm not saying that it's perfect (all of us are prone to make
mistakes). However, it's NOT an error by any means for UTR #29
to say that sequences such as 'U+1100 U+AC00' are a single grapheme
instead of two. They have always been considered a single grapheme
since Unicode 2.0 (the earliest Unicode standard
for which I (used to) have a hardcopy). Unicode 3.0 might not have
been as clear as possible about this (I think it was clear enough),
but any remaining doubt was cleared up by Unicode 3.2 section 3.11 and
UTR #29.

> It is important to know that we do not consider
>that the precomposed Jamo characters (like U+AC01) are valid inputs for
>composing an Old Hangul jamo. Thus, from my perspective there will
>*always* be a syllable boundary before and after each precomposed Jamo
>form.
>

Well, whether you consider it valid or not, UTR #29 and Unicode 3.2
section 3.11 are pretty clear that there's NO
syllable boundary in sequences like {L LV}, {L LVT}, {LV V}, {LVT T},
while the document at
http://www.microsoft.com/typography/otfntdev/hangulot/default.htm
and you consider them as two 'syllables' (graphemes) with the syllable
boundary between L and LV/LVT, LV and V, and LVT and T. Considering them
as two graphemes is a clear violation of the Unicode standard you
want to abide by.
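
The pairwise view of these rules can be sketched as follows. The table
encodes the no-break pairs as I read them from UTR #29 (L before L, V, LV
or LVT; V or LV before V or T; T or LVT before T); the code point ranges
and the names hangul_class, NO_BREAK_AFTER and split_syllables are my own
assumptions for illustration.

```python
S_BASE, S_COUNT, T_COUNT = 0xAC00, 11172, 28

def hangul_class(ch):
    """Classify a character as L, V, T, LV, LVT, or X (anything else)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if S_BASE <= cp < S_BASE + S_COUNT:
        # Precomposed syllables without a trailing consonant are LV.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    return "X"

# For each class, the classes that may follow it WITHOUT a boundary.
NO_BREAK_AFTER = {
    "L": {"L", "V", "LV", "LVT"},
    "V": {"V", "T"},
    "LV": {"V", "T"},
    "T": {"T"},
    "LVT": {"T"},
}

def split_syllables(text):
    """Split text into units, breaking only where the table allows it."""
    units = []
    for ch in text:
        prev = hangul_class(units[-1][-1]) if units else None
        if prev is not None and hangul_class(ch) in NO_BREAK_AFTER.get(prev, ()):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# {L LVT} stays together; two precomposed syllables are split.
print(len(split_syllables("\u1109\uAC01")), len(split_syllables("\uAC01\uAC00")))
```

With this table, {L LV}, {L LVT}, {LV V} and {LVT T} each come out as one
unit, with no ZWJ involved.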

>I find this discussion very interesting because some of the behaviors
>you are describing put the output in the format of Old Hangul
>combinations in conflict with the expected behavior of modern Hangul as
>some of the composable forms are written on full spaces in an uncombined
>manner.

I'm not sure I'm following you here. Could you give an example sequence
with Unicode code points?

> This is a significant issue that has a huge impact on our
>customers. Perhaps the way this could be handled is to specify that the
>ZERO WIDTH JOINER must be used between characters that should be
>combined. That way a user could type modern Hangul as they can now with
>correct results, but still have the option of forming Old Hangul
>clusters using the same set of characters.

No, I don't think there's any need for ZWJ for Hangul. This
is not just theoretical speculation: I have two actual
implementations (of UTR #29 and Unicode 3.2 section 3.11 syllable
boundary analysis) and so far I haven't found any problem with them. As I
explained in my message, and as UTR #29 and Unicode 3.2 section 3.11
make clear, there's absolutely no need to use ZWJ for Hangul text.
Syllable boundaries can be clearly identified without using ZWJ at
all. I'm pretty sure experts on UTC would jump out of their seats
immediately on hearing that ZWJ is necessary for Korean Hangul.

If users want a break between U+1100 and U+AC01, they have to enter
U+1160 (Hangul Jungseong Filler) between U+1100 and U+AC01
to turn U+1100 into a proper syllable (U+1100 U+1160).

>It would be wonderful if you can help us understand how Old Hangul can
>work the best within the constraints of Unicode in which we must work.

I'm more than willing to help you with that. However, we have to
stand on the same ground as to what the constraints of Unicode are
before that.

As I wrote above, I believe there's a bit of 'internal inconsistency' in
Unicode and I'm hoping that the problem will be resolved before Unicode
4.0 comes out by introducing tailoring of composition/decomposition for
Hangul Jamos that will narrow the gap between Korean script as created
by its inventors in the 15th century and the Unicode encoding model of
Korean script.

>We also need to understand how to allow the majority of users today who
>use modern Hangul and get the results they expect for current usage
>while keeping it possible for scholars and others to continue to
>represent Old Hangul with the understanding that it would be in a manner
>different than the Hangul script was originally conceived.
>

Perhaps I failed to make it clear, but what I suggested to you does not
require the majority of users to change anything they've been doing.
They can keep working exactly the same way as they do now. What I suggested
is not to replace something in the current practice with something else
but to add to what's being done. On the other hand, having 'standard'
libraries that provide the (canonical) decomposition of cluster/complex
Jamos into basic/simple Jamo sequences and embedding a similar
'apparatus' into OT fonts would be a great boon for Korean
linguists who sometimes need to work at the 'genuinely atomic' level.

Jungshik


This archive was generated by hypermail 2.1.5 : Mon Apr 07 2003 - 02:03:54 EDT