RE: Locale ID's again: simplified vs. traditional

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Wed Oct 04 2000 - 12:52:25 EDT


On Wed, 4 Oct 2000 Jukka.Korpela@hut.fi wrote:

> Does Unicode encode traditional and simplified Chinese characters
> separately, or is the difference considered as glyph variation only,
> to be indicated (if desired) at higher protocol levels?

Disclaimer: This is written from the view of a "traditional" user,
specifically for Chinese (language).

Unicode encodes them separately. Instead of thinking in terms of pairs
like Traditional vs. Simplified, or Chinese vs. Japanese, think in terms
of a collection of characters which are by default "traditional", and of
which *some* were selectively "simplified" at certain times and places
while the others were retained as-is. In Japanese, this happened in the
post-war period. In Chinese, this happened in Communist-controlled areas
in the 1960's, while most other Chinese-writing areas remained
unchanged--thus the current situation of CN and SG vs. TW and HK. (Some
generalizations are made in this paragraph; e.g., informal
simplifications, earlier Republican simplifications, etc. are ignored for
now.)

One kind of simplification is the creation of a new character, such as
U+885B -> U+536B, like changing <before> -> <b4>; this is 1-to-1. Another
is to adopt what was formerly an informal simplification, such as
U+570B -> U+56FD, like changing <and> -> <&> (such forms were already
acceptable in informal contexts such as personal letters, even in
supposedly "traditional" locales like TW/HK). Another involves merger,
such as yun 'cloud' U+96F2 and yun '(archaic) to speak' U+4E91, which are
both written U+4E91 after simplification (which makes undoing it
difficult), like merging <right> and <rite> -> <rite>. (If my English
analogies sound vulgar or contain generalizations, they are meant to
convey the dislike that some people have for simplifications.) Yet
another kind of change, often mistaken for simplification, is a glyph
change, as with U+9AA8 (see the third character from the left at
http://charts.unicode.org/unihan/unihan.acgi$0x9AA8).
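
To make the "merger makes undoing difficult" point concrete, here is a
minimal Python sketch. It uses only the code points named in this message;
the table T2S and the helper names are made up for illustration and are in
no way a real conversion table.

```python
# Traditional -> simplified pairs taken from the examples in this message.
# Illustrative only: a real conversion table would be far larger.
T2S = {
    "\u885B": "\u536B",  # new character created (1-to-1)
    "\u570B": "\u56FD",  # formerly informal form adopted
    "\u96F2": "\u4E91",  # yun 'cloud', merged into U+4E91
    "\u4E91": "\u4E91",  # yun '(archaic) to speak', written the same as before
}


def simplify(text):
    """Map each character through T2S; unknown characters pass through."""
    return "".join(T2S.get(ch, ch) for ch in text)


def reverse_candidates(table):
    """Invert the table, collecting every traditional source per result."""
    inverse = {}
    for trad, simp in table.items():
        inverse.setdefault(simp, set()).add(trad)
    return inverse


S2T = reverse_candidates(T2S)

# Simplified characters with more than one possible traditional source
# cannot be converted back without looking at the surrounding context.
ambiguous = {s: sorted(ts) for s, ts in S2T.items() if len(ts) > 1}
```

Here `ambiguous` ends up holding only U+4E91, with U+4E91 and U+96F2 as
its candidates, while the 1-to-1 cases invert cleanly.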

CN/SG and JP have taken different steps towards simplification, starting
from the "traditional" form represented by TW/HK; e.g., U+6C23 -> U+6C14
in CN/SG, but U+6C23 -> U+6C17 in JP.
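
One way to picture this divergence is a table keyed by the "traditional"
character and then by locale. The sketch below is only an illustration
built from the single example above; the names REFORMED and local_form
are hypothetical.

```python
# One "traditional" character, different reformed characters per locale.
# Only the U+6C23 example from this message is included.
REFORMED = {
    "\u6C23": {"CN": "\u6C14", "SG": "\u6C14", "JP": "\u6C17"},
}


def local_form(ch, locale):
    """Return the character used in the given locale.

    Locales with no entry (here TW and HK) keep the original character.
    """
    return REFORMED.get(ch, {}).get(locale, ch)
```

Note that all three results are distinct Unicode characters, not three
glyphs of one character.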

Separate from all of this are glyph variants, as in U+9AA8 (see above).
Text printed in CN in "traditional" form (yes, it does happen) will use
the glyph with the |- like corner in the upper portion, not -|. Chinese
generally do not care about glyph variants as much as Japanese, though.

 
> My mental model is the following:
> - there is a very large number of Chinese characters in use

True. For the Chinese situation, secondary education covers some
2,000-3,000 (add some 1,000-2,000 more for the highly literate, like those
into literature or history); some 5,000-8,000 in a desktop dictionary;
some 15,000-20,000 for the threshold of unique ones; some 50,000+ if one
includes lots of obscure variants of those 15,000-20,000. (All of these
are estimates.)

> - some of them are encoded in Big5, and some of them are encoded
> in GB standards

True, and there is overlap between the two. (I assume "GB" here means
GB2312.)

> - Big5 is intended for use with display as traditional glyphs and
> GB for simplified, but there is no logical necessity to that
> (though in practice recoding would be needed in order to display
> data in the other encoding)

This is a confusing question; see my explanation above.

> - Unicode contains the union of Big5 and GB characters

True, as well as the union of other character sets.
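
The overlap and the union can be checked roughly with Python's built-in
"big5" and "gb2312" codecs. This is a sketch under assumptions: it tests
whatever repertoire those codecs implement, the helper in_charset is
hypothetical, and U+4E2D is my own choice of a character present in both
sets (it is not one of the examples above).

```python
def in_charset(ch, codec):
    """Return True if the character can be encoded with the named codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "U+570B": "\u570B",  # "traditional" form from the example above
    "U+56FD": "\u56FD",  # "simplified" form from the example above
    "U+4E2D": "\u4E2D",  # assumed example of a character in both sets
}

# For each sample, record which encodings can represent it.
coverage = {
    name: {codec: in_charset(ch, codec) for codec in ("big5", "gb2312", "utf-8")}
    for name, ch in samples.items()
}
```

U+570B is encodable in Big5 but not in GB2312, U+4E2D in both, and all
three in UTF-8, which is the sense in which Unicode holds the union.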

> - you could thus recode Big5 and GB to Unicode, and you could leave
> the glyph issue unspecified (so that the recipient user could
> select either traditional or simplified glyphs).

False. The recipient can only choose CN/TW/JP/etc locale-based glyphs
(see the U+9AA8 example above).

Thomas Chan
tc31@cornell.edu


