From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Feb 14 2003 - 11:42:23 EST
On Fri, 14 Feb 2003 07:45:44 -0800 (PST), Thomas Chan wrote:
> I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
> simple heuristic for modern text, since it occupies position #11 in at
> least one frequency list (compared to #15 for the above-cited ge4), and as
> far as I know, U+8FD9 is not one of those ancient characters that have
> been promoted/reused as a simplified form.
On the other hand I don't think that zhe4 is used in Cantonese, whereas I think
that ge4 is, so it wouldn't be so good for pages written in Cantonese (not that
I have ever seen any, but I'm sure there must be some). Probably even a simple
heuristic would need to try several common characters such as ge4 and zhe4.
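As a rough illustration of a multi-character heuristic, here is a minimal Python sketch. The function name `guess_script` and the choice of only these two marker pairs are assumptions made for the example, not an established method. It counts the simplified and traditional forms of ge4 (U+4E2A / U+500B) and zhe4 (U+8FD9 / U+9019) and reports which set appears.

```python
# Hypothetical marker table: pinyin label -> (simplified form, traditional form).
MARKERS = {
    "ge4": ("\u4e2a", "\u500b"),
    "zhe4": ("\u8fd9", "\u9019"),
}

def guess_script(text):
    """Guess the orthography of `text` from a few very common characters."""
    simp = sum(text.count(s) for s, _ in MARKERS.values())
    trad = sum(text.count(t) for _, t in MARKERS.values())
    if simp == 0 and trad == 0:
        return "unknown"      # none of the marker characters occur
    if simp and trad:
        return "mixed"        # both forms occur, e.g. a dictionary page
    return "simplified" if simp else "traditional"
```

A page containing only U+9019 and U+500B would come back as "traditional", while a single simplified ge4 in an otherwise traditional text flips the answer to "mixed", which is exactly the kind of case that trips up a one-character test.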
> Aren't such texts by default "traditional"? "Simplified" text, besides
> using simplified form characters, usually also entails refraining from
> using variant forms (according to PRC definitions of what is a variant).
Probably true, but the point that I was making is that the simplified ge4 in the
text would confuse a simple heuristic.
> There are even some cases of semi-simplified forms where one half of a
> character might have been simplified according to pre-1964 rules, but the
> simplification rule for the other half has to wait until 1964. But I
> think these might've been missed by Unicode, like some of the
> ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
> temporarily different (from the PRC's) schemes prior to 1976.
I think that most of the 1977 simplifications have already been encoded in
Unicode, but any that haven't and the hybrid semi-simplified forms found in some
printed books from the 50s and 60s will probably be included in CJK-C along with
the rest of its unnecessary baggage (excuse my distaste for CJK-C, but I think
that the Ideographic Rapporteur Group is indiscriminately collecting characters
that in most cases probably do not need to be encoded, just for the sake of
encoding as many characters as possible - 24,000+ and counting - see the "CJK
Extension C Project" at http://www.cse.cuhk.edu.hk/~irg/irg/extc/CJK_Ext_C.htm
for details).
> >Now if Hanyu Da Cidian were to be put onto the internet ...
>
> How about the one here? http://202.109.114.220/
Yes, this is an excellent resource. Although the Hanyu Da Cidian look-up only
gives definitions, and none of the extremely useful quotations found in the
printed book, it still mixes traditional form head words with simplified
definitions, so that both ge4 simplified and traditional are found together on
the same page if you search under U+500B and look at the appended compound
words. I guess that according to Thomas's definition of Simplified Chinese, this
makes it a Traditional Chinese page, even though most of the text is in
simplified Chinese !?
Incidentally, for those interested in UTF-16 Chinese web pages, I noticed that
this site is encoded as UTF-16LE.
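For anyone wanting to check the byte order of such a page, the following Python sketch looks only at a leading byte order mark. The helper name `sniff_utf16` is made up for the example, and a page served without a BOM would need other evidence (such as the HTTP charset header).

```python
import codecs

def sniff_utf16(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if `data` starts with that BOM, else None."""
    # The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
    # so rule it out first to avoid a false match.
    if data.startswith(codecs.BOM_UTF32_LE):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

# In UTF-16LE the low byte comes first, so zhe4 (U+8FD9) is stored as D9 8F.
sample = codecs.BOM_UTF16_LE + "\u8fd9".encode("utf-16-le")
```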
On a related matter, I was wondering about language tagging for Chinese. "zh-CN"
and "zh-TW" are used quite frequently, but what do they imply ? Is an HTML page
tagged as "zh-CN" expected to be composed of simplified characters, and a page
tagged as "zh-TW" expected to be traditional characters ? Or does the CN or TW
imply nothing about the orthography of the text, in which case the CN or TW may
simply allow selection of an appropriate font ? What if I am writing a Chinese
page here in England - should I put "zh-UK" or should I make a political
decision as to whose side I'm on, and use "zh-CN" or "zh-TW" ?
On the other hand, "zh-simplified" and "zh-traditional" are sometimes found.
These tags are less politically charged, but miss out on mixed
simplified/traditional pages. Is there a "zh-mixed" ?
Andrew