RE: Re[2]: Errata in language/script list

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Tue Jul 31 2001 - 10:49:59 EDT


On Tue, 31 Jul 2001, Marco Cimarosti wrote:

> BTW, I notice that a single "Chinese" entry is listed. This should probably
> be split in several entries for the various Chinese languages (or
> "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy
> because the different languages could need different information.

In the absence of additional qualifying information, I think "Chinese"
would be interpreted as the most salient variety, the modern standard
written Chinese (based on Mandarin Chinese; SIL "CHN") in dominant use
today by speakers of all Chinese languages.

However, some people might have questions asking for details like
"Does Unicode have traditional characters?" and/or "Does Unicode have
simplified characters?"--it might even be worth pointing out that both can
be used concurrently, which is not what people accustomed to the likes of
GB2312, Big5, etc. would expect.
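
To make the "concurrently" point concrete, here is a minimal Python sketch. The two sample characters (U+6F22, the traditional form of "Han", and U+6C49, its simplified form) and the helper name can_encode are my own choices for illustration. It assumes Python's bundled "gb2312" and "big5" codecs are a fair stand-in for those character sets.

```python
# Illustrative sample: U+6F22 is the traditional form and U+6C49 the
# simplified form of the same character ("Han").
TRADITIONAL = "\u6f22"
SIMPLIFIED = "\u6c49"
mixed = TRADITIONAL + SIMPLIFIED


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A Unicode encoding should accept the mixed string; each legacy codec
# is expected to reject the form that lies outside its own repertoire.
for codec in ("utf-8", "gb2312", "big5"):
    print(codec, can_encode(mixed, codec))
```

The mixed string should be accepted only by the Unicode encoding, while each legacy codec still accepts the form it was designed for.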

Still others might ask, "Does Unicode have Cantonese/Hong Kong
characters?" (the terms are not exactly synonymous, but often
interchanged). Prior to Unicode 3.1's introduction of the Han characters
in Plane 2, I'd say that support for Yue Chinese (SIL "YUH"; ~= Cantonese)
was not really usable. With a logosyllabic script, it'll never be
possible to exhaustively check that all its characters are included, but it
looks very usable now--I've had high success rates finding them in
Plane 2, partially due to sourcing from the HKSCS character set (H source)
from Hong Kong, and partially due to sourcing from large dictionaries such
as the _Hanyu Da Zidian_ (G-HZ source) where characters (and the words
they transcribe) have died out in Mandarin, but are preserved in Yue and
other Chinese languages.

However, I'm not so sure what the situation is for other Chinese
languages, other than a vague impression that they are not well
supported--probably the stage that Yue Chinese was at with Unicode 2.1.
For example, U+20547 is used only in Min Chinese (MNP, CFR), meaning 'hard,
durable', with a pseudo-Mandarin reading of dian4. It's in Unicode only
because it happened to be in HKSCS, and to my knowledge that is the only
character set it appears in, perhaps for the use of Chaozhou
speakers ("Chiuchow", "Teochew"), a linguistic minority in Hong Kong
(Chaozhou is a dialect of Minnan Chinese, CFR). U+20547 is also
documented in only a very few dictionaries, none of which was
apparently a source for Unicode. I think any support for Min Chinese at
this point is probably accidental. (FYI, U+20547 looks like U+6709 with
the two center strokes removed and replaced by U+4E36.)
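
For anyone wanting to check that a code point such as U+20547 really sits in Plane 2, and what that means for UTF-16, a small Python sketch follows. The function names plane and utf16_surrogates are hypothetical helpers of mine; the arithmetic is the standard surrogate-pair calculation.

```python
def plane(cp):
    """Return the Unicode plane number (0-16) of a code point."""
    return cp >> 16


def utf16_surrogates(cp):
    """Return the (high, low) UTF-16 surrogate pair for a code point
    above U+FFFF, using the standard surrogate arithmetic."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


cp = 0x20547
high, low = utf16_surrogates(cp)
print("U+%04X is in plane %d" % (cp, plane(cp)))
print("UTF-16 surrogates: %04X %04X" % (high, low))
print("UTF-8 length in bytes:", len(chr(cp).encode("utf-8")))
```

Software that assumes one 16-bit unit per character will see two units here, so a font or input method being available is not by itself enough for such characters to be usable.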

Thomas Chan
tc31@cornell.edu


