Re: Hanzi trad-simp folding and z-variants

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sat, 08 Jun 2013 06:00:15 -0700

I.
>
> Which and where?
>
> Section 3.7.1 Simplified and Traditional Chinese Variants talks about
> converting between Simplified and Traditional Chinese.
You wrote this
>
> http://www.unicode.org/reports/tr38/ does a good summary of
> the possibilities.
>
in response to my inquiry about "examples of meaning-divergent z-variant
words in modern Mandarin" and appropriate "algorithms and data
structures". Also, the Unihan database doesn't provide collocational
data for T/S conversion.
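
To make the point about collocational data concrete, here is a minimal
sketch (not an implementation of anything in TR38 or Unihan). It assumes
a small hand-written phrase table: PHRASES, CHAR_CANDIDATES and
to_traditional are hypothetical names introduced only for illustration.
The sketch shows that a per-character mapping alone cannot choose between
發 and 髮 for simplified 发, while a word-level table can.

```python
# Illustrative only: tiny hand-written tables, NOT data taken from Unihan.
# Simplified characters and their possible traditional counterparts.
CHAR_CANDIDATES = {
    "发": ["發", "髮"],  # ambiguous: "emit/develop" vs. "hair"
    "后": ["後", "后"],  # ambiguous: "after" vs. "empress"
    "头": ["頭"],
}

# Hypothetical collocation table: words whose traditional form is
# unambiguous once the neighbouring characters are known.
PHRASES = {
    "头发": "頭髮",
    "发展": "發展",
    "皇后": "皇后",
    "后来": "後來",
}


def to_traditional(text):
    """Greedy longest-match conversion using the phrase table.

    Characters not covered by a phrase fall back to the first
    per-character candidate, which is only a guess.
    """
    max_len = max(len(p) for p in PHRASES)
    out = []
    i = 0
    while i < len(text):
        # Try the longest phrase first, down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched: guess from the single character.
            ch = text[i]
            out.append(CHAR_CANDIDATES.get(ch, [ch])[0])
            i += 1
    return "".join(out)
```

With this table, to_traditional("头发") and to_traditional("发展") pick
different traditional characters for the same simplified 发; an isolated
发 can only be guessed.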

II.

> simplification is also found in for example Japanese CJK ideographs
> which is documented
Contextual conversion (and shifting/"transposition") is essentially not
an issue in this context, even though you find the odd case of deviation
here and there.

> Some dialects such as Cantonese are quite well documented
[and]
> There is an increased interest in such things in recent years. One
> persons 'hand-tuned' of today can become the basis of a standard of
> tomorrow.

1a. I'd say I have a decent grasp of the topic of lexical variation for
written Cantonese, based on a decent amount of fieldwork. (While we're
at it, I also know at least one researcher with an interest in
standardization of Cantonese spelling.) I'm certain that lexical
variation in Cantonese is not well documented, though there are a number
of sources from which you can piece something together yourself.
1b. Keep in mind that most materials in electronic form (originally
written in this form or digitized) don't use the "best" character
choices – needless to say, this is even truer for other Sinitic
languages.
2. This is entirely unrelated to the question of whether one can or
should describe simplified characters as "abbreviated". There is a
connection to your statement about things being on a sliding scale (you
used the word "relative"), but for Cantonese this translates into a lot
of inconsistency among the following: genuine Cantonese (C) spelling, a
Mandarin (M) substitute, a C-based phonetic transcription, ad-hoc usage
with the mouth radical or a prefixed roman "o", an English-based informal
transcription in Latin letters, and avoidance. Whether this is
electronically manageable in principle depends on whether you include
entirely romanized blogs (which I wouldn't recommend), but – in any case
– anything other than liberal QE (query expansion) will /not/ work. (I
might previously have misused the word "folding" to mean "conversion".)
3. Other Sinitic languages are essentially not at all standardized
(we're talking Chinese characters here, not romanizations). Last time I
checked it seemed like Taiwanese is a total mess, and Shanghainese has a
(mainland-CN) researcher who is (still) writing a dictionary to actually
find or document written representations of all syllable-"morphemes" to
capture all of SHnese. The best SHnese textbook was published a couple
of years ago in HK and uses traditional characters (!) to represent
modern SHnese.
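
The "liberal QE" mentioned in point 2 can be sketched as follows. This
is a minimal illustration, not a description of any existing system: the
VARIANTS table is a small hand-picked assumption covering a few commonly
seen spellings (the Cantonese character, a bare substitute character, a
prefixed roman "o", and a Latin-letter form), and expand_query is a
hypothetical helper name.

```python
from itertools import product

# Illustrative variant sets, hand-picked for this sketch. They are an
# assumption, not a standard and not exhaustive.
VARIANTS = {
    "嘅": ["嘅", "既", "o既", "ge"],
    "啲": ["啲", "的", "o的", "D"],
    "咗": ["咗", "左", "o左"],
}


def expand_query(tokens):
    """Return every combination of variant spellings for a tokenized query.

    Tokens without an entry in VARIANTS are kept unchanged.
    """
    options = [VARIANTS.get(tok, [tok]) for tok in tokens]
    return ["".join(combo) for combo in product(*options)]
```

For example, expand_query(["我", "嘅", "書"]) yields four search strings,
one per listed spelling of 嘅. The number of expansions grows
multiplicatively with each variable token, which is one reason a real
system would likely prune or rank them.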

Stephan
Received on Sat Jun 08 2013 - 08:03:45 CDT
