Re: Hanzi trad-simp folding and z-variants

From: john knightley <john.knightley_at_gmail.com>
Date: Sun, 9 Jun 2013 15:48:57 +0800

On Sun, Jun 9, 2013 at 1:26 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:

>
> Though there is some confusion as to what other questions are being
> discussed here.
>
> I think I misused the expression "folding" at some point. But the original
> query explicitly asked about "do[ing] traditional to simplified folding for
> indexing and query processing (*when the mapping is unambiguous*)" (emph
> added) so I wasn't really sure where parts of the discussion were going :-)
>
>
No problem.
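
To make "folding when the mapping is unambiguous" concrete, here is a minimal sketch. The table is hand-written for illustration only and is not extracted from Unihan; the names TRAD_TO_SIMP and fold_unambiguous are hypothetical. A character is folded only when it has exactly one simplified candidate, and is otherwise left untouched.

```python
# Illustrative table only (an assumption, NOT extracted from Unihan):
# each traditional character maps to the set of simplified forms it may take.
TRAD_TO_SIMP = {
    "廣": {"广"},
    "門": {"门"},
    "乾": {"干", "乾"},  # 乾燥 becomes 干燥, but 乾坤 keeps 乾: ambiguous
}


def fold_unambiguous(text: str) -> str:
    """Fold traditional to simplified only where exactly one candidate exists."""
    out = []
    for ch in text:
        candidates = TRAD_TO_SIMP.get(ch)
        if candidates is not None and len(candidates) == 1:
            # Single candidate: the fold is safe without looking at context.
            out.append(next(iter(candidates)))
        else:
            # Unknown or ambiguous character: leave it as written.
            out.append(ch)
    return "".join(out)
```

For indexing and query processing the same function would be applied to both the indexed text and the query, so that ambiguous characters simply match themselves.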

>
> Japanese has well-established traditions for simplifying CJK ideographs,
> which are not identical to the Chinese ones. If one were to use a folding
> approach to deal with simplifications, then there should be differences
> between Chinese and Japanese.
>
> I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the
> entries of 廣 (U+5EE3) and the characteristically Japanese character 広
> (U+5E83).) I know that certain contexts retain older forms (KenL talks
> about this somewhere too). Btw if you know about other mappings or good
> resources, I'll be curious to know.
>

No, but of course I am also interested to know what is available.
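
As a rough sketch of why a single folding table does not serve both languages, the example from the quoted text can be put into per-locale tables. The table contents, the locale keys and the function name fold are assumptions for illustration; a real table would need a proper source for the kyūjitai-shinjitai pairs.

```python
# Hypothetical per-locale folding tables, one entry each for illustration.
FOLD_TABLES = {
    "zh": {"廣": "广"},  # Chinese simplified form
    "ja": {"廣": "広"},  # Japanese shinjitai form (U+5E83)
}


def fold(text: str, locale: str) -> str:
    """Fold using the table for the given locale; unknown locales fold nothing."""
    table = FOLD_TABLES.get(locale, {})
    # str.maketrans accepts a dict whose keys are single characters.
    return text.translate(str.maketrans(table))
```

The same input character 廣 (U+5EE3) folds to a different target depending on the locale, which is the difference being discussed.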

> "quite well documented" is a relative term
>
> I highly respect the work in Cheung & Bauer, but it makes no attempt to
> tell us how easily understood the characters are. Many of them are ad-hoc
> coinages that are not understood by any of my informants; sometimes for say
> 6 ways of writing a syllable-morpheme, I can make my informants tell me
> that perhaps *one* of them is passable. This problem isn't easily solved,
> but then the source isn't helpful in knowing which out of the approx 1000
> characters are actually used nowadays. I won't give you a number, as I'd
> have to check more carefully to be quotable. The number of morphemes for
> which there truly seems to be no written representation is *very* low,
> but often the characters in existence aren't exactly comprehensible to many
> native speakers either, and not all of them are unambiguous. This will give
> you an idea.
>
>
   It documents 1,095 different Cantonese characters. Familiarity with a
writing system makes the "non-obvious" parts comprehensible, as can
context. Some Cantonese characters, like Sawndip characters, tend by their
construction to be ambiguous: a character often means "something which
sounds like this known character", and therefore the meaning must be learned.

> Zhuang Sawndip
>
> Sounds exciting.
>
>
Yes, no shortage of new material to get one's teeth into.

>
> By best choice do you mean (a) the person producing the electronic
> form was unable to use the character they wished because it is not yet
> in Unicode, (b) even though it is in Unicode, the person did not know how
> to type it and so typed another character instead, or (c) a less than
> perfect, or ambiguous, 'spelling'? All of these are
> found both for Sinitic languages and non-Sinitic languages when written in
> CJK ideographs, be it printed publications, web-pages or text messages
> between native speakers.
>
> Nearly all of Cantonese is in Unicode and therefore typeable in theory
> (though some people will not be used to such writing, but I'm sure you know
> this), so it's not (a). I would say it's largely (c) (people will often
> make up their own plausible thing), even though (b) is a reason too.
>
>
   Many smart phones, whilst having the infrastructure, lack either the IME
or the font for Cantonese characters in the SIP.

    For Zhuang Sawndip, Unicode support is very lacking at present: on
average over 10% of the text on a page uses characters not yet in Unicode
(a), and about 2% of the text comes from the SIP, so typing is often a
challenge for many (b).
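
Percentages of this kind can be estimated from an electronic text with a few lines of code. The sketch below assumes that characters not yet in Unicode are stored as Private Use Area code points, which is only one possible convention; the name plane_shares is hypothetical.

```python
# Private Use Area ranges (BMP PUA plus planes 15 and 16).
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def plane_shares(text: str) -> dict:
    """Return the share of non-space characters in the SIP and in the PUA."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return {"sip": 0.0, "pua": 0.0}
    # SIP = Supplementary Ideographic Plane, U+20000..U+2FFFF.
    sip = sum(1 for ch in chars if 0x20000 <= ord(ch) <= 0x2FFFF)
    # Assumption: unencoded characters appear as PUA code points.
    pua = sum(
        1 for ch in chars
        if any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)
    )
    return {"sip": sip / len(chars), "pua": pua / len(chars)}
```

Text in which unencoded characters were replaced by images or by substitute characters would need a different measure.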

>
> Not standardized does not mean totally beyond analysis or processing,
> or even necessarily that confusing to a native speaker. The forms are not
> random, though admittedly more complex than in a standardized locale.
>
> Yes. And we both agree that standardization is desirable.
>
>
Yes.

John

> Stephan
>
>
Received on Sun Jun 09 2013 - 02:52:28 CDT