RE: Locale ID's again: simplified vs. traditional

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Fri Oct 06 2000 - 12:42:56 EDT


On Thu, 5 Oct 2000, Carl W. Brown wrote:

> I think that we both agree that you can not just lump Chinese into a single
> locale. Would you agree that it is also a multi-dimensional problem, because
> dialect differences and script differences are both factors that affect
> the code points used and the processing of the text, not just the fonts?
>
> The IME is a good example. Some IMEs use radicals but others use phonetic
> systems. If I use an IME that is using a phonetic system such as Pinyin
> then I am tied to a dialect to match sounds to the proper characters. I am
> also tied to script because the different scripts have different characters
> and code points. Therefore without this information I can not select an IME
> with the proper dictionary or code point conversion. Besides pronunciation
> the dictionaries also carry local phraseology and cultural differences.

Yes, even among just "phonetic" IME's there can be a lot of variation (not in
any particular order, nor a definitive list):

1) language/dialect base for readings of characters
2) locale
3) transcription system
4) keyboard layout
5) script
6) encoding and character set

What language/dialect base is one using for the readings of characters? Let's
choose Mandarin, which has the most speakers, native or otherwise. (We'll
ignore Cantonese-based or other inputs in this discussion, although they
have their own place.) But which "standard" of Mandarin are we designing for?
For example, the characters U+5783 and U+573E, which are used to write the word
'garbage', have different readings in China's and Taiwan's standards. You could
tailor your IME to one of the two, or allow both. (You might want to do this
anyway, since some characters normally have multiple readings within a single
standard, like 'blood' U+8840, which can be xie3 or xue4.) If you're nice, you
may also include functionality to tolerate common variation and "errors", so that
non-native speakers, or native speakers who speak a regional form, may use your
product with ease, e.g., by collapsing n- and l-, or -n and -ng.
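
To make the reading question concrete, here is a minimal sketch in Python. The
table layout, the locale tags "zh-CN"/"zh-TW", and the function names are my own
assumptions for illustration, not any real IME's data format; the readings
filled in are only the ones for the three characters mentioned above.

```python
# Hypothetical reading tables, one per "standard". Readings are Pinyin with
# tone digits. Only the characters discussed in the text are filled in.
READINGS = {
    "zh-CN": {"\u5783": ["la1"], "\u573e": ["ji1"], "\u8840": ["xie3", "xue4"]},
    "zh-TW": {"\u5783": ["le4"], "\u573e": ["se4"], "\u8840": ["xie3", "xue4"]},
}


def fuzzy_key(syllable: str) -> str:
    """Collapse n-/l- initials and -n/-ng finals so near-misses compare equal."""
    body = syllable.rstrip("012345")
    tone = syllable[len(body):]
    if body.startswith("l"):
        body = "n" + body[1:]
    if body.endswith("ng"):
        body = body[:-1]
    return body + tone


def candidates(typed, standards=("zh-CN", "zh-TW"), fuzzy=False):
    """Return characters whose reading matches `typed` in any chosen standard."""
    key = fuzzy_key(typed) if fuzzy else typed
    found = []
    for std in standards:
        for char, readings in READINGS[std].items():
            for reading in readings:
                probe = fuzzy_key(reading) if fuzzy else reading
                if probe == key and char not in found:
                    found.append(char)
    return found
```

Passing a single standard tailors the IME to it; passing both allows either
reading. With fuzzy=True, a user who types "na1" for "la1" still reaches U+5783.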

What transcription system is one using, now that we've converted each character
to a phonetic form? For Mandarin, there's Pinyin, a romanized system,
and Bopomofo, a semi-syllabary. There are also a number of less common systems,
like Wade-Giles or Gwoyeu Romatzyh--practically any method of transcription
could be implemented. After that, one must decide on a keyboard layout:
a romanized system like Pinyin may be input normally, or could be abbreviated
to a 2-3 keystroke form; a system like Bopomofo (usually called "Zhuyin input")
has at least three keyboard arrangements from various vendors.
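
As a rough sketch of why the transcription system can be kept separate from the
readings themselves: the same syllable can be stored once and keyed by whichever
system the user types in. The two rows below are hand-entered for illustration
(this is not a general converter), and the field and function names are
hypothetical.

```python
from typing import Optional

# One row per syllable; each column is a transcription system.
# Tones are digits in the romanizations and a tone mark in Zhuyin
# (first tone is unmarked there).
SYLLABLES = [
    {"pinyin": "zhong1", "zhuyin": "\u3113\u3128\u3125", "wade_giles": "chung1"},
    {"pinyin": "guo2", "zhuyin": "\u310d\u3128\u311b\u02ca", "wade_giles": "kuo2"},
]


def to_pinyin(typed: str, system: str) -> Optional[str]:
    """Map input in the given system to the Pinyin key used internally."""
    for row in SYLLABLES:
        if row[system] == typed:
            return row["pinyin"]
    return None
```

Keyboard layout would be one more layer in front of this: a mapping from
keystrokes to the symbols of the chosen system.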

What script is one using? Simplified or Traditional? This will affect the
assortment of characters offered to the user, as does the availability of
a character in whatever character set one is using. (You might want to
provide a pruned list, both so that ultra-obscure characters do not clutter
the user's choices, and because some characters are not used in
the language/dialect base you've chosen: Cantonese-specific characters,
or Japanese and Korean "national characters", have no place there, nor even a
pseudo-Mandarin reading by which to input them.) In a legacy system, encoding
and character set will be another issue: what does all of the above generate as
output?
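
One way to prune by character set, sketched in Python under the assumption that
the candidate list is already in Unicode and the output encoding is one of the
standard codecs Python ships ("gb2312", "big5", "shift_jis"). The function names
are made up for this example.

```python
def encodable(char: str, encoding: str) -> bool:
    """True if `char` can be represented in the target legacy encoding."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def prune(candidates, encoding):
    """Drop candidates the output encoding cannot represent."""
    return [c for c in candidates if encodable(c, encoding)]


# Two forms of the same character: U+8BED (Simplified) and U+8A9E (Traditional).
YU = ["\u8bed", "\u8a9e"]
```

Here prune(YU, "gb2312") keeps only U+8BED, prune(YU, "big5") keeps only U+8A9E,
and a Unicode encoding such as "utf-8" keeps both. Pruning by script or by
obscurity would need separate data; this only covers the encoding question.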

The above only covers the case of character-by-character input. If
one includes functionality to allow conversion of compounds or sentences
to characters (some IME's out there now don't, which makes typing
laborious), with or without specifying tone (another option that an
IME might have: is specifying tone mandatory?), then one will need a
dictionary, and as you remark, the contents of that dictionary will
vary, based on language/dialect base and locale. That is not even
counting supplementary dictionaries for specialized uses like jargon
or technical terms, or the user's private dictionary. (This is like a
spell-checker of sorts.)
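
A small sketch of layered dictionaries with optional tones, in Python. The
layering order (user, then supplementary, then base), the key format (Pinyin
with tone digits), and all names are assumptions for illustration only.

```python
import re

# Hypothetical dictionary layers, searched in priority order.
USER = {}
SUPPLEMENTARY = {"bei4jing3": ["\u80cc\u666f"]}  # 'background'
BASE = {"bei3jing1": ["\u5317\u4eac"]}           # 'Beijing'
LAYERS = [USER, SUPPLEMENTARY, BASE]


def strip_tones(key: str) -> str:
    """Remove tone digits so toneless input can match toned entries."""
    return re.sub(r"[1-5]", "", key)


def lookup(typed: str, layers=LAYERS):
    """Return matching compounds; tones are honored only if the user typed them."""
    toneless = not any(ch.isdigit() for ch in typed)
    results = []
    for layer in layers:
        for key, words in layer.items():
            probe = strip_tones(key) if toneless else key
            if probe == typed:
                for word in words:
                    if word not in results:
                        results.append(word)
    return results
```

Typing "beijing" without tones offers both compounds, with the supplementary
entry first; typing "bei3jing1" narrows it to one. Whether partially toned input
should be accepted is a design choice this sketch does not address.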

Unfortunately, many of the above variables are collapsed in IME's, so
that one might only get a choice between a Mandarin-China-Pinyin-
Simplified-GB2312 IME or a Mandarin-Taiwan-Zhuyin-Traditional-Big5 IME.
For example, in North America, many people want to use "Traditional" (for
various reasons, some of my professors included), but they'd
prefer Pinyin input; however, the choice just isn't there. (Taken to
an extreme, some enlightened software does, through the "magic" of
Unicode, even cross boundaries to allow, say, inputting Chinese language
text in something like Shift-JIS, which is intended for Japanese
language text.)
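
If the variables were kept independent instead, the missing combination would
just be another value of one field. A minimal Python sketch, assuming a
hypothetical ImeProfile record with one field per variable from the two bundled
names above (keyboard layout is left out for brevity):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ImeProfile:
    """One independent setting per variable, rather than a fixed bundle."""
    language: str
    locale: str
    transcription: str
    script: str
    encoding: str

    def label(self) -> str:
        return "-".join(
            [self.language, self.locale, self.transcription,
             self.script, self.encoding]
        )


# The two bundles typically on offer.
BUNDLED = [
    ImeProfile("Mandarin", "China", "Pinyin", "Simplified", "GB2312"),
    ImeProfile("Mandarin", "Taiwan", "Zhuyin", "Traditional", "Big5"),
]

# What the North American user wants: Traditional output, Pinyin input.
WANTED = replace(BUNDLED[1], transcription="Pinyin")
```

WANTED.label() gives "Mandarin-Taiwan-Pinyin-Traditional-Big5", which is in
neither bundle but is a perfectly ordinary point in the space of settings.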

Thomas Chan
tc31@cornell.edu


