InLanguage properties? [Was Re: Encode-InCharset-0.01 Released]

From: Dan Kogai (dankogai@dan.co.jp)
Date: Fri May 03 2002 - 04:52:37 EDT


On Friday, May 3, 2002, at 04:33 , Roman Vasicek wrote:
>> On Friday, May 3, 2002, at 02:41 , Dan Kogai wrote:
>>
>> I have just released Encode-InCharset-0.01, available as
>>
>> http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.
>>
>> I have developed this module primarily to implement ISO-2022-JP-3 and
>> ISO-2022-CN in future. To implement encode() in these, you have to
>> know which character set a given character belongs to. But this module
>> can also be used to check whether a string can safely be encoded
>> (though fallback is much faster).
>>
> Great! Good work.
>
> I have one, maybe off-topic, question. Is there a module which provides
> the same functionality for languages? I mean something like IsGerman,
> IsCzech, etc.

   Be our guest ;) To my knowledge there is none, but it won't be too
hard to implement -- for Roman script languages. You just start with
the ISO-8859 variants and subtract the characters you don't need.
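
   As a rough sketch of that idea (not an existing module), Perl's
user-defined character properties can hold the result of such a
subtraction. The name InCzech, the helper is_czech_word() and the letter
list are my assumptions: a hand-picked subset of the ISO-8859-2
repertoire. Note that this only tests the letter repertoire, so plain
ASCII words from any language pass as well.

```perl
use strict;
use warnings;

# Hypothetical user-defined property (the name must start with In or Is).
# Each entry is one code point, or a tab-separated range, in hex.
sub InCzech {
    return join("\n",
        "0041\t005A", "0061\t007A",                      # A-Z, a-z
        "00C1", "00C9", "00CD", "00D3", "00DA", "00DD",  # capital vowels with acute
        "00E1", "00E9", "00ED", "00F3", "00FA", "00FD",  # small vowels with acute
        "010C\t010F",                                    # C/c and D/d with caron
        "011A\t011B",                                    # E/e with caron
        "0147\t0148",                                    # N/n with caron
        "0158\t0159",                                    # R/r with caron
        "0160\t0161",                                    # S/s with caron
        "0164\t0165",                                    # T/t with caron
        "016E\t016F",                                    # U/u with ring
        "017D\t017E",                                    # Z/z with caron
    ) . "\n";
}

# Hypothetical helper: true if every character is in the Czech letter set.
sub is_czech_word {
    my ($str) = @_;
    return $str =~ /\A\p{InCzech}+\z/ ? 1 : 0;
}
```

Here is_czech_word("Dvo\x{159}\x{E1}k") is true, while a word containing
U+00DF (German sharp s) is rejected.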

   I consider this to be one of the problems of Unicode (as of now). When
you aggregate anything, the origin of each piece is usually lost. It is
just the same as you can't retrieve 1+1 back from 2 (it could be 0+2 or
-1+3 or anything).
   To overcome this shortcoming Unicode does have character properties,
and you can use them to find out which I<script> a character belongs to.
But unfortunately no such property exists for the character set a
character originally came from (so I made one, Encode-InCharset, because
I needed it). Neither is there one for languages.
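
   To illustrate what the script property does and does not tell you,
here is a small sketch using only Perl's built-in \p{...} script
properties. The helper name script_of() and the list of scripts it
checks are my own choices for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of a few scripts a single character
# belongs to, using Perl's built-in Unicode script properties.
sub script_of {
    my ($ch) = @_;
    my @scripts = (
        [ Hiragana => qr/\A\p{Hiragana}\z/ ],
        [ Katakana => qr/\A\p{Katakana}\z/ ],
        [ Han      => qr/\A\p{Han}\z/      ],
        [ Latin    => qr/\A\p{Latin}\z/    ],
    );
    for my $entry (@scripts) {
        my ($name, $re) = @$entry;
        return $name if $ch =~ $re;
    }
    return 'Other';
}
```

script_of("\x{6F22}") gives 'Han', but nothing in that answer says
whether the ideograph came from a Japanese, Chinese or Korean character
set, let alone which language the text is written in.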
   Maybe Encode-InCharset-0.01 can help implement InLanguage, especially
for complex CJK cases. Here is a crude (and possibly incorrect)
definition of InNihongo:

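# The In* charset properties are assumed to be provided by Encode::InCharset.
$InNihongo = qr/(?=                          # must be in a Japanese charset
                    \p{InJISX0213_1} |
                    \p{InJISX0213_2} |
                    \p{InASCII}
                )
                (?:                          # and in one of these scripts
                    \p{Hiragana} |
                    \p{Katakana} |
                    \p{Han}      |
                    \p{InBasicLatin}         # contemporary!
                )/x;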

Notice that it is prefixed with a lookahead on InJISX0213_1 and
InJISX0213_2. Otherwise all Han ideographs that are not used in Japanese
would also be considered Nihongo.
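
   The same intersection (script AND source character set) can be
sketched without Encode-InCharset by letting the core Encode module stand
in for the charset test. This is only an approximation under my own
assumptions: in_charset() and looks_nihongo() are hypothetical helpers,
and 'shiftjis' stands in for the JIS X 0213 planes used above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the \p{InJISX0213_*} properties: true if the character
# can be encoded in the given encoding without fallback.
sub in_charset {
    my ($ch, $enc) = @_;
    my $copy = $ch;    # encode() may modify its source when CHECK is set
    return eval { encode($enc, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: every character must be kana, Han or ASCII
# and must also exist in the Japanese charset.
sub looks_nihongo {
    my ($str) = @_;
    for my $ch (split //, $str) {
        return 0 unless $ch =~ /\A[\p{Hiragana}\p{Katakana}\p{Han}\x00-\x7F]\z/;
        return 0 unless in_charset($ch, 'shiftjis');
    }
    return 1;
}
```

With this, "\x{65E5}\x{672C}\x{8A9E}" passes, while the simplified
Chinese ideograph "\x{4EEC}" is rejected: it is Han, but it is not in
the Japanese character set.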

Dan the Encode Maintainer


