> If I have some all-kana documents ..., is there an
> extension of UTF-8 that will allow me to strip off the redundant "this is
> kana" byte from most of the kana?
No.
> After the first few thousand kana, it
> might be like, "Yeah, we get it already! It's kana! It's KANA!! You can
> stop reminding us now!!"
If I decide to emulate the Buddha and fill text files with a million
DEVANAGARI OM symbols in a row, each instance is still U+0950, whether
represented in UTF-16 or UTF-8 (or UTF-32, for that matter).
Stop thinking in terms of bytes and start thinking in terms of
characters.
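
To make the byte/character distinction concrete, here is a minimal sketch in Python (the variable names are only illustrative). It encodes the one character U+0950 in three encoding forms: the byte sequences differ, but each decodes back to the same code point.

```python
# DEVANAGARI OM, written as an escape so the source stays ASCII.
OM = "\u0950"

# Big-endian variants are used so that no byte order mark is prepended.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: OM.encode(name) for name in encodings}

for name, raw in encoded.items():
    # The bytes depend on the encoding form; the decoded character does not.
    code_point = ord(raw.decode(name))
    print(f"{name:10} {raw.hex(' ')}  ->  U+{code_point:04X}")
```

The leading byte of the UTF-8 form is part of how that one character is spelled in UTF-8, not a separate "this is Devanagari" marker that could be dropped.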
For that matter, say you were reading the genetic code:
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;...
Yeah, we get it already! It's methionine! It's METHIONINE!! You can
stop reminding us now!!
A code is what it is.
>
> This goes too for Hebrew, Greek, etc.
What you are looking for is a text compression algorithm. See UTS #6,
A Standard Compression Scheme for Unicode.
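
As a rough illustration of compression layered on top of the encoding, the sketch below runs a made-up all-kana string through Python's zlib module. Note the assumptions: this is general-purpose DEFLATE, not SCSU itself, and the sample text is hypothetical and far more repetitive than real prose, so the size reduction it shows says nothing about real documents.

```python
import zlib

# Hypothetical all-kana sample: two hiragana repeated many times.
kana = "\u304b\u306a" * 5000   # HIRAGANA LETTER KA, HIRAGANA LETTER NA

raw = kana.encode("utf-8")     # every one of these kana takes 3 bytes
packed = zlib.compress(raw)    # DEFLATE stand-in, not SCSU

# Compression is lossless: the characters come back unchanged.
restored = zlib.decompress(packed).decode("utf-8")
assert restored == kana

print("characters:      ", len(kana))
print("UTF-8 bytes:     ", len(raw))
print("compressed bytes:", len(packed))
```

The redundancy is removed by the compressor, as a separate layer, while the text itself stays ordinary UTF-8 once decompressed.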
--Ken
This archive was generated by hypermail 2.1.2 : Mon Mar 04 2002 - 19:57:18 EST