From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Dec 01 2003 - 08:43:51 EST
On 01/12/2003 04:25, Philippe Verdy wrote:
> ...
>
>And what about a compressor that would identify the source as being
>Unicode, and would convert it first to NFC, but including composed forms
>for the compositions normally excluded from NFC? This seems marginal but
>some languages would have better compression results when taking these
>canonically equivalent compositions into account, such as pointed Hebrew
>and Arabic.
>
To get an idea of what orders of magnitude we are talking about here:
The Hebrew Bible consists of about 2,881,000 Unicode characters
including accents, or 2,632,000 excluding accents - these figures
include spaces. Of these, about 172,000 are U+05BC dagesh or mapiq,
46,000 are shin dot (U+05C1) and 12,000 are sin dot (U+05C2). All, or
very nearly all, of these can be canonically composed with the
preceding base characters into the presentation forms FB2A-FB4A, thus
saving about 230,000 characters. A significant number of further
combinations could also be composed into FB2E, FB2F and FB4B. So the
Hebrew text could be compressed by around 10% simply by composing it
using characters already defined. This composed version is canonically
equivalent to the uncomposed version, but it is not normalised, because
these characters are in the composition exclusion table and so are
never produced by NFC.
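As a sketch of what such a pre-compression pass might look like (this
is my own illustration, not an existing tool): Python's unicodedata
module exposes the canonical decompositions of the excluded Hebrew
presentation forms, so a composition table can be derived from the
character database itself rather than hardcoded. The compose_excluded
function below applies the standard canonical composition algorithm,
including the blocking rule, but using only these excluded pairs:

```python
import unicodedata

# Build a composition table for the excluded Hebrew presentation forms
# FB2A..FB4B from their canonical decompositions. These pairs are in
# the composition exclusion table, so NFC never produces them.
COMPOSE = {}
for cp in range(0xFB2A, 0xFB4C):
    decomp = unicodedata.decomposition(chr(cp))
    if decomp and not decomp.startswith("<"):  # canonical decompositions only
        base, mark = (chr(int(h, 16)) for h in decomp.split())
        COMPOSE[(base, mark)] = chr(cp)

def compose_excluded(text):
    """Compose base+mark pairs into the excluded presentation forms.

    Follows the canonical composition blocking rule: a combining mark
    may combine with the last starter only if no intervening character
    has a combining class >= its own (so e.g. a dagesh can still reach
    its base letter across an intervening vowel point).
    """
    out = list(unicodedata.normalize("NFD", text))
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:  # found a starter
            last_ccc = 0
            j = i + 1
            while j < len(out) and unicodedata.combining(out[j]) != 0:
                ccc = unicodedata.combining(out[j])
                if last_ccc < ccc and (out[i], out[j]) in COMPOSE:
                    out[i] = COMPOSE[(out[i], out[j])]  # compose, drop mark
                    del out[j]
                    continue
                last_ccc = ccc  # mark blocks later marks of same/lower class
                j += 1
        i += 1
    return "".join(out)
```

For example, bet + dagesh becomes FB31 even when a vowel point sits
between them in canonical order, and shin + dagesh + shin dot composes
in two steps, first to FB49 and then to FB2C. Because every step uses
a canonical composition, normalising the result with NFD recovers the
original decomposed text exactly.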
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Mon Dec 01 2003 - 09:28:45 EST