Re: SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Jul 15 2005 - 02:50:49 CDT

Next message: Johannes Bergerhausen: "design prototype: the ultimate unicode keyboard?"

Previous message: Richard Wordingham: "RE: Arabic encoding model (alas, static!)"
In reply to: Richard Wordingham: "SCSU/BOCU-1 Compressibility of the Yi syllabary"
Next in thread: Richard Wordingham: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"
Reply: Richard Wordingham: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"

Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

>> If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul,
>> or Yi syllables, you have no chance of doing better than 2 bytes per
>> character. This is because it is not possible in SCSU to set a
>> dynamic window to any range between U+3400 and U+DFFF, where these
>> characters reside. Such a window would be of little use anyway,
>> because real-world texts using these characters would draw from so
>> many windows that single-byte mode would be less efficient than
>> Unicode mode, where 2 bytes per character is the norm. Of course,
>> this is still better than UTF-32 or UTF-8 for these characters.
>
> Has there been any investigation of how badly the Yi syllabary would
> compress under SCSU if dynamic windows were available for it? Actual
> BOCU-1 results might give a good indication. With only 0x4C7
> syllables, Yi might perform better than one might expect. Possible
> reasons for improvement might be:
>
> 1) Both syllables of alliterative compounds would often be in the same
> SCSU (or BOCU-1) window.

SCSU does not allow the setting of a dynamic window anywhere within the
Yi range (U+A000 through U+A4C6). The only way to encode Yi text in
SCSU is to use "Unicode mode," encoding each character in 2 bytes (MSB,
LSB). This is stated in the text you quoted.
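
To make the restriction concrete, here is a minimal Python sketch of how an SCSU offset byte maps to the start of a dynamic window, following the window offset table in UTS #6 as I understand it. The function names (dynamic_window_offset, window_available) are illustrative only and do not come from any library, and the sketch covers the BMP only (extended windows for supplementary characters are left out).

```python
# Sketch of the SCSU dynamic window offset table (BMP only).
# Function names are hypothetical, chosen for this illustration.

# Special fixed offsets selected by offset bytes 0xF9..0xFF.
_FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + part of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def dynamic_window_offset(x):
    """Return the window start for offset byte x, or None if x is reserved."""
    if 0x01 <= x <= 0x67:
        return x * 0x80            # U+0080 .. U+3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00   # U+E000 .. U+FF80
    return _FIXED_OFFSETS.get(x)   # 0x00 and 0xA8..0xF8 are reserved


def window_available(code_point):
    """True if some dynamic window (128 code points wide) can cover code_point."""
    for x in range(0x100):
        start = dynamic_window_offset(x)
        if start is not None and start <= code_point < start + 0x80:
            return True
    return False


if __name__ == "__main__":
    for name, cp in [("Yi", 0xA000), ("Canadian Syllabics", 0x1400),
                     ("Ethiopic", 0x1200), ("CJK ideograph", 0x4E00)]:
        print(f"{name} U+{cp:04X}: window available = {window_available(cp)}")
```

Under these assumptions nothing between U+3400 and U+DFFF is reachable, which is why Yi falls back to Unicode mode.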

It's possible that some sequences of Yi might benefit from being
encodable in a dynamic window, but since it is not possible to do so,
the point is moot.

> 2) Any leakage of ASCII into Yi in single-byte mode would result in
> the ASCII being encoded at one byte per character, rather than two
> bytes per character.

Sufficiently long sequences of ASCII characters might justify a switch
out of Unicode mode into single-byte mode, provided the bytes saved
outweigh the cost of the mode-switching tags.
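
As a rough illustration of where that break-even point lies, the Python sketch below compares the two costs for a run of ASCII embedded in Yi text. It assumes the run contains only characters that pass through single-byte mode unquoted (U+0020 through U+007F plus tab, LF, and CR), that one UCn tag byte leaves Unicode mode, and that one SCU tag byte returns to it. The function names are made up for this example.

```python
# Back-of-the-envelope cost model for an ASCII run inside Unicode-mode text.
# Assumptions (labeled, not taken from any encoder): the run needs no
# quoting in single-byte mode; UCn and SCU are one byte each.

def unicode_mode_cost(n):
    """Bytes needed to keep n ASCII characters in Unicode mode (MSB, LSB)."""
    return 2 * n


def single_byte_cost(n, return_to_unicode=True):
    """Bytes needed to switch to single-byte mode for n ASCII characters."""
    cost = 1 + n          # one UCn tag, then one byte per character
    if return_to_unicode:
        cost += 1         # one SCU tag to resume Unicode mode
    return cost


def min_run_worth_switching(return_to_unicode=True):
    """Smallest run length for which switching is strictly cheaper."""
    n = 1
    while single_byte_cost(n, return_to_unicode) >= unicode_mode_cost(n):
        n += 1
    return n


if __name__ == "__main__":
    print(min_run_worth_switching())        # run followed by more Yi text
    print(min_run_worth_switching(False))   # run at the very end of the text
```

Under this model a run of two ASCII characters followed by more Yi breaks even (4 bytes either way), so the switch only pays off from three characters upward. A real encoder would also need lookahead to know how long the run is.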

> I'd be happy to do the analysis myself if someone could point me to
> representative Unicode-encoded texts. (I'd do the SCSU test by
> transposing the scalar values from A000 onwards to 2200 onwards.) Of
> course, the quality of a SCSU compressor could make a big difference
> with a script like the Yi syllabary. For example, a simple tweak to
> my SCSU encoder improved Inuktitut (Canadian Aboriginal Syllabics)
> performance from 1.54 to 1.49 bytes per character, and my encoder
> deliberately keeps its state small - one byte look-ahead and no
> statistics.

This is different, because a SCSU window can be set to the Canadian
Syllabics range. Likewise for Ethiopic.
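
For the hypothetical experiment you describe, the transposition itself is simple. The Python sketch below is only an illustration under stated assumptions: it shifts U+A000 through U+A4C6 down to U+2200 onward, as you proposed, and counts how many 128-code-point windows a text touches. The names transpose and windows_touched are invented for this example, and the window count is a crude proxy, not a measurement of any real SCSU encoder.

```python
# Sketch of the proposed transposition of Yi into a window-addressable range.
# Names and the window-counting heuristic are hypothetical.

YI_START, YI_END = 0xA000, 0xA4C6
SHIFT = YI_START - 0x2200   # move U+A000 to U+2200


def transpose(text):
    """Shift Yi code points down by SHIFT; leave everything else alone."""
    return "".join(
        chr(ord(c) - SHIFT) if YI_START <= ord(c) <= YI_END else c
        for c in text
    )


def windows_touched(text):
    """Count distinct 0x80-aligned windows used by non-ASCII characters."""
    return len({ord(c) // 0x80 for c in text if ord(c) >= 0x80})


if __name__ == "__main__":
    sample = "\ua000\ua001\ua4c6 abc"
    moved = transpose(sample)
    print([f"U+{ord(c):04X}" for c in moved])
    print(windows_touched(moved))
```

Note that the transposed range U+2200 through U+26C6 spans ten 0x80-aligned windows, while SCSU provides eight dynamic window slots, so an encoder would still have to redefine windows on running text.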

--
Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Fri Jul 15 2005 - 02:52:33 CDT