Re: Compressing Vietnamese with SCSU

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 18 2000 - 17:43:25 EDT


Linus Tanaka responded to Doug Ewell:
>
> > I tested my newly written encoder with a fairly large Vietnamese file
> > (59,871 characters) and came up with the following results:
> >
> > VISCII: 59,871 bytes
> > UTF-16: 119,744 bytes (including BOM)
> > UTF-8: 79,269 bytes
> > SCSU: 67,781 bytes
>
> VISCII uses some code points in 0x00 - 0x1F. Have you checked the
> size in another Vietnamese encoding that doesn't use that area?
>

I think it would be appropriate also to check Windows Code Page 1258,
which uses a hybrid strategy of precomposed characters for the base
vowels, plus combining marks for the tones. If you convert Windows
1258 data directly to Unicode, you avoid all the U+1EXX code points
(the precomposed Vietnamese letters), and might get better behavior
with SCSU. (Although you also start out with more voluminous data,
since the tones are separately represented.)
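
To make the trade-off concrete, here is a rough Python sketch of what
that hybrid form looks like in Unicode. It is only an approximation
built on the standard unicodedata module, not an actual Code Page 1258
converter: the names to_hybrid and count_1exx are made up for this
illustration, and a real converter may differ in which letters it
leaves precomposed.

```python
import unicodedata

# Combining grave, acute, tilde, hook above, and dot below: the five
# Vietnamese tone marks. (Assumption: these are the only marks split off.)
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def to_hybrid(text):
    """Keep vowel-quality marks precomposed; emit tones as combining marks."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        rest = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose the base letter with its circumflex, breve, or horn,
        # then append the tone as a separate combining character.
        out.append(unicodedata.normalize("NFC", rest) + tones)
    return "".join(out)

def count_1exx(text):
    """Count characters in the U+1E00..U+1EFF block."""
    return sum(1 for c in text if 0x1E00 <= ord(c) <= 0x1EFF)

sample = "Ti\u1ebfng Vi\u1ec7t"   # "Tieng Viet" with precomposed letters
hybrid = to_hybrid(sample)
print(len(sample), count_1exx(sample))   # fully precomposed form
print(len(hybrid), count_1exx(hybrid))   # hybrid form, tones separate
```

For this short sample the hybrid form has no U+1EXX characters left,
but it grows from 10 to 12 characters, which is the "more voluminous"
cost mentioned above. Whether SCSU comes out ahead on a real file
would still have to be measured.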

--Ken


