Re: Compressing Vietnamese with SCSU

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 18 2000 - 17:43:25 EDT

Next message: Mark Bishop: "key unicode elements to have in new book"
Previous message: Linus Toshihiro Tanaka: "Re: Compressing Vietnamese with SCSU"
Maybe in reply to: Doug Ewell: "Compressing Vietnamese with SCSU"
Next in thread: Doug Ewell: "Re: Compressing Vietnamese with SCSU"

Linus Tanaka responded to Doug Ewell:
>
> > I tested my newly written encoder with a fairly large Vietnamese file
> > (59,871 characters) and came up with the following results:
> >
> > VISCII: 59,871 bytes
> > UTF-16: 119,744 bytes (including BOM)
> > UTF-8: 79,269 bytes
> > SCSU: 67,781 bytes
>
> VISCII utilizes some codepoints in 0x00 - 0x1F. Have you checked the
> size in another Vietnamese encoding that doesn't utilize that area?
>

I think it would also be appropriate to check Windows Code Page 1258,
which uses a hybrid strategy: precomposed characters for the base
vowels, plus combining marks for the tones. If you convert Windows
1258 data directly to Unicode, you avoid all the U+1Exx code points and
might get better behavior with SCSU. (Although you also start out with
more voluminous data, since the tones are separately represented.)
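
To make the trade-off concrete, here is a minimal Python sketch of the hybrid representation. It assumes the input is precomposed (NFC) Unicode text; the helper name split_tones and the TONE_MARKS set are illustrative, not taken from any library, and real Code Page 1258 data may keep a few letters such as "à" precomposed, whereas this sketch always splits the tone off. It only shows that the result round-trips through Python's cp1258 codec, contains nothing in U+1E00..U+1EFF, and is longer than the precomposed form.

```python
import unicodedata

# The five Vietnamese tone marks that Code Page 1258 stores as combining
# characters: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def split_tones(text):
    """Hypothetical helper: keep the base vowel precomposed (e.g. U+00EA)
    and emit the tone as a separate combining mark after it."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        tones = [c for c in decomposed if c in TONE_MARKS]
        base = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose the vowel with its circumflex, breve, or horn (if any),
        # then append the tone mark(s) that were pulled out.
        out.append(unicodedata.normalize("NFC", base) + "".join(tones))
    return "".join(out)

def uses_1exx(text):
    """True if any code point falls in U+1E00..U+1EFF."""
    return any(0x1E00 <= ord(c) <= 0x1EFF for c in text)

precomposed = "Vi\u1ec7t Nam"          # U+1EC7 is e with circumflex and dot below
hybrid = split_tones(precomposed)      # base U+00EA followed by U+0323

print(len(precomposed), uses_1exx(precomposed))   # 8 True
print(len(hybrid), uses_1exx(hybrid))             # 9 False
print(hybrid.encode("cp1258").decode("cp1258") == hybrid)   # True
```

The extra character per toned vowel is the "more voluminous data" cost; whether SCSU recovers that cost depends on the encoder and the text, which the sketch does not measure.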

--Ken


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT