Re: SCSU question

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat May 13 2000 - 18:19:25 EDT


At 12:32 AM 5/12/00 -0800, Vaintroub, Wladislav wrote:

>If I compare (binary) 2 strings encoded in SCSU ,will the result be the
>same as if I compare corresponding
>Unicode strings ?

No. An SCSU encoder has multiple options to choose from, so two different
encoders can produce different byte sequences for the same text. However, if
you run the same text through the same encoder twice, it should result in the
same binary sequence (unless someone deliberately created an encoder with
random elements in it).

For example, it is always possible to not compress at all, by emitting the SCU
tag and then passing all remaining data as UTF-16 (big-endian, with a UQU
quote in front of any code unit whose high byte falls in the Unicode-mode tag
range 0xE0-0xF2). The result is a valid SCSU string, but needless to say most
encoders would not choose this except for Hangul syllables or ideographs.
Therefore, a string produced by the encoder of this example would not compare
equal with the same text produced by most other encoders.
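
To make this concrete, here is a minimal Python sketch of two toy encoders.
It is restricted to printable ASCII so that both outputs are valid SCSU
without any further tags. The function names are hypothetical and do not
come from any SCSU library; the only SCSU facts assumed are the SCU tag value
(0x0F) and that ASCII passes through unchanged in the initial single-byte
mode.

```python
SCU = 0x0F  # single-byte-mode tag: switch to Unicode mode


def _check_ascii(text: str) -> None:
    # Sketch limitation: only printable ASCII, so no other tags are needed.
    if not all(0x20 <= ord(c) <= 0x7E for c in text):
        raise ValueError("this sketch only handles printable ASCII")


def encode_single_byte(text: str) -> bytes:
    """Stay in the initial single-byte mode; ASCII passes through as-is."""
    _check_ascii(text)
    return text.encode("ascii")


def encode_unicode_mode(text: str) -> bytes:
    """Emit SCU, then pass everything as big-endian UTF-16 (no compression)."""
    _check_ascii(text)
    return bytes([SCU]) + text.encode("utf-16-be")


def decode(data: bytes) -> str:
    """Toy decoder that understands only the two forms produced above."""
    if data[:1] == bytes([SCU]):
        return data[1:].decode("utf-16-be")
    return data.decode("ascii")


a = encode_single_byte("SCSU")
b = encode_unicode_mode("SCSU")
print(a.hex(), b.hex())
print(a == b, decode(a) == decode(b))
```

Both byte strings decode to the same text, yet a binary comparison of the
two reports them as different.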

Another problem is that it is conceivable (I haven't tried to construct a
case) that two *different* substrings accidentally result in the same byte
sequence, even when using the same encoder.
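
One way such a case might arise is through the dynamic windows: in
single-byte mode, a byte in the range 0x80-0xFF is an offset into whichever
window is currently active. The Python sketch below is a deliberately
incomplete decoder (the name decode_single_byte is hypothetical) that handles
only pass-through bytes and the SC0..SC7 window-switch tags, assuming the
default window positions from the SCSU specification.

```python
# Default start offsets of the eight dynamic windows (SCSU initial state).
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_single_byte(data: bytes) -> str:
    """Toy decoder: single-byte mode, SCn tags and pass-through bytes only."""
    active = 0  # window 0 is active in the initial state
    out = []
    for b in data:
        if 0x10 <= b <= 0x17:            # SC0..SC7: change active window
            active = b - 0x10
        elif b >= 0x80:                  # offset into the active window
            out.append(chr(DEFAULT_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))           # passed through unchanged
        else:
            raise ValueError(f"tag 0x{b:02X} is not handled in this sketch")
    return "".join(out)


latin = decode_single_byte(b"\xb4")         # window 0 (Latin-1) active
cyrillic = decode_single_byte(b"\x12\xb4")  # SC2 selects the Cyrillic window
print(hex(ord(latin)), hex(ord(cyrillic)))
```

The byte 0xB4 stands for U+00B4 in the first stream and for U+0434 in the
second, so finding identical bytes inside two SCSU streams does not show
that the underlying characters are identical.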

Conversely, substrings that are the same in the source data *will* encode to
different byte sequences in the SCSU-encoded stream if the characters that
come before them leave the encoder in a different state.
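
A minimal Python sketch of this effect, assuming a naive greedy encoder that
stays in single-byte mode and switches among the default dynamic windows with
SC0..SC7 tags only. The name encode_greedy is hypothetical, and a real
encoder would have many more options (quoting, redefining windows, Unicode
mode).

```python
# Default start offsets of the eight dynamic windows (SCSU initial state).
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def encode_greedy(text: str) -> bytes:
    """Toy encoder: single-byte mode, default windows, SCn switches only."""
    active = 0  # window 0 is active in the initial state
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F:           # ASCII passes through unchanged
            out.append(cp)
            continue
        start = DEFAULT_WINDOWS[active]
        if not (start <= cp < start + 0x80):
            # Character is outside the active window: find one that fits.
            for n, candidate in enumerate(DEFAULT_WINDOWS):
                if candidate <= cp < candidate + 0x80:
                    out.append(0x10 + n)  # SCn tag changes the state
                    active = n
                    break
            else:
                raise ValueError(f"U+{cp:04X} is not handled in this sketch")
        out.append(0x80 + cp - DEFAULT_WINDOWS[active])
    return bytes(out)


alone = encode_greedy("\u0434")           # CYRILLIC SMALL LETTER DE by itself
after = encode_greedy("\u044f\u0434")     # the same letter after another one
print(alone.hex(), after.hex())
```

On its own the letter costs two bytes (SC2 tag plus offset), but after
another Cyrillic letter has already switched the window it is encoded as the
single offset byte, so the same source substring appears as different bytes
depending on what preceded it.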

A./

>As far as I understand, this should work for some general scripts ( Latin
>,Greek or Cyrillic and so on).
>What about text with mixed scripts?
>Is there a general rule about SCSU-comparison? Or it will always depend on
>SCSU-implementation?
>
>W.


