Re: Worst case scenarios on SCSU

From: [email protected]
Date: Thu Nov 01 2001 - 01:08:40 EST

Next in thread: Asmus Freytag: "Re: Worst case scenarios on SCSU"
Reply: Asmus Freytag: "Re: Worst case scenarios on SCSU"
Reply: [email protected]: "Re: Worst case scenarios on SCSU"

It must be a full moon on Halloween, because here I am in the extremely
unfamiliar position of disagreeing quite strongly with Ken Whistler.

In a message dated 2001-10-31 17:16:25 Pacific Standard Time, [email protected]
writes:

> As current Czar of Names Rectification, I must start protesting
> here. SCSU is a means of *compressing* Unicode text. It is
> not "[an]other method of encoding Unicode characters."

I was about to reply, "Of course it is," before I realized that Ken was
interpreting the word "encoding" in the strictest sense, invoking the
distinction between character encoding forms (CEFs) and transfer encoding
syntaxes (TESs). In some cases this is a worthwhile distinction, but I don't
think it is relevant in the case of David's query, or, for that matter, in
many other cases where users may think of Unicode text being "represented" as
UTF-32, UTF-16, UTF-8, SCSU, ASCII with UCN sequences, or even (God forbid)
CESU-8.

SCSU is indeed another method of "representing" Unicode characters, if not
necessarily "encoding" them in the strict sense of the word.

> And before going on, I'm not clear exactly what you are
> trying to do. SCSU is defined on UTF-16 text. It would, of
> course, be possible to create SCSU-like windowing compression
> schemes that would work on UTF-32 or UTF-8 text, but those are
> not part of UTS #6 as it is currently written.

Like David, I don't see how SCSU is defined on, or limited to, UTF-16 text,
except in the sense that literal or quoted "Unicode-mode" SCSU text is
UTF-16. SCSU is defined on Unicode scalar values, which are not tied to a
particular CEF.

You can define a window in what SCSU calls "the expansion space" using the
SDX or UDX tag and, in the best case, store N characters of Gothic or Deseret
text in N + 3 bytes. None of this has anything to do with surrogates or
16-bitness.
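
To make the N + 3 figure concrete, here is a minimal Python sketch of the best case only. It assumes every character in the input falls inside a single 128-code-point window of the expansion space, and it uses the SDX tag value (0x0B) and argument layout (3 bits of window number, 13 bits of offset index) as I read them in UTS #6. The helper name encode_one_window is hypothetical, and this is in no way a complete SCSU encoder.

```python
SDX = 0x0B  # single-byte-mode tag "define extended window" (per UTS #6)


def encode_one_window(text: str, window: int = 0) -> bytes:
    """Hypothetical helper: best-case SCSU for text that fits in one
    128-code-point window at or above U+10000."""
    if not text:
        return b""
    if not 0 <= window <= 7:
        raise ValueError("dynamic window number must be 0..7")

    # Window start is the first character rounded down to a multiple of 0x80.
    base = ord(text[0]) & ~0x7F
    if base < 0x10000:
        raise ValueError("this sketch only handles the expansion space")

    # SDX carries (base - 0x10000) / 0x80 as a 13-bit index.
    index = (base - 0x10000) >> 7
    out = bytearray([SDX, (window << 5) | (index >> 8), index & 0xFF])

    # Each character then costs one byte: 0x80 plus its offset in the window.
    for ch in text:
        delta = ord(ch) - base
        if not 0 <= delta < 0x80:
            raise ValueError("character falls outside the single window")
        out.append(0x80 + delta)
    return bytes(out)


gothic = "\U00010330\U00010331\U00010332"   # three Gothic letters
encoded = encode_one_window(gothic)
print(encoded.hex())                          # 0b0006b0b1b2
print(len(gothic), len(encoded))              # 3 6
```

The function works purely on scalar values (Python's ord), so at no point does a surrogate code unit appear. The same three characters take 12 bytes in either UTF-16 or UTF-8.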

In a message dated 2001-10-31 17:59:33 Pacific Standard Time, [email protected]
writes:

> I have no quarrel with the claim that the SCSU scheme could be
> implemented directly on UTF-32 data. But as Unicode Technical Standard
> #6 is currently written, that is not how to do it conformantly.

I have looked throughout UTS #6 and cannot find anything, explicit or
implicit, to the effect that SCSU could not be conformantly implemented
against UTF-32 data. Sections 6.1.3 and 8.1 refer to how "surrogate pairs"
may be encoded (*) in SCSU, but if you substitute the phrase "non-BMP
characters" the meaning is identical.

(*) The word "encoded" was taken directly from UTS #6, section 8.1.

> At the moment, if you want to compare SCSU-compressed text
> against the UTF-32 form, you would have to convert the UTF-32
> text to UTF-16, and then compress it using SCSU. You don't
> apply SCSU directly to UTF-32 data.

Why not? The fact that UTS #6 was originally written before UTF-32 was
formally defined has nothing to do with this. The same could be said for
UTF-8, which (like SCSU) has a surrogate-free mechanism for representing
non-BMP characters.
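
As a small illustration of that point, the following Python sketch encodes one non-BMP character (U+10330, chosen arbitrarily) in UTF-32, UTF-16 and UTF-8 and decodes each back to its scalar value. The helper name scalar_values is hypothetical. Only the UTF-16 bytes contain a surrogate pair; the scalar value recovered is the same in all three cases.

```python
ch = "\U00010330"  # a Gothic letter, outside the BMP

utf32 = ch.encode("utf-32-be")
utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

print(utf32.hex())  # 00010330  (the scalar value itself)
print(utf16.hex())  # d800df30  (high surrogate D800, low surrogate DF30)
print(utf8.hex())   # f0908cb0  (four bytes, no surrogates involved)


def scalar_values(data: bytes, encoding: str) -> list[int]:
    """Hypothetical helper: decode bytes and list the Unicode scalar values."""
    return [ord(c) for c in data.decode(encoding)]


# All three representations carry the same scalar value.
print(scalar_values(utf32, "utf-32-be"))  # [66352]
print(scalar_values(utf16, "utf-16-be"))  # [66352]
print(scalar_values(utf8, "utf-8"))       # [66352]
```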

> It seems to me that a rewrite of SCSU would be in order to explicitly
> allow and define UTF-32 implementations as well as UTF-16 implementations
> of SCSU.

I don't see anything that needs rewriting. What are you seeing?

-Doug Ewell
Fullerton, California

This archive was generated by hypermail 2.1.2 : Thu Nov 01 2001 - 02:05:52 EST