Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Sun May 27 2001 - 21:03:37 EDT


I don't want to argue about this lengthy email, but only to point out two facts:

>According to the proposal, UTF-8S and UTF-32S would not have the same
>status: they wouldn't be for interchange; they'd just be for representation
>internal to a given system, like UTF-EBCDIC (which, I think I heard, has
>not actually been implemented by IBM in any live systems).
Actually, Oracle has supported UTF-EBCDIC since Oracle 8i, and it provides
Unicode support on EBCDIC platforms easily, with nearly no coding rework in
the multibyte code path.

>The main point seems to be that Oracle et al want to maintain Premise B,
>presumably because they think it would be easier. Yet I think I've shown
>that it doesn't, both because it creates new encoding forms to deal with,
>and because we still have to deal with the reality that the original
>encoding forms still exist. Now, I'm not a database developer, so I need to
>be careful since I can't presume to know the particular implementation
>needs of such environments. But it seems to me that we've lived without
>Premise B in the past, and that it won't benefit us to adopt it now. Why
>bother with it? Why not continue doing what we already know how to do?
As a matter of fact, surrogates and supplementary characters were not defined
in the past, so we could live without Premise B in the past. But now that
supplementary characters are defined and will soon be supported, we have to
bother with it.
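To make the difference concrete: a supplementary character such as U+10000
takes four bytes in standard UTF-8 but six bytes under the proposed
surrogate-based form. A minimal Python sketch (the `utf8s` helper is
illustrative only, not any shipping API; it follows the surrogate-pair scheme
described in the quoted proposal):

```python
# Illustrative sketch only: "utf8s" is a made-up name for the proposed form.
# Each UTF-16 code unit, surrogates included, is encoded as an independent
# 1- to 3-byte UTF-8-style sequence.

def utf16_units(text):
    """Return the UTF-16 code units of a string, surrogates included."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def utf8s(text):
    """Encode each UTF-16 code unit as a separate UTF-8-style sequence."""
    out = bytearray()
    for u in utf16_units(text):
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | u >> 6, 0x80 | u & 0x3F])
        else:  # BMP code units and lone surrogates: three bytes each
            out += bytes([0xE0 | u >> 12,
                          0x80 | (u >> 6) & 0x3F,
                          0x80 | u & 0x3F])
    return bytes(out)

supp = "\U00010000"  # first supplementary-plane character
bmp = "\uE000"       # first BMP character above the surrogate range

# Standard UTF-8 sorts the supplementary character high (F0.. > EE..):
assert supp.encode("utf-8") > bmp.encode("utf-8")
# UTF-16 sorts it low, because surrogates (D800..DFFF) precede E000:
assert supp.encode("utf-16-be") < bmp.encode("utf-16-be")
# The surrogate-based form reproduces the UTF-16 order (ED.. < EE..):
assert utf8s(supp) < utf8s(bmp)
```

For U+10000 this yields ED A0 80 ED B0 80 (six bytes) versus F0 90 80 80
(four bytes) in standard UTF-8, which is where the ordering difference
comes from.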

Regards,
Jianping.

Peter_Constable@sil.org wrote:

> >If you think something abominable is happening, please raise a loud voice
> >and flood UTC members with e-mail and tell everyone what you think and why
> >you think it. Nobody can hear you when you mumble.
> >
> >And it helps if you have solid technical and philosophical arguments to
> convey.
>
> Well, I wasn't going to elaborate (just been through this elsewhere)...
>
> The Unicode flavour of UTF-8 only allows for sequences of up to four code
> units in length to represent a Unicode character, in contrast to ISO's six,
> the difference having to do with Unicode having limited the codespace to
> U+10FFFF (whereas ISO 10646 formally includes a codespace up to U+7FFFFFFF,
> but will be effectively restricting use to U+10FFFF).
>
> BUT...
>
> >There was another abomination proposed. Oracle, rather than adding UTF-16
> >support, proposed that non-plane-0 characters be encoded to and from UTF-8
> >by encoding each surrogate of the pair as a separate UTF-8 character.
>
> Yes, Oracle, PeopleSoft and SAP submitted a proposal to UTC to sanction
> another encoding form, UTF-8S, that would encode supplementary-plane
> characters as six bytes, three for each of the UTF-16 high and low
> surrogates. The rationale had to do with having an 8-bit encoding form
> that would "binary" sort in the same way as UTF-16.
>
> (Warning: This gets a bit long. I'm doing this because I was advised not to
> mumble but to speak up. :-)
>
> The issue is this: Unicode's three encoding forms don't sort in the same
> way when sorting is done using that most basic and
> valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
> binary values of code units. The three give these results:
>
> UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
>
> This is seen by the proposers to be a problem: if you have data from one
> source in UTF-8 and another in UTF-16 and you sort the two, you'd like to
> be able to compare results from each source and know that you're sorting
> things that are comparable. By using a UTF-8 variation ("UTF-8S" in which
> supplementary-plane characters are mapped first to UTF-16 surrogates and
> from there to 8-bit code unit sequences), then the resulting ordering is:
>
> UTF-8S: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
>
> The implication would be that we'd have *two* 8-bit encoding forms.
>
> A suggestion was made on the unicore list by one of the proponents that all
> encoding forms should "binary sort" in the same way; I found it surprising
> that they were proposing a variant of UTF-8 rather than UTF-16, since
> UTF-16 was the odd one out, and tweaking UTF-8 still leaves one encoding
> form that "binary sorts" differently: UTF-32. Well, I was aghast when I
> read the actual proposal to see that, not only are they suggesting that we
> have a second 8-bit encoding form, "UTF-8S", but they also want to have
> another 32-bit encoding form "UTF-32S". So, we'd end up with a total of 5
> encoding forms: UTF-8, UTF-8S, UTF-16, UTF-32, UTF-32S.
>
> According to the proposal, UTF-8S and UTF-32S would not have the same
> status: they wouldn't be for interchange; they'd just be for representation
> internal to a given system, like UTF-EBCDIC (which, I think I heard, has
> not actually been implemented by IBM in any live systems).
>
> What I don't get is this: if you want to implement something just inside
> your own system and you say you'll make sure nobody else ever sees it,
> why do you need UTC to sanction it in any way?
>
> The crux of the justification offered by Oracle et al is this argument,
> which appears to me to be fallacious:
>
> <quote>
> Specifically, any system that must deal with the indexing and comparison of
> large collections of data across multiple encodings will run into the issue
> that data that is ordered based on the binary values in one encoding will
> no longer be ordered such once transformed into another encoding. While
> this lack of binary ordering compatibility across different encodings is
> very true and well-understood in the world of legacy encodings (such as a
> transcode of Shift-JIS to EUC-JP), given that all the Unicode
> Transformation Forms are maintained by a single committee, it should be
> possible to come up with a common binary order between each of the three
> main Unicode Transformation Forms.
> </quote>
>
> Summarising:
>
> Premise A: the three main Unicode encoding forms are maintained by a single
> committee
> Claimed implication C: it should be possible to come up with a common
> binary order between each of the three Unicode encoding forms
>
> There is a missing and implied premise that is needed to make the
> implication work:
>
> Premise B: encoding forms maintained by a single committee should all yield
> a common binary order.
>
> This argument seems to me to be faulty in at least two ways:
>
> First, it is clearly counterexemplified by existing situations
> - e.g. existing Microsoft codepages (euro binary sorts before ellipsis in
> cp1252 but after it in cp1251)
> - I'm sure it wouldn't be hard to produce counterexamples from the work of
> JTC1/SC2
>
> Now, I think Oracle et al would offer as a rebuttal that Premise B was not
> true in the past, but that it should be for Unicode. They offer no real
> argumentation as to why this should be the case, though. They simply assume
> that this will be easier for everyone (not at all obvious). I'll revise
> Premise B to reflect this:
>
> Premise B (rev'd): Encoding forms maintained by UTC should all yield a
> common binary order.
>
> This is important, and will come back up later.
>
> Secondly, the argument presupposes that the desired result as described in
> C is (i) possible and (ii) achieved by their proposal. Their proposal does
> not cause the three main encoding forms to yield a common binary order;
> what their proposal does is introduce two new encoding forms (and give them
> a somewhat ambiguous status) that will share a common binary order with
> UTF-16. The existing encoding forms remain, and continue not to share a
> common binary order. I think it is self-evident that the desired result
> can, in fact, only be achieved in one of the following ways:
>
> Drastic Measure 1: make UTF-16 obsolete; replace it with UTF-16a which
> binary sorts supplementary-plane characters after U+E000..U+FFFF
>
> Drastic Measure 2: make UTF-8 and UTF-32 obsolete; replace them with
> UTF-8S and UTF-32S.
>
> These are, of course, impossible (and fortunately Oracle et al are not
> proposing either of these).
>
> Now, I've argued against a strict interpretation of what they said. Let's
> consider the spirit of what they're saying: to store data internally using
> UTF-8S or UTF-32S so that they can compare binary-sort results with data
> sources encoded using UTF-16. The proposal involves encoding forms with
> very ambiguous, quasi-official status:
>
> <quote>
> This paper... proposes to add at least one optional additional UTF, in the
> form of a Unicode Technical Report (UTR). This form could be implemented
> by system designers where the benefit of a common binary ordering across
> different UTFs is important from a performance standpoint, or for other
> reasons. The new UTF(s) would have equivalent standing as the UTF-EBCDIC
> transformation currently maintained in UTR#16. It is not proposed that the
> new transformation form(s) become Standard Annexes (UAX), nor would they be
> proposed for inclusion in ISO 10646.
> </quote>
>
> For sake of argument, I'll ignore this for the moment. They offer some
> usage scenarios; I'll quote an excerpt of only the first (the other adds
> nothing new to the argument for or against):
>
> <quote>
> UTF-8 database server vs. UTF-16 database client
>
> A SQL statement executed on the database server returns a result set
> ordered by the binary sort of the data in UTF-8, given that this is the
> encoding of both data and indexes in the database.
>
> A C/C++ or Java UTF-16-based client receives this result set and must
> compare it to a large collection of data stored locally in UTF-16...
> </quote>
>
> This has to assume a closed system in which the server and client are
> proprietary solutions using proprietary protocols for their interaction. I
> say this because both are assuming Unicode is always binary sorted in an
> order that results from UTF-16, and that's a proprietary assumption. To
> make it otherwise either would require obsoleting UTF-8 and replacing it
> with UTF-8S, or else would require making UTF-8S an *official Unicode
> standard* protocol. So, they can't waffle on the status. If they don't want
> real official standard status for this, then so much for open solutions in
> which my client can talk to your server, or vice versa.
>
> If they want to do this in a closed system, they can already just go ahead
> and do it; they don't need UTC to give permission for what they do inside
> their own systems. By proposing that this be documented and given a name,
> evidently they want to be able to share the assumptions involved with
> others, i.e. do this in an open context. Thus, even if they don't call it a
> "standard Unicode encoding form", they're trying to treat it as such. So,
> it seems to me that this proposal really is asking us to create new,
> standardised encoding forms that need to be documented as UAXs. Either that
> or to adopt Drastic Measure 1 or 2. (I don't think DM1/2 would be
> considered for a moment by anybody, and Oracle et al explicitly rule that
> out in their proposal.)
>
> So, we're left with them asking us all to adopt a couple of additional
> standard encoding forms. Do we really want five encoding forms (and eleven
> encoding schemes)?
>
> Even if we go along with this, there's still a problem: That UTF-8 DB
> server in the scenario above (assuming a non-closed system) might be using
> true UTF-8, or UTF-8S. (I'm sure there must be existing implementations of
> clients or servers using UTF-16 and of clients or servers using real
> UTF-8.) So, the proposal requires not only two new encoding forms; in
> addition, the following are also necessary:
>
> - A way to communicate between client and server what binary sorting
> assumptions are being made.
> - Both the client and server *still* need to be able to handle the
> situation in which one is using UTF-16 and the other is using true UTF-8.
>
> So, the proposed solution (following the spirit of the proposal) doesn't
> eliminate the problem.
>
> To summarise:
>
> - Oracle et al want UTC to sanction two new encoding forms.
>
> - These encoding forms would supposedly have some kind of ambiguous,
> quasi-official status.
>
> - Making the proposal accomplish anything in open systems really requires
> that the encoding forms have official standard status.
>
> - Even so, the proposal does not eliminate the problem that it is supposed
> to be addressing.
>
> - The problem as stated (assuming Premise B) cannot be eliminated in open
> systems without taking very drastic and impossible measures.
>
> - The problem can be solved in closed systems without needing new encoding
> forms sanctioned by UTC.
>
> The whole basis of the problem hinges on Premise B. If we maintain Premise
> B, then we end up with a situation that can in principle only be solved in
> closed systems and, as such, doesn't require any new UTC-sanctioned encoding
> forms (with whatever status). The attempt to solve the problem does not, in
> fact, eliminate the problem, and gives us new encoding forms to worry
> about.
>
> The alternative is to reject premise B. That seems to me to be *a whole
> lot* cleaner and easier.
>
> The main point seems to be that Oracle et al want to maintain Premise B,
> presumably because they think it would be easier. Yet I think I've shown
> that it doesn't, both because it creates new encoding forms to deal with,
> and because we still have to deal with the reality that the original
> encoding forms still exist. Now, I'm not a database developer, so I need to
> be careful since I can't presume to know the particular implementation
> needs of such environments. But it seems to me that we've lived without
> Premise B in the past, and that it won't benefit us to adopt it now. Why
> bother with it? Why not continue doing what we already know how to do?
>
> The only possible answer I can think of is out of concern that, in the case
> of Unicode, some implementers may *assume* premise B to be true. Our
> options, therefore, are twofold: to make Premise B in fact true -- but
> we've seen that that's the harder road and doesn't benefit us after all --
> or to make people understand that Premise B is false. People already need
> to learn about Unicode in order to implement it; why can't they also learn
> that Premise B is false? (This seems too easy; I must be missing
> something.)
>
> - Peter
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT