RE: PDUTR #26 posted

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 17 2001 - 15:01:44 EDT


Mark,

> - Just because it is in IANA does *not* mean that everyone will support
> it. There are many encodings in IANA supported by very few people. Nor
> does it mean that it is intended for widespread public use. The IANA
> registry is also used as a general purpose registry, even for encodings
> that have limited or restricted use.

True, but even if it does not have widespread use, it is a PUBLIC character set and is intended for some public communications.

>
> - A significant reason for CESU-8 garnering enough support was that its
> introduction allows the definition of UTF-8 itself to be tightened, to
> formally exclude the 3-byte surrogates both in reading and writing.

I do not understand your point.

From TR27:

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters. ISO/IEC 10646 does not allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it does allow other noncharacters)."

CESU-8 is currently a non-compliant UTF-8 variant that is illegal to use in Unicode 3.1 compliant software. A user who does not upgrade their UCS-2 software can still be compliant with older versions of Unicode, which do not support the non-BMP characters.
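To make the difference concrete, here is a minimal Python sketch of the two byte forms for one supplementary character. The helper name `cesu8_encode` is made up for illustration and is not taken from any library; it simply encodes each UTF-16 surrogate as its own three-byte sequence, which is my reading of the CESU-8 proposal. The sample character U+10400 is an arbitrary choice.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text the way CESU-8 is described,
    i.e. supplementary characters become two 3-byte surrogate sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8 (1 to 3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as if it were a BMP character.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U00010400"                 # a character outside the BMP
utf8_bytes = sample.encode("utf-8")   # f0 90 90 80 (one 4-byte sequence)
cesu8_bytes = cesu8_encode(sample)    # ed a0 81 ed b0 80 (two 3-byte sequences)

print(utf8_bytes.hex(" "))
print(cesu8_bytes.hex(" "))

# A strict UTF-8 decoder refuses the CESU-8 form.
try:
    cesu8_bytes.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
print("strict UTF-8 decoder accepts CESU-8 bytes:", strict_decoder_accepts)
```

The same text therefore has two different byte forms, and a decoder that follows the tightened UTF-8 definition rejects one of them.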

If you accept CESU-8 then you are providing two divergent and incompatible standards.

If a company does not use a private protocol outside of their own software then they can do anything that they want. There is no need for Unicode to do anything. The only reason that you might get involved is that different companies will use this standard and all have to implement the protocol in the same way. This by definition is a public standard.

I suspect that the only reason that the committee has not rejected the proposal out of hand is that they acknowledge that there is a problem. I suspect that PeopleSoft is not the only company with this problem.

I feel that we need to do two things: help people migrate, and end up with a single compatible standard.

First, I think that we need to promote code point ordering support in applications that may do UTF transforms. We need to disseminate code like Markus's code point order routines. Because I support dynamic Unicode transforms in xIUA, I use code point ordering as the default, either as supplied by ICU or using my own implementations derived from ICU code.
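As a rough illustration of why this matters, the Python sketch below sorts the same strings by raw UTF-16 code units and by code point. The function names are hypothetical, and the fix-up shown (moving surrogates above the rest of the BMP before comparing) is a commonly described technique; I am not claiming it is the exact routine in ICU or xIUA.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit: int) -> int:
    """Remap one code unit so that surrogates sort after U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800    # E000..FFFF moves down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # D800..DFFF moves up to F800..FFFF
    return unit


def code_point_order_key(text: str) -> list[int]:
    """Sort key: UTF-16 code units, adjusted to give code point order."""
    return [fixup(u) for u in utf16_units(text)]


words = ["\U00010400", "\uFF5E", "a"]

by_code_unit = sorted(words, key=utf16_units)            # plain UTF-16 binary order
by_code_point = sorted(words, key=code_point_order_key)  # UTF-16 with fix-up
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8")) # UTF-8 binary order

print([hex(ord(w)) for w in by_code_unit])
print([hex(ord(w)) for w in by_code_point])
print(by_code_point == by_utf8)
```

Without the fix-up, U+10400 sorts before U+FF5E in UTF-16 because its lead surrogate 0xD801 is smaller than 0xFF5E. With the fix-up, the UTF-16 order agrees with the UTF-8 binary order, so data keeps the same sort order across transforms.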

Second, because the problem is that many systems still do not fully support the planes beyond the BMP, we could amend the UCS-2 character set to treat the surrogate range as noncharacters and exclude it. We could then amend CESU-8 to exclude surrogates as well. It would become a subset of UTF-8 (1 to 3 byte sequences only) that would work for BMP characters only. By declaring a CESU-8 or UCS-2 character set, you would warn any process that communicates with your application that you only support BMP characters. This would be a very useful public standard.
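A receiver of such a BMP-only subset could check its input along these lines. This is only a sketch of the idea, not a specification: the function name `decode_bmp_only` is hypothetical, and it relies on Python's strict UTF-8 decoder to reject encoded surrogates and malformed sequences.

```python
def decode_bmp_only(data: bytes) -> str:
    """Decode bytes that must be valid UTF-8 using only 1 to 3 byte
    sequences, that is, BMP characters with no surrogates."""
    # Strict decoding raises UnicodeDecodeError for malformed input,
    # including 3-byte encodings of surrogates.
    text = data.decode("utf-8")
    for ch in text:
        if ord(ch) > 0xFFFF:
            # A 4-byte sequence: valid UTF-8, but outside the BMP subset.
            raise ValueError(f"non-BMP character U+{ord(ch):04X} not allowed")
    return text


print(decode_bmp_only("h\u00e9llo \u20ac".encode("utf-8")))

for bad in ("\U00010400".encode("utf-8"),        # 4-byte UTF-8 sequence
            bytes.fromhex("eda081edb080")):      # surrogate pair as 3-byte sequences
    try:
        decode_bmp_only(bad)
    except ValueError as err:   # UnicodeDecodeError is a subclass of ValueError
        print("rejected:", type(err).__name__)
```

Anything that passes this check is also valid UTF-8, which is the point of making the restricted form a strict subset.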

Carl
 


